# Distributed Planning for Serving Cooperative Tasks with Time Windows: A Game Theoretic Approach

We study distributed planning for multi-robot systems to provide optimal service to cooperative tasks that are distributed over space and time. Each task requires service by sufficiently many robots at the specified location within the specified time window. Tasks arrive over episodes and the robots try to maximize the total value of service in each episode by planning their own trajectories based on the specifications of incoming tasks. Robots are required to start and end each episode at their assigned stations in the environment. We present a game theoretic solution to this problem by mapping it to a game, where the action of each robot is its trajectory in an episode, and using a suitable learning algorithm to obtain optimal joint plans in a distributed manner. We present a systematic way to design minimal action sets (subsets of feasible trajectories) for robots based on the specifications of incoming tasks to facilitate fast learning. We then provide the performance guarantees for the cases where all the robots follow a best response or noisy best response algorithm to iteratively plan their trajectories. While the best response algorithm leads to a Nash equilibrium, the noisy best response algorithm leads to globally optimal joint plans with high probability. We show that the proposed game can in general have arbitrarily poor Nash equilibria, which makes the noisy best response algorithm preferable unless the task specifications are known to have some special structure. We also describe a family of special cases where all the equilibria are guaranteed to have bounded suboptimality. Simulations and experimental results are provided to demonstrate the proposed approach.


## 1 Introduction

Multi-robot systems have proven to be effective in various applications such as precision agriculture, environmental monitoring, surveillance, search and rescue, manufacturing, and warehouse automation (e.g., Gonzalez-de-Santos et al. (2017); Seyedi et al. (2019); Li et al. (2019); Kapoutsis et al. (2017); Gombolay et al. (2018); Claes et al. (2017)). In many of these applications, the robots need to serve some cooperative tasks that arrive at certain locations during specific time windows. One major requirement for achieving the optimal team performance in such scenarios is to have properly coordinated plans (trajectories) so that sufficiently many robots are present at the right locations and times.

This paper proposes a game theoretic solution to the distributed task execution (DTE) problem by designing a corresponding game and utilizing game theoretic learning to drive the robots to joint plans that maximize the global objective function, i.e., the total value of service provided to the tasks. Similar game theoretic formulations were presented in the literature to achieve coordination in problems such as vehicle-target assignment (e.g., Arslan et al. (2007)), coverage optimization (e.g., Yazıcıoğlu et al. (2013, 2017); Zhu and Martínez (2013)), and dynamic vehicle routing (e.g., Arsie et al. (2009)). In our proposed method, we map the DTE problem to a game where the action of each robot is defined as its trajectory (plan) in an episode. We show that some feasible trajectories can never contribute to the global objective in this setting, irrespective of the trajectories of other robots. For any given set of tasks, by excluding such inferior trajectories, we obtain a game with a minimal action space that not only contains globally optimal joint plans but also facilitates fast learning. We then provide the performance guarantees for the cases where all the robots follow a best response or noisy best response algorithm to iteratively plan their trajectories in a distributed manner. While the best response algorithm leads to a Nash equilibrium, the noisy best response algorithm leads to globally optimal joint plans with high probability when the noise is small. We then show that the resulting game can in general have arbitrarily poor Nash equilibria, which makes the noisy best response algorithm preferable unless the task specifications have some special structure. We also describe a family of special cases where all the Nash equilibria are guaranteed to be near-optimal and the best response algorithm may be used to monotonically improve the joint plan and reach a near-optimal solution. Finally, we present simulations and experiments to demonstrate the proposed approach.

This paper is a significant extension of our preliminary work in Bhat et al. (2019) with the following main differences: 1) We extend the problem formulation to accommodate a broader range of tasks compared to Bhat et al. (2019), which only considered tasks that can be completed in one time step when there are sufficiently many robots. In this modified setting, the tasks are also allowed to change over time and their specifications are available to the robots before each episode. 2) We facilitate faster learning by designing significantly smaller action spaces based on the specifications of incoming tasks. We use different learning algorithms, define the information needed by each agent to follow these algorithms, and provide a price of anarchy analysis. 3) We provide new theoretical results, numerical simulations, and experiments on a team of drones.

The organization of this paper is as follows: Section 2 presents the DTE problem. Section 3 provides some game theory preliminaries. Section 4 presents the game design. Section 5 is on the learning dynamics and performance guarantees. Simulation results are presented in Section 6. Section 7 presents the experiments on a team of quadrotors. Finally, Section 8 concludes the paper.

## 2 Problem Formulation

This section presents the distributed task execution (DTE) problem, where a homogeneous team of n mobile robots, R={r1,r2,…,rn}, need to plan their trajectories in each episode to optimally serve the incoming cooperative tasks with time windows.

### 2.1 Notation

We use ℤ (or ℤ+) to denote the set of integers (or positive integers) and ℝ (or ℝ+) to denote the set of real (or positive real) numbers. For any pair of vectors x,y∈ℝn, we use x≤y (or x<y) to denote the element-wise inequalities, i.e., xi≤yi (or xi<yi) for all i∈{1,…,n}.

### 2.2 Formulation

We consider a discretized environment represented as a 2D grid,

 P={1,2,…,¯x}×{1,2,…,¯y}, (1)

where ¯x and ¯y denote the number of cells along the corresponding directions. In this environment, some cells may be occupied by static obstacles and the robots are free to move over the feasible cells PF⊆P. There are stations located at a subset of the feasible cells. Robots recharge and get ready for the next episode at their assigned stations. Each robot ri is assigned to a specific station located at σi∈PF (multiple robots can be assigned to the same station), where its trajectory must start and end in each episode.

Each episode consists of T time steps and the trajectory of each robot ri over an episode is denoted as pi=(p0i,p1i,…,pTi). The robots can move to any of the feasible neighboring cells within one time step. Accordingly, when a robot is at some cell p=(x,y), at the next time step it has to be within p’s neighborhood on the grid, N(p), which is given as

 N(p)={(x′,y′)∈PF∣|x′−x|≤1,|y′−y|≤1}. (2)

For any ri∈R, the set of feasible trajectories, Pi, is

 Pi={pi∣p0i=pTi=σi, pt+1i∈N(pti), ∀t∈{0,1,…,T−1}}, (3)

where σi∈PF is the location that contains the station of robot ri. As per (3), a trajectory is feasible if it satisfies two conditions: 1) it starts and ends at the assigned station σi, and 2) each position along the trajectory is in the neighborhood of the preceding position. The Cartesian product of the sets of feasible trajectories is denoted as P=P1×P2×⋯×Pn. A sample environment with some obstacles and three stations is illustrated in Fig. 1.
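The feasibility conditions in (2) and (3) are easy to realize in code. The sketch below (the grid size, obstacle set, and station are illustrative assumptions, not the paper's Fig. 1) enumerates Pi by depth-first search; it is only practical for small horizons T, which is exactly the motivation for the reduced action sets of Section 4.

```python
from itertools import product

# Assumed environment: a 5x5 grid with one obstacle cell.
X_BAR, Y_BAR = 5, 5
OBSTACLES = {(3, 2)}
FEASIBLE = {c for c in product(range(1, X_BAR + 1), range(1, Y_BAR + 1))
            if c not in OBSTACLES}

def neighbors(p):
    """N(p) of eq. (2): feasible cells within one step, with diagonal
    moves and staying in place allowed."""
    x, y = p
    return [(u, v) for (u, v) in FEASIBLE
            if abs(u - x) <= 1 and abs(v - y) <= 1]

def feasible_trajectories(station, T):
    """P_i of eq. (3): all length-(T+1) trajectories that start and end
    at the assigned station and move only between neighboring cells."""
    out = []
    def extend(traj):
        if len(traj) == T + 1:
            if traj[-1] == station:
                out.append(tuple(traj))
            return
        for q in neighbors(traj[-1]):
            extend(traj + [q])
    extend([station])
    return out
```

For T=2 and a corner station, this yields one out-and-back trajectory per cell in the station's neighborhood (including the stay-at-home trajectory); the exponential growth in T is immediate from the branching over N(p).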

A new set of tasks, τ, is received in each episode. Each task is defined as a tuple, τi=(ℓi,tai,tdi,vi), where ℓi∈PF is the location, tai<tdi are the arrival and departure times (time window), and the value function vi is a mapping from the numbers of robots serving the task during the time window to the resulting value. In order to serve a task, a robot rj should spend at least one time step at that location, i.e., ptj=pt+1j=ℓi, during the time window. We refer to each repetition of position, pti=pt+1i, along a trajectory as a stay. In this setting, multiple tasks can arrive at the same location in an episode, but we assume (this assumption is only made to simplify the notation and presentation in our derivations; in Section 4.4, we discuss how it can be lifted to use our proposed approach in cases where multiple tasks are simultaneously active at the same location) that the time windows of such tasks with identical locations do not overlap, i.e.,

 ℓi=ℓj⇒min{tdi,tdj}≤max{tai,taj},∀i≠j. (4)

Accordingly, the number of robots staying at each location uniquely determines their impact on performance since they all serve the same task (if there is one). For any τi∈τ and t, we use ci(p,t) to denote the number of robots that stay at the location of the task, ℓi, from time t to t+1, i.e.,

 ci(p,t)=∣∣{rj∈R∣ptj=pt+1j=ℓi}∣∣. (5)

We use ci(p) to denote the vector of counter values during the time window of the task τi, i.e.,

 ci(p)=[ci(p,tai),…,ci(p,tdi−1)]T, (6)

and the resulting value from the task τi is vi(ci(p))∈[0,¯vi], where ¯vi is the maximum value that can be obtained from the task (e.g., when the task is completed as desired). To accommodate various types of cooperative tasks, we do not make any assumptions on the value functions except for the mild assumption that having more robots can never hurt the outcome, i.e.,

 ci(p)≥ci(p′)⇒vi(ci(p))≥vi(ci(p′)). (7)

In the remainder of the paper, we will say that task τi is completed if it yields the maximum value ¯vi. While we will use tasks with binary value functions (0 or ¯vi) in our examples for simplicity, our methods are applicable to tasks with more generic value functions with higher resolution. Under this model, the robots are assumed to be capable of achieving the required low-level coordination for each task (e.g., moving an object together). Accordingly, each task is completed if it is served by sufficiently many robots during the corresponding time window. We quantify the performance in an episode via the total value from tasks, i.e.,

 f(p)=∑τi∈τvi(ci(p)), (8)

where p=(p1,…,pn)∈P denotes the trajectories of all robots. In each episode, the robots plan their own trajectories to maximize (8). Such a distributed coordination problem can be solved by utilizing methods from machine learning, optimization, or game theory (e.g., Bu et al. (2008); Boyd et al. (2011); Marden et al. (2009)). In this paper, we will study this problem from a game theoretic perspective.
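The scoring pipeline of (5)-(8) amounts to counting stays inside each task's window. A minimal sketch follows; the task encoding as a (location, ta, td, value-function) tuple and the example data are assumptions, not the paper's notation:

```python
def counters(task, trajectories):
    """c_i(p) of eqs. (5)-(6): at each t in [ta, td), the number of robots
    whose trajectory repeats the task location from t to t+1."""
    loc, ta, td, _ = task
    return [sum(1 for p in trajectories if p[t] == p[t + 1] == loc)
            for t in range(ta, td)]

def total_value(tasks, trajectories):
    """f(p) of eq. (8): sum of task values under the joint plan."""
    return sum(task[3](counters(task, trajectories)) for task in tasks)

# Assumed example: a binary task at (3,3), window [1,4), needing two
# robots simultaneously for one time step.
task = ((3, 3), 1, 4, lambda c: 1 if max(c) >= 2 else 0)
p1 = [(2, 2), (3, 3), (3, 3), (2, 2), (2, 2)]
p2 = [(2, 2), (3, 3), (3, 3), (2, 2), (2, 2)]
print(total_value([task], [p1, p2]))   # -> 1
```

With only one of the two trajectories present, the counter never reaches two and the task yields zero, illustrating the monotonicity assumption (7).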

## 3 Game Theory Preliminaries

A finite strategic game is defined by three elements: (1) the set of agents (players) I={1,2,…,n}, (2) the action space A=A1×A2×⋯×An, where each Ai is the action set of agent i, and (3) the set of utility functions U={U1,…,Un}, where each Ui:A→ℝ is a mapping from the action space to the set of real numbers. Any action profile a∈A can be represented as a=(ai,a−i), where ai∈Ai is the action of agent i and a−i denotes the actions of all other agents. An action profile a∗∈A is a Nash equilibrium if no agent can increase its own utility by unilaterally changing its action, i.e.,

 Ui(a∗i,a∗−i)=maxai∈AiUi(ai,a∗−i),∀i∈I. (9)

A game is called a potential game if there exists a function, ϕ:A→ℝ, such that for each player i∈I, for every ai,a′i∈Ai, and for all a−i,

 Ui(a′i,a−i)−Ui(ai,a−i)=ϕ(a′i,a−i)−ϕ(ai,a−i). (10)

Accordingly, whenever an agent unilaterally changes its action in a potential game, the resulting change in its own utility equals the resulting change in ϕ, which is called the potential function of the game.

In game theoretic learning, the agents start with arbitrary initial actions and follow a learning algorithm to update their actions in a repetitive play of the game. At each round k, each agent i plays an action ai(k) and receives the utility Ui(a(k)). In general, a learning algorithm maps the observations of an agent from the previous rounds to its action in round k. In this paper, we will only consider algorithms with a single-stage memory, where the action in round k depends only on the observation in round k−1, to achieve the desired performance. For potential games, one such learning algorithm that achieves almost sure convergence to a Nash equilibrium is the best-response (BR) (e.g., see Young (2004) and the references therein). While any global maximizer of the potential function is necessarily a Nash equilibrium, potential games may also have suboptimal Nash equilibria. For any potential game with the set of Nash equilibria A∗⊆A, the comparison of the worst and the best Nash equilibria can be achieved through the measure known as the price of anarchy (PoA), i.e., PoA=maxa∈Aϕ(a)/mina∗∈A∗ϕ(a∗). For potential games with high PoA, noisy best-response algorithms such as log-linear learning (LLL) Blume (1993) can be used to have the agents spend most of their time at the global maximizers of ϕ. More specifically, LLL induces an irreducible and aperiodic Markov chain over the action space such that the limiting distribution, μϵ, satisfies

 limϵ→0+μϵ(a)>0⟺ϕ(a)≥ϕ(a′),∀a′∈A. (11)

Based on (11), as the noise parameter of LLL, ϵ, goes down to zero (as LLL becomes similar to BR), only the action profiles that globally maximize ϕ maintain a non-zero probability in the resulting limiting distribution μϵ.

## 4 Game Design

In this section, we map the DTE problem to a potential game, ΓDTE, whose potential function is equal to (8). Once such a game is designed by defining the action sets and the utility functions, learning algorithms such as BR or LLL can be used to reach the desired joint plans in a distributed manner.

### 4.1 Action Space Design

The impact of each agent on the overall objective in (8) is determined only by its trajectory. Accordingly, one possible way to design the action space is to define each action set as Ai=Pi, which contains all the feasible trajectories. However, the number of feasible trajectories, |Pi|, grows exponentially with the episode length T, and the learning process typically gets slower as the agents need to explore a larger number of possibilities. Furthermore, many standard learning algorithms such as BR and LLL require the updating agent in each round to compute all the possible utilities that can be obtained by switching to any of its feasible actions. Hence, a large action set increases not only the number of rounds needed for the convergence of learning but also the computation time required by the updating agents in each round. Motivated by the practical importance of computation times, we aim to design the smallest action sets that can still yield the optimal joint plan.

We design the minimal action sets by excluding a large number of feasible trajectories that can never be essential to the overall performance, regardless of the trajectories taken by the other robots. For example, if a trajectory pi does not serve any task, i.e., there is no τj∈τ such that pti=pt+1i=ℓj for some t∈[taj,tdj), then it is guaranteed that robot ri will not contribute to the global score in (8) when traversing pi since it will not be contributing to any counter in (5). Accordingly, such pi can be removed from the action set without causing any performance loss. Furthermore, if a trajectory pi has all the task-serving stays contained in some other trajectory p′i, then removing pi from the action set (while keeping p′i) would not degrade the overall performance. Accordingly, we define each action set as follows:

 Ai=argmin∅⊂A′i⊆Pi |A′i| s.t. (13), (12)

where the constraint is

 ∀qi∈Pi∖A′i,∃pi∈A′i:pti=pt+1i=qti,∀t∈t∗(qi,τ), (13)

and t∗(qi,τ) is the set of times where qi involves a stay at a task location within the corresponding time window, i.e.,

 t∗(qi,τ)={t∣∃τj∈τ, qti=qt+1i=ℓj, taj≤t<tdj}. (14)

Accordingly, each robot ri’s action set Ai is the smallest non-empty subset of all its feasible trajectories such that for every excluded trajectory qi∈Pi∖Ai, there exists a trajectory pi∈Ai such that any stay in qi within the corresponding active time window is also included in pi, i.e., (13). Note that any qi with no task-serving stays, i.e., t∗(qi,τ)=∅, is trivially removed from the action set as (13) does not impose any restriction on the removal of such qi. Our next result formally shows that this reduced action space does not cause any suboptimality.
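One way to realize (12)-(14) is a greedy filter: compute each trajectory's task-serving stays, then keep a trajectory only if no already-kept trajectory covers all of its stays. This is a sketch under an assumed task encoding, not the paper's implementation; it returns a small covering action set rather than a certified minimizer of (12).

```python
def serving_stays(traj, tasks):
    """t*(q, tau) of eq. (14), kept as (time, location) pairs."""
    stays = set()
    for t in range(len(traj) - 1):
        for (loc, ta, td, _) in tasks:
            if traj[t] == traj[t + 1] == loc and ta <= t < td:
                stays.add((t, loc))
    return stays

def covers(p, q, tasks):
    """Constraint (13): p repeats q's position at every serving stay of q."""
    return all(p[t] == p[t + 1] == loc for (t, loc) in serving_stays(q, tasks))

def reduced_action_set(trajectories, tasks):
    """Greedy sketch of A_i in (12); assumes trajectories is non-empty."""
    kept = []
    for q in sorted(trajectories,
                    key=lambda q: -len(serving_stays(q, tasks))):
        if serving_stays(q, tasks) and not any(covers(p, q, tasks) for p in kept):
            kept.append(q)
    return kept or [trajectories[0]]   # the action set must be non-empty
```

Trajectories with no task-serving stays are dropped immediately, matching the observation that they can never contribute to (8).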

###### Lemma 1

For the sets of feasible trajectories Pi as in (3) and the action sets Ai as in (12), P=P1×⋯×Pn and A=A1×⋯×An satisfy

 maxp∈Pf(p)=maxp∈Af(p). (15)
###### Proof

Since A⊆P,

 maxp∈Af(p)≤maxp∈Pf(p). (16)

Now, let q∈P be a maximizer of f, i.e.,

 f(q)=maxp∈Pf(p). (17)

Due to (13), there exists p∈A such that, for every robot ri, all the stays in qi that take place at a task location within the corresponding time windows are also included in pi. To be more specific, for any ri∈R and τj∈τ, we have

 qti=qt+1i=ℓj, taj≤t<tdj ⟹ pti=pt+1i=ℓj. (18)

Accordingly, any stay in q that may contribute to a counter (see (5) and (6)) is also included in p, which implies cj(p)≥cj(q) for every task τj∈τ. Hence, due to (7) and (8), we have f(p)≥f(q) and, due to (17),

 maxp′∈Af(p′)≥f(p)≥f(q)=maxp′∈Pf(p′). (19)

Consequently, (16) and (19) together imply (15). ∎

### 4.2 Utility Design

We utilize the notion of wonderful life utility Tumer and Wolpert (2004) to design a game whose potential function is the total value in (8). Accordingly, we define the utility of each robot as its marginal contribution to the total value, i.e.,

 Ui(p)=∑τj∈τ[vj(cj(p))−vj(cj(p−i))], (20)

where cj(p−i) is the counter associated with τj that disregards agent ri, i.e.,

 cj(p−i,t)=∣∣{rk∈R∖{ri}∣ptk=pt+1k=ℓj}∣∣. (21)

As per (20), the utility of each robot ri, i.e., Ui(p), is equal to the total value of tasks that are completed under the trajectories p and would not be completed without ri (under the trajectories p−i).
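The wonderful life utility of (20)-(21) is straightforward to compute once the counters are available: evaluate each task's value with and without robot ri. A sketch under the same assumed (location, ta, td, value-function) task encoding as before:

```python
def counters(task, trajectories):
    """c_j(.) of eqs. (5)-(6) restricted to the given trajectories."""
    loc, ta, td, _ = task
    return [sum(1 for p in trajectories if p[t] == p[t + 1] == loc)
            for t in range(ta, td)]

def utility(i, tasks, trajectories):
    """U_i(p) of eq. (20): robot i's marginal contribution, i.e., the
    total value with robot i present minus the value with it removed."""
    without_i = trajectories[:i] + trajectories[i + 1:]
    return sum(task[3](counters(task, trajectories))
               - task[3](counters(task, without_i))
               for task in tasks)
```

For a task that needs two simultaneous robots served by exactly two robots, both receive utility equal to the task value; with a third redundant server, every marginal contribution (and hence every utility) drops to zero, as in Example 1 below.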

###### Lemma 2

Utilities in (20) lead to a potential game whose potential function equals the total value received from the tasks, i.e., ϕ(p)=f(p).

###### Proof

Let pi,p′i∈Pi be two possible trajectories for any robot ri, and let p−i denote the trajectories of all other robots. Using (20), we have

 Ui(pi,p−i)−Ui(p′i,p−i)=∑τj∈τvj(cj(pi,p−i))−∑τj∈τvj(cj(p′i,p−i)), (22)

which, together with (8), implies

 Ui(pi,p−i)−Ui(p′i,p−i)=f(pi,p−i)−f(p′i,p−i). (23)

Consequently, f is the potential function for ΓDTE. ∎

Example 1: Consider the environment in Fig. 1 with 3 robots, two stationed at the station in cell (2,2) and one stationed at the station in cell (4,5). Let T=6, and consider a single task at location ℓ1=(3,3) that can be completed at any time during the episode by moving some boxes as illustrated in Fig. 2. More specifically, the task first requires moving a heavy box, which can be handled by at least 2 robots, and then moving the two light boxes, each of which can be handled by a single robot, to the heavy box’s initial location. Suppose that moving the heavy box to its desired position takes one time step. Similarly, a single robot can move one light box to the desired location in one time step.

Such a task first requires at least 2 robots to serve this location together for one time step, and then the total number of robots serving that location within the remaining time to be at least 2. Suppose that the task yields a value of 1 if successfully completed. Accordingly, this task can be represented with the following specifications: ℓ1=(3,3), ta1=0, td1=6, ¯v1=1, and

 v1(c1(p))={1, if ∃i: [c1(p)]i≥2 and ∑j>i[c1(p)]j≥2; 0, otherwise, (24)

where [c1(p)]i and [c1(p)]j denote the ith and jth entries of the counter vector c1(p). The value function in (24) implies that the task is completed if there exists an index i such that 1) the ith entry of c1(p) is at least two, and 2) the summation of the entries with indices j>i is at least two. Given these task specifications, let the trajectories of the three robots over the episode of six time steps, i.e., p1, p2, p3 for T=6, be as follows:

 p1={(2,2),(3,3),(3,3),(3,3),(3,3),(3,3),(2,2)},
 p2={(2,2),(3,3),(3,3),(3,3),(3,3),(3,3),(2,2)},
 p3={(4,5),(3,4),(3,3),(3,3),(3,3),(3,4),(4,5)}.

In that case, the task is completed since c1(p)=[0,2,3,3,2,0]T as per (6). The task can be completed without r1 or r2 since

 c1(p−1)=c1(p−2)=[0,1,2,2,1,0]T.

Accordingly, r1 and r2 receive the utilities U1(p)=U2(p)=0. Similarly, if r3 is removed from the system, r1 and r2 can still complete the task since c1(p−3)=[0,2,2,2,2,0]T. Hence, r3 also receives a utility of zero, U3(p)=0. As such, although the task is completed in this example, none of the robots would receive a utility since their marginal contributions to the value received are all equal to zero, i.e., the task would still be completed by the remaining two robots if any single robot was removed from the system.
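The counter arithmetic in Example 1 can be verified mechanically. The sketch below hard-codes the three trajectories from the example and assumes the window ta1=0, td1=6 (inferred from the six-entry counter vectors):

```python
p1 = [(2, 2), (3, 3), (3, 3), (3, 3), (3, 3), (3, 3), (2, 2)]
p2 = [(2, 2), (3, 3), (3, 3), (3, 3), (3, 3), (3, 3), (2, 2)]
p3 = [(4, 5), (3, 4), (3, 3), (3, 3), (3, 3), (3, 4), (4, 5)]

def c1(trajs, loc=(3, 3), ta=0, td=6):
    """c_1(.) of eq. (6) for the task of Example 1."""
    return [sum(1 for p in trajs if p[t] == p[t + 1] == loc)
            for t in range(ta, td)]

print(c1([p1, p2, p3]))   # -> [0, 2, 3, 3, 2, 0]  (all three robots)
print(c1([p2, p3]))       # -> [0, 1, 2, 2, 1, 0]  (robot r1 removed)
```

Each reduced counter vector obtained by dropping a single robot still satisfies (24), so every marginal contribution, and hence every utility in (20), is zero.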

### 4.3 Communication and Information Requirements

Both the action set in (12) and the utility in (20) can be computed by each robot based on local information. To be more specific, for each robot ri we first define two sets: 1) the set of reachable tasks (tasks whose location robot ri can serve for at least one time step before returning to its station within the T-step episode),

 τilocal={τj∈τ∣dist(ℓj,σi)<tdj, max{dist(ℓj,σi),taj}+dist(ℓj,σi)+1≤T}, (25)

where dist(ℓj,σi) denotes the minimum number of transitions (allowing diagonal transitions) needed to reach the task location ℓj from the station σi, and 2) the set of robots who have a common reachable task with ri, i.e.,

 Rilocal={rj∈R∣τilocal∩τjlocal≠∅}. (26)

Accordingly, if each robot ri has the following information: 1) the specifications of each task τj∈τilocal, and 2) the trajectory of each robot rj∈Rilocal, then each robot can compute its own utility in (20). Furthermore, Ui(p) does not depend on the trajectories of the robots outside Rilocal since any task τj∉τilocal can never be served under any feasible trajectory of ri. Hence, such local information is also sufficient for each robot ri to compute its action set Ai as per (12). We assume that each robot ri is able to obtain the specifications of tasks in τilocal and the trajectories of robots in Rilocal through local communications.

In Section 5, we present a learning process where the robots repetitively play ΓDTE and, at each round, a randomly picked robot updates its trajectory based on the utilities it can obtain from different trajectories. Accordingly, first, all the robots are given the specifications of their reachable tasks and they broadcast their initial trajectories to their neighbors at the beginning of the learning process. Then, at each round, only the updating agent needs to broadcast its new trajectory. In such a learning process, each robot ri would need to communicate, by either broadcasting its updated trajectory or receiving an update from another robot in Rilocal, in approximately |Rilocal|/n of the rounds, which defines the approximate communication load of the learning process on each robot.

### 4.4 Tasks with Identical Locations and Overlapping Time Windows

Our derivations so far were based on the assumption that tasks arriving at the same location do not have overlapping time windows. This assumption was made just to simplify the notation and define the action of each robot as its trajectory. In this subsection, we show how this assumption can be easily lifted to use our proposed approach when multiple tasks may be active at the same location. In particular, this extension is achieved with minor modifications to the action sets in (12) and the counters in (6). We denote these modified versions as A+i and c+i, which are defined below.

Once each Ai is generated according to (12), A+i can be obtained from Ai by adding a second dimension to the actions. This second dimension is used to distinguish between the cases where different tasks are served under the same trajectory. Given a set of tasks τ, each action in A+i consists of a trajectory pi∈Ai and an additional sequence zi that indicates which task is being served by ri at each time t (e.g., zti=0 if no task is served by ri at time t). More specifically, let θ(pi,τ,t) denote the tasks in τ that can be served at time t by a robot following the trajectory pi, i.e., the set of tasks τj such that pi stays at the location of the task at time t and the task is active at time t (taj≤t<tdj):

 θ(pi,τ,t)={τj∈τ∣pti=pt+1i=ℓj, taj≤t<tdj}. (27)

Accordingly, the action set is defined as

 A+i={(pi,zi)∣pi∈Ai,zti=0 if θ(pi,τ,t)=∅,zti∈θ(pi,τ,t) if θ(pi,τ,t)≠∅}, (28)

where each action (pi,zi)∈A+i consists of a trajectory pi and a sequence zi that explicitly states the tasks ri plans to serve while taking the trajectory pi. As such, |A+i|≥|Ai|, and A+i is obtained by minimally increasing the size of the action set so that each action uniquely identifies the service provided by each robot. Furthermore, |A+i|=|Ai| when τ contains no tasks with identical locations and overlapping time windows.

Example 2: Consider the environment in Fig. 1 with a single robot r1, and consider two tasks τ1 and τ2 with identical locations, ℓ1=ℓ2, and overlapping time windows with ta1<ta2. In this example, suppose that (12) results in A1 consisting of a single trajectory: going to ℓ1=ℓ2 and staying there until returning to the station at the end of the episode. Along this trajectory, it is clear that the robot is serving τ1 during its stay before ta2 since τ2 has not arrived yet. However, this trajectory does not uniquely describe which task is served once both tasks are active. By growing the action set as in (28), we obtain an action set A+1 containing two actions, one for each task the robot may commit to while both tasks are active. Note that any choice from A+1 uniquely determines the service provided by the robot.
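The augmentation in (27)-(28) can be sketched as a product over per-time-step choices. The encoding below (task indices 1, 2 standing in for τ1, τ2, with 0 meaning "no task served") is an assumed stand-in for Example 2's setting: one trajectory and two overlapping tasks at the same cell.

```python
from itertools import product

def active_tasks(traj, tasks, t):
    """theta(p_i, tau, t) of eq. (27): indices of the tasks servable by
    a stay of traj at time t."""
    return [j for j, (loc, ta, td) in enumerate(tasks, start=1)
            if traj[t] == traj[t + 1] == loc and ta <= t < td]

def extended_actions(traj, tasks):
    """A_i^+ entries of eq. (28) for one trajectory: every commitment
    sequence z with z_t = 0 whenever no task can be served at t."""
    choices = [active_tasks(traj, tasks, t) or [0]
               for t in range(len(traj) - 1)]
    return [(tuple(traj), z) for z in product(*choices)]

# Two tasks at (2,2) with overlapping windows [0,2) and [1,4).
tasks = [((2, 2), 0, 2), ((2, 2), 1, 4)]
traj = [(1, 1), (2, 2), (2, 2), (2, 2), (1, 1)]
for a in extended_actions(traj, tasks):
    print(a[1])   # -> (0, 1, 2, 0) then (0, 2, 2, 0)
```

The single trajectory expands into exactly two actions here, one per choice of the task committed to while both windows are active, mirroring the growth of A1 into A+1 in Example 2.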

In addition to extending the action sets as , we also need to make a minor modification to the definition of the counters associated with the tasks. More specifically, we replace (5) and (6) with

 c+i(p+,t)=∣∣{rj∈R∣ptj=pt+1j=ℓi,ztj=τi}∣∣, (29)
 c+i(p+)=[c+i(p+,tai),…,c+i(p+,tdi−1)]T, (30)

where p+ is the action profile in the modified action space. Accordingly, each robot rj contributes to the counter of a task τi if it stays at the corresponding location during the corresponding time window and commits to serving τi as per ztj=τi. By using these modifications to the action sets and the counters and defining the utilities accordingly as per (20), i.e.,

 Ui(p+)=∑τj∈τ[vj(c+j(p+))−vj(c+j(p+−i))], (31)

we complete the design of the game for the cases where multiple tasks may be simultaneously active at the same location. In the remainder of the paper, we will continue discussing our derivations in the setting where the action of each robot is defined as its trajectory (tasks with identical locations do not have overlapping time windows). However, all of our results can be easily extended to the generalized case by using the modified game design presented here.

## 5 Learning Dynamics

Once the specifications of tasks in the upcoming episode are provided to the robots and each robot computes its action set as in (12), various learning algorithms can be used by the robots in a repetitive play of ΓDTE to optimize (8) in a distributed manner. We consider two conventional learning algorithms with different performance guarantees for potential games: Best-Response (BR), which ensures convergence to a Nash equilibrium, and Log-Linear Learning (LLL), which ensures the stochastic stability of joint plans that maximize the potential function (e.g., see Blume (1993); Young (2004) and the references therein). In this setting, the learning algorithm serves as a distributed optimization protocol where each robot updates its intended plan based on the specifications of the tasks in τilocal, which is defined in (25), and the plans of the other robots in Rilocal, which is defined in (26). Under these algorithms, a random agent is selected to make a unilateral update in each round, and that agent plays a best response or a noisy best response (log-linear) to the recent actions of the other agents. The selection of a random agent at each round can be achieved in a distributed manner without global coordination, for instance by using the asynchronous time model in Boyd et al. (2006). In the best-response algorithm, the updating agent picks a maximizer of its utility function (assuming the actions of others will stay the same) as its next action (picks the current action if it is already a maximizer). In log-linear learning, the updating agent randomizes the next action over the whole action set with probabilities determined by the corresponding utilities (similar to the softmax function). Accordingly, the agent assigns much higher probabilities to the actions that would yield higher utility. Both algorithms are formally described below.

 Best Response (BR)
 1: initialization: k=0, arbitrary p(0)∈A.
 2: repeat
 3:   Pick a random agent ri∈R.
 4:   Compute BR(p−i(k))=argmaxpi∈Ai Ui(pi,p−i(k)).
 5:   pi(k+1)={pi(k), if pi(k)∈BR(p−i(k)); a random element of BR(p−i(k)), otherwise.
 6:   p−i(k+1)=p−i(k).
 7:   k=k+1.
 8: end repeat

 Log-Linear Learning (LLL)
 1: initialization: k=0, arbitrary p(0)∈A, small ϵ>0.
 2: repeat
 3:   Pick a random agent ri∈R.
 4:   Randomize the next action of ri: Pr[pi(k+1)=pi]∝exp(Ui(pi,p−i(k))/ϵ), ∀pi∈Ai.
 5:   p−i(k+1)=p−i(k).
 6:   k=k+1.
 7: end repeat
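The two update rules above can be written generically against any action sets and utility function. The sketch below assumes a utility callback U(a, i, profile) returning agent i's utility for playing a while the others keep profile; it is not tied to the trajectory game.

```python
import math
import random

def br_step(i, actions, profile, U):
    """One BR round (lines 4-5 of the BR listing): agent i moves to a
    utility-maximizing action, keeping its current action on ties."""
    scores = {a: U(a, i, profile) for a in actions[i]}
    best = max(scores.values())
    argmax = [a for a, s in scores.items() if s == best]
    if profile[i] not in argmax:
        profile[i] = random.choice(argmax)

def lll_step(i, actions, profile, U, eps=0.1):
    """One LLL round (line 4 of the LLL listing): a softmax choice over
    the whole action set with weights exp(U/eps); a small eps
    concentrates the choice on best responses."""
    weights = [math.exp(U(a, i, profile) / eps) for a in actions[i]]
    profile[i] = random.choices(actions[i], weights=weights)[0]
```

In a two-agent coordination game where matching the other agent pays 1, a single br_step already aligns the profile; lll_step does the same with probability approaching one as eps shrinks.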

Our next results provide the formal guarantees on the evolution of the global score in (8) when robots follow BR or LLL in a repeated play of ΓDTE. In particular, we first show that if all robots follow BR, then the value of completed tasks converges to a value within a factor of PoA(ΓDTE) of the maximum possible value with probability one as the number of rounds, k, goes to infinity.

###### Theorem 5.1

Let ΓDTE be designed as per (12) and (20). If all robots follow BR in a repeated play of ΓDTE, then with probability one

 limk→∞f(p(k))≥maxq∈Pf(q)/PoA(ΓDTE). (32)
###### Proof

Since ΓDTE is a potential game, the best response dynamics achieve convergence to a Nash equilibrium with probability one (e.g., see Young (2004) and the references therein). From the definition of PoA, this implies that with probability one

 limk→∞f(p(k))≥maxq∈Af(q)/PoA(ΓDTE). (33)

Using (33) together with (15), we obtain (32). ∎

Our next result shows that if all robots follow LLL with an arbitrarily small noise parameter ϵ, then the probability of obtaining trajectories that maximize the total value becomes arbitrarily close to one as the number of rounds, k, goes to infinity.

###### Theorem 5.2

Let ΓDTE be designed as per (12) and (20). If all robots follow log-linear learning (LLL) in a repeated play of ΓDTE, then

 limϵ→0+limk→∞Pr[f(p(k))=maxq∈Pf(q)]=1. (34)
###### Proof

Since ΓDTE is a potential game with the potential function f, LLL induces an irreducible and aperiodic Markov chain with the limiting distribution μϵ over A (e.g., see Young (2004) and the references therein) such that as ϵ (the noise parameter of LLL) goes down to zero, only the global maximizers of f maintain a non-zero probability in μϵ, i.e.,

 limϵ→0+μϵ(q)>0⟺f(q)=maxq′∈Af(q′). (35)

Accordingly, the trajectories at the kth round of learning, p(k), satisfy

 limϵ→0+limk→∞Pr[p(k)=q]>0⟺f(q)=maxq′∈Af(q′), (36)

which implies

 limϵ→0+limk→∞Pr[f(p(k))=maxq∈Af(q)]=1. (37)

Using (37) together with (15), we obtain (34). ∎

Based on Theorems 5.1 and 5.2, both BR and LLL provide guarantees on the trajectories as the number of rounds, k, goes to infinity. In practice, there would be a finite amount of time for planning the trajectories before each episode in the DTE problem. Accordingly, our proposed solution is to have the robots update their trajectories via learning in ΓDTE over a finite number of rounds (available time between episodes) and then dispatch according to the resulting trajectories. When the learning horizon is sufficiently long, the performance induced by BR or LLL would be close to the respective limiting behavior. More specifically, for sufficiently large k: 1) f(p(k)) is within a factor of PoA(ΓDTE) of the maximum possible value with a high probability under BR, and 2) f(p(k)) equals the maximum possible value with a high probability when LLL is executed with a sufficiently small noise parameter ϵ.

In light of Theorems 5.1 and 5.2, a major consideration in choosing the learning algorithm is the price of anarchy (PoA). If all Nash equilibria yield a reasonably good $f(p)$, then best-response type algorithms can achieve the desired performance. Such an approach has the benefit of a monotonic increase in the global objective in (8) as the robots update their plans, i.e., $f(p(k+1)) \ge f(p(k))$ for all $k$. On the other hand, if some Nash equilibria are highly suboptimal, noisy best-response type algorithms such as LLL can be used to ensure that the learning process does not converge to an undesirable Nash equilibrium; while $f(p(k))$ does not increase monotonically under the resulting learning process, a global optimum of $f$ is observed most of the time in the long run as the robots keep updating their plans. We continue our analysis by investigating $\mathrm{PoA}(\Gamma_{DTE})$.
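The monotone improvement of $f$ under BR, and its convergence to a Nash equilibrium, can be illustrated on a toy potential game. The sketch below assumes utilities of the marginal-contribution form (consistent with the essentiality argument used later in the proof of Theorem 5.3), under which a player's best response coincides with maximizing the potential over its own action; all sizes and values are illustrative:

```python
import itertools
import random

random.seed(0)

# Toy exact potential game: 3 players, 3 actions each, with a random potential
# f over joint profiles. With marginal-contribution utilities, a player's best
# response maximizes f over its own action, so BR updates never decrease f.
ACTIONS = (0, 1, 2)
f = {p: random.random() for p in itertools.product(ACTIONS, repeat=3)}

def best_response(p, i):
    """Best action of player i against p_{-i} (keeps current action on ties)."""
    best = p[i]
    for a in ACTIONS:
        if f[p[:i] + (a,) + p[i + 1:]] > f[p[:i] + (best,) + p[i + 1:]]:
            best = a
    return best

p = (0, 0, 0)
values = [f[p]]
changed = True
while changed:  # round-robin BR; terminates since every change increases f
    changed = False
    for i in range(3):
        a = best_response(p, i)
        if a != p[i]:
            p = p[:i] + (a,) + p[i + 1:]
            changed = True
        values.append(f[p])

assert all(v1 <= v2 for v1, v2 in zip(values, values[1:]))  # monotone in f
assert all(best_response(p, i) == p[i] for i in range(3))   # p is a Nash eq.
```

Termination is guaranteed because the joint action space is finite and $f$ strictly increases with every accepted update, which is exactly the finite-improvement property of potential games.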

### 5.1 Price of Anarchy

We first provide an example showing that $\mathrm{PoA}(\Gamma_{DTE})$ can be arbitrarily large in general when the task specifications lack special structure.

Example 3: Consider the environment in Fig. 1, and let each episode consist of three time steps. Let there be two robots, $r_1$ and $r_2$, both stationed in the same cell. Suppose that we have three tasks, $\tau_1$, $\tau_2$, and $\tau_3$, with identical time windows and different locations. Each task requires the handling of some boxes and can be completed if sufficiently many robots (depending on the weight of the boxes) stay at that location for one time step, i.e., the value functions have the form

$$v_i(c_i(p)) = \begin{cases} \bar{v}_i, & \text{if } \max(c_i(p)) \ge c_i^*,\\ 0, & \text{otherwise.}\end{cases} \qquad (38)$$

Suppose that $c_1^* = c_2^* = 1$ (light boxes), $c_3^* = 2$ (heavy boxes), and $\bar{v}_3 > \bar{v}_1 + \bar{v}_2$. In this setting, using (12), the action set of each robot consists of three trajectories: going to one of the three task locations, staying there for one time step, and coming back. It can be shown that this scenario has three Nash equilibria with the following outcomes: 1) $r_1$ completes $\tau_1$ and $r_2$ completes $\tau_2$, 2) $r_1$ completes $\tau_2$ and $r_2$ completes $\tau_1$, and 3) $r_1$ and $r_2$ together complete $\tau_3$. While the first two cases result in a total value of $\bar{v}_1 + \bar{v}_2$, the last option yields a total value of $\bar{v}_3$. Accordingly, $\mathrm{PoA}(\Gamma_{DTE})$ equals $\bar{v}_3/(\bar{v}_1 + \bar{v}_2)$, which can be arbitrarily large.
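The equilibria in Example 3 can be verified by brute force. The sketch below assumes marginal-contribution utilities (an agent's utility is the value it adds relative to staying at the station) together with illustrative numbers $\bar{v}_1 = \bar{v}_2 = 1$, $\bar{v}_3 = 10$, $c_1^* = c_2^* = 1$, $c_3^* = 2$; the actual values in the example are unspecified:

```python
import itertools

# Brute-force check of Example 3 (illustrative numbers): tasks 1 and 2 need one
# robot (c* = 1), task 3 needs two (c* = 2); values are chosen so v3 > v1 + v2.
vbar = {1: 1.0, 2: 1.0, 3: 10.0}
cstar = {1: 1, 2: 1, 3: 2}

def total_value(profile):
    """f(p): total value of tasks served by at least c* robots."""
    counts = {t: sum(1 for a in profile if a == t) for t in vbar}
    return sum(vbar[t] for t in vbar if counts[t] >= cstar[t])

def utility(profile, i):
    """Marginal-contribution utility of robot i (vs. staying at the station)."""
    absent = profile[:i] + (None,) + profile[i + 1:]
    return total_value(profile) - total_value(absent)

profiles = list(itertools.product([1, 2, 3], repeat=2))
nash = [p for p in profiles
        if all(utility(p, i) >= utility(p[:i] + (a,) + p[i + 1:], i)
               for i in range(2) for a in [1, 2, 3])]

opt = max(total_value(p) for p in profiles)
worst = min(total_value(p) for p in nash)
print(sorted(nash), opt / worst)  # three equilibria; PoA = v3 / (v1 + v2) = 5.0
```

The enumeration recovers exactly the three equilibria described above: each robot serving a distinct light task, or both robots jointly serving the heavy task.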

Example 3 shows that $\mathrm{PoA}(\Gamma_{DTE})$ may be arbitrarily large in general. However, there are also instances of the problem where $\mathrm{PoA}(\Gamma_{DTE})$ is small. We first give a definition and then present a family of such cases with a bound on $\mathrm{PoA}(\Gamma_{DTE})$.

###### Definition 1

(Simple Task) A task $\tau_i$ is simple if it can be completed by one robot in one time step, i.e., the value function has the form

$$v_i(c_i(p)) = \begin{cases} \bar{v}_i, & \text{if } \max(c_i(p)) \ge 1,\\ 0, & \text{otherwise.}\end{cases} \qquad (39)$$

One real-life example of a simple task is an aerial monitoring task that requires taking images from a specific location within a specific time window. When the grid cells correspond to sufficiently small regions, such a monitoring task can be completed by a single drone within a single time step. Similarly, certain pick-up and delivery or manipulation tasks can be completed by a single robot in a single time step.

###### Theorem 5.3

Let $\Gamma_{DTE}$ be designed as per (12) and (20). For a system with $n$ robots and $m$ tasks, if there is only one station and all the tasks are simple, then the price of anarchy of $\Gamma_{DTE}$ is bounded as

$$\mathrm{PoA}(\Gamma_{DTE}) \le \max\Big(\frac{m}{n}, 1\Big). \qquad (40)$$
###### Proof

Consider a single-station game with $n$ robots and $m$ simple tasks. Note that in such a game, all the robots have identical action sets. Let $p^*$ be any Nash equilibrium of the game. We analyze each of the two possible cases separately and show that (40) holds in both cases:

Case 1 - All tasks are completed under $p^*$: In this case, clearly $f(p^*) = \max_{p\in\mathcal{A}^*} f(p)$, since $\max_{p\in\mathcal{A}^*} f(p)$ cannot exceed the total value of the tasks in $\tau$.

Case 2 - Some tasks are not completed under $p^*$: Let $\tau' \subseteq \tau$ be the set of incomplete tasks under $p^*$. Since all the tasks are simple, any robot could switch to a trajectory completing some $\tau_j \in \tau'$ to receive a utility of $\bar{v}_j$. Accordingly, since $p^*$ is a Nash equilibrium, each agent's utility must be at least the value of any incomplete task, i.e.,

$$U_i(p^*) \ge \max_{\tau_j\in\tau'} \bar{v}_j, \quad \forall i \in \{1,\dots,n\}, \qquad (41)$$

which, due to (20), implies that each agent must be receiving a positive utility by being essential for the completion of at least one task (the task would be incomplete without that agent). Since each task is simple, multiple agents cannot be essential for the same completed task. Hence, at least $n$ tasks must be completed. Furthermore, each completed task's value is included in at most one agent's utility. Hence, the total value of the completed tasks cannot be less than the total utility of the agents, i.e.,

$$\sum_{\tau_j\in\tau\setminus\tau'} \bar{v}_j \ge \sum_{i=1}^{n} U_i(p^*). \qquad (42)$$

Since the number of completed tasks is at least $n$, the number of incomplete tasks is at most $m-n$; hence, the total value of the incomplete tasks is upper bounded by $m-n$ times the value of the most valuable incomplete task, i.e.,

$$\sum_{\tau_j\in\tau'} \bar{v}_j \le (m-n)\max_{\tau_j\in\tau'} \bar{v}_j. \qquad (43)$$

Using (41) and (42), we obtain

$$f(p^*) = \sum_{\tau_j\in\tau\setminus\tau'} \bar{v}_j \ge \sum_{i=1}^{n} U_i(p^*) \ge n \max_{\tau_j\in\tau'} \bar{v}_j. \qquad (44)$$

Since $\max_{p\in\mathcal{A}^*} f(p)$ cannot exceed the total value of the tasks in $\tau$, we have

$$\max_{p\in\mathcal{A}^*} f(p) \le \sum_{\tau_j\in\tau\setminus\tau'} \bar{v}_j + \sum_{\tau_j\in\tau'} \bar{v}_j = f(p^*) + \sum_{\tau_j\in\tau'} \bar{v}_j. \qquad (45)$$

Using (43), (44), and (45), we obtain

$$\max_{p\in\mathcal{A}^*} f(p) \le f(p^*) + (m-n)\max_{\tau_j\in\tau'} \bar{v}_j \le f(p^*) + \frac{m-n}{n} f(p^*), \qquad (46)$$

which implies