I Introduction
Robotic platforms (such as drones) hold incredible promise for tasks such as reconnaissance, search and rescue, and delivery of essential items in dangerous environments (e.g., during natural disasters or on battlefields). In recent decades, significant focus has been put toward path planning for robots by solving graph-theoretic problems [7]. A common goal in these problems is to collect rewards (e.g., information or objects) by visiting physical locations. These problems are known as Orienteering Problems (OP), and many variants are applicable to realistic engineering problems, such as environmental inspection [5], multi-agent search [4], and sensor data collection [3]. Typically, variants of OP are NP-hard and must be solved via heuristics such as Monte Carlo methods and linear programming
[2, 7, 6, 12]. Until recently, one aspect often missing from these formulations was that robots come at a cost, and they face the risk of failure when deployed in dangerous environments. Therefore, to maximize the effectiveness of a limited supply of robots in such settings, one needs to weigh the reward of achieving a given task against the risk of robot failure and the cost of replacement. To account for this, one can construct a variant of OP where each edge in the graph has an associated risk of failure. Then, rather than finding a tour which maximizes the collected reward, the goal becomes to find the tour that maximizes the expected reward, i.e., risk-aware OP [4, 11]. These kinds of problems are crucial for logistic operations in regions with adversaries, such as maritime transportation [13]. Recently, it was shown that the solutions to some team risk-aware OP problems have a matroid structure which allows for matroid optimization techniques [4]. However, if the problem introduces some cyclic behavior, like a return depot, then the cycles can break the matroid structure and other optimization techniques are required [10]. The return-depot formulation encompasses many real-world applications like search and rescue, sensor data collection, and package delivery.
In this paper, we focus on a particular OP where a single agent delivers packages (e.g., containing aid or other essential items) from a depot to a set of locations. Delivering a package to a given location earns a reward (specific to that package and location), but also carries a (location-specific) risk of failure for the agent that is delivering the package. Furthermore, packages are desired to be delivered to the various locations repeatedly (i.e., over a set of epochs representing, for example, hours, days, or weeks). Each epoch has the same set of packages as any other epoch. Multi-epoch missions have been explored in the context of intermittent deployment [8, 9]; however, they have not previously been explored in the context of risk-aware task allocation. For each epoch, a dispatcher determines the order of packages for the agent in order to maximize the expected reward from delivering packages while accounting for the risk of failure along the way. The key challenge in this setting is to identify a rigorous strategy for dispatching the agent that accounts for all of these features (drone costs, task-specific rewards, task-specific risks of failure, and the need to plan over multiple epochs). Additionally, we consider both a finite horizon and an infinite horizon setting, where the number of epochs is finite and infinite, respectively. For the finite horizon case, we prove an optimal O(T + n log n) time algorithm, where n is the number of packages and T is the number of epochs. For the infinite horizon case, we map the problem to an isomorphic Markov Decision Process (MDP) and prove that the optimal solution can be found in O(n) time.
Our work differs from the previously mentioned studies in two main ways. First, to the best of our knowledge, maximizing the expected reward across multiple epochs has not been considered in any previous risk-aware OP variant. Second, while the single-agent variant we consider in this paper does form a matroid, our greedy solution is optimal without the need for matroid heuristics.
II Problem Formulation
We consider time as being measured by a set of epochs, where each epoch can represent an hour, a day, a week, etc., as appropriate for the scenario. There is a depot which contains a set of packages at the start of each epoch, where each package is requested to be delivered to a certain location. We assume that the package for each location is unique (i.e., a package for one location cannot be delivered to another location), and thus we use the same index to denote both the package and its destination location. Furthermore, we assume that the set of packages is replenished at the depot at the start of each epoch. Each package has an associated reward if it is successfully delivered to its target location in a given epoch. An agent (e.g., a drone) is available at the depot at the first epoch to deliver the packages to their desired locations. Since the agent may fail during an epoch in our problem formulation, for each epoch we define an indicator variable that equals one if the agent survives to that epoch and zero otherwise. The agent can carry at most one package at a time; after it has delivered the package it is carrying, it must return to the depot to pick up another package. For each package that the agent attempts to deliver, there is a probability that the agent will fail either en route to the target location or on the way back to the depot. We assume that the events representing successful agent traversal of each leg of the trip are independent and have equal probability, given by
for some . Thus, the probability that the agent successfully delivers a package and returns to the depot is given by . We also assume that the events denoting successful agent traversal for different packages are independent. If the agent fails while delivering a package, a cost is incurred and the agent cannot deliver any more packages for the current or any future epoch. During each epoch that the agent is alive, it executes an assigned delivery plan in an order specified by the dispatcher. Because the agent can only carry one package at a time, a package delivery plan for an epoch is represented by an ordered set of cycles, where each package delivery has the form , with representing the th package being assigned for delivery by the agent in that epoch.
This ordered set may not contain every package delivery available for the epoch if the dispatcher considers some packages too risky to deliver. Let denote a package delivery plan for the entire epoch period of the mission, and let denote the expected reward provided by the plan for an epoch, conditioned on the agent surviving to that epoch. The goal of the dispatcher is to maximize the expected reward across all epochs by constructing a package delivery plan. By the law of total expectation, the expected reward across all epochs for a given package delivery plan is:
(1) 
Then, the dispatcher’s goal is to solve the following problem:
Problem 1.
Risk-Aware Single-Agent Package Delivery (RSPD)
(2) 
Consider the set of cycles representing the sequence of package deliveries assigned to the agent in epoch . Assuming the agent has survived to epoch , the conditional probability that the agent completes a given delivery and returns to base depends on the probability that the agent survives every cycle prior to and including that one, i.e., . Let denote the probability that the agent successfully delivers the th package in epoch , given by . Then, the expected reward for is given by . Additionally, depends on the probability of finishing the last cycle of each epoch before . So for , we can calculate by:
(3) 
where is the probability of finishing the last cycle . Because we are guaranteed the agent is alive at the start of the first epoch, we have . If the agent survives to epoch , the conditional expected reward for epoch is calculated by:
(4) 
The first term captures the expected reward of completing the ordered set of tasks in epoch , while the second term captures the expected cost of losing the agent in epoch . Substituting (3) into (2) reveals a telescoping relationship:
(5)  
Equation (5) can be written recursively as follows.
Definition 1 (Inductive Expected Reward for RSPD).
Given a package delivery plan for RSPD, the reward functions are defined recursively as follows:
(6) 
for .
This recursive relationship means we can maximize the total expected reward via a Bellman equation that works backwards from the final epoch:
(7)  
(8) 
Iii Optimal Package Delivery Plans
For any given package delivery plan, the dispatcher decides which packages to assign in each epoch and in what order. Naturally, when comparing two packages, the dispatcher should pick the package with the higher reward and the lower probability of failure first. However, there will be many cases where the most valuable packages are also the most risky. Hence, the dispatcher needs to compare package deliveries by a function of each package's reward and its probability of failure. For each package delivery, we define its reward-to-risk ratio as follows, and later show that it serves as the primary means of comparison between packages.
Definition 2 (Reward-to-Risk Ratio).
Given a package delivery cycle for a package , let the reward of the package be and the probability of successfully delivering the package from the depot be . We define the reward-to-risk ratio of the cycle as:
(9) 
The numerator captures the expected reward for delivering the package, and the denominator captures the probability of the agent failing during the package delivery cycle.
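As a concrete illustration of Definition 2 (with our own variable names and modeling assumptions, since the paper's symbols are not reproduced in this text), suppose a package with reward r is delivered once the outbound leg succeeds, with each leg of the cycle succeeding independently with per-leg probability p. A minimal sketch:

```python
def reward_to_risk_ratio(r: float, p: float) -> float:
    """Reward-to-risk ratio of a delivery cycle (a sketch, our reading).

    Assumes: the package is delivered if the outbound leg succeeds
    (probability p), and the agent survives the full two-leg cycle with
    probability p**2, so it fails during the cycle with probability
    1 - p**2.
    """
    assert 0.0 < p < 1.0
    return (p * r) / (1.0 - p * p)
```

For example, a package with r = 10 and p = 0.9 has ratio 9 / 0.19 ≈ 47.4, while a riskier, higher-reward package with r = 25 and p = 0.7 has ratio 17.5 / 0.51 ≈ 34.3, so the first would be scheduled earlier under the ordering result that follows.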
III-A Ordering of Cycles
Given a set of package deliveries, we first show that the dispatcher can construct a package delivery plan for each epoch by ordering the packages by their reward-to-risk ratios.
Lemma 1.
Consider the optimal package delivery plan for epoch , . Then, the package deliveries are in nonincreasing order of their reward-to-risk ratio, i.e., .
Proof.
By examining (7), we can see that the ordering of the deliveries does not change the probability of finishing all the cycles in epoch , because that probability is a product of the probabilities of completing each cycle. Therefore, we only need to prove that is maximized by ordering the cycles in nonincreasing order of their reward-to-risk ratios. We will prove this by contradiction. Consider an optimal plan and suppose there exist such that . Using (9), we have
which further implies,
(10) 
Now define to be the same as except with the packages in positions and swapped. Define and to be the probabilities of receiving rewards and in the ordering of . Then,
(11)  
Using the definition of and the ordering of and in , we know that and . Thus,
(12) 
By factoring out from all the terms and using (10) in (12), one can see that , which contradicts the optimality of .
∎
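The exchange argument in the proof above can be checked numerically. The sketch below uses hypothetical packages and our reading of the within-epoch reward model (reward earned once the outbound leg succeeds, survival of a full cycle with probability equal to the per-leg probability squared); it brute-forces every ordering of a small package set and confirms that sorting by reward-to-risk ratio attains the maximum expected epoch reward:

```python
from itertools import permutations

def expected_epoch_reward(plan):
    """Expected reward collected in one epoch for an ordered plan.

    plan: sequence of (reward, p_leg) pairs. The agent must survive every
    earlier cycle (probability p_leg**2 each) to attempt a delivery, and a
    reward is earned once the outbound leg succeeds (probability p_leg).
    These modeling choices are our reading of (3)-(4), not the authors' code.
    """
    alive, total = 1.0, 0.0
    for r, p in plan:
        total += alive * p * r   # deliver this package
        alive *= p * p           # also survive the return leg to continue
    return total

def ratio(r, p):
    # reward-to-risk ratio of a single cycle (Definition 2, our reading)
    return p * r / (1.0 - p * p)

packages = [(10.0, 0.9), (25.0, 0.7), (5.0, 0.95), (40.0, 0.6)]
greedy = sorted(packages, key=lambda t: ratio(*t), reverse=True)
best_value = max(expected_epoch_reward(p) for p in permutations(packages))
```

Here `greedy` achieves `best_value`: an adjacent swap of two cycles changes the epoch reward exactly in proportion to the difference of their reward-to-risk ratios, which is the swap argument used in the proof.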
Next, we provide a pair of results that specify exactly which packages should be included in the package delivery plan for each epoch.
Lemma 2.
Consider the optimal package delivery plan that maximizes the Bellman equation (7) for epoch . If the reward-to-risk ratio for package satisfies the following inequality, then it will be included in the optimal plan :
(13) 
Proof.
We will prove this by contradiction. Suppose is not included in the optimal plan . Then, the value function (7) for is
Now, let be the accumulated reward if the delivery cycle for is included in . Let and be the reward and probability of success for delivering (and returning), respectively. We relax the ordering of the packages so that is delivered after all packages in . By doing this, Lemma 1 guarantees this ordering is a lower bound on . We have
Subtracting the value function for gives us
(14)  
(15) 
(16) 
∎
The derivations in the above lemma also show that including any task whose reward-to-risk ratio is equal to will not affect . We now show which tasks will be excluded from the optimal plan.
Lemma 3.
Any package delivery whose reward-to-risk ratio does not satisfy Lemma 2 is not included in the optimal package delivery plan for epoch , i.e.,
(17) 
Proof.
Corollary 1.
By consequence of Lemma 2 and Lemma 3, the optimal expected reward for epoch is obtained by the dispatcher assigning all packages whose reward-to-risk ratio satisfies (13) and ordering them by nonincreasing reward-to-risk ratio when calculating , i.e.,
Corollary 2.
The package delivery plan for epoch is always a subset of the package delivery plan for epoch , i.e.,
Using Corollary 1 and Corollary 2, we construct Algorithm 1 to find the optimal solution to RSPD. Algorithm 1 finishes in O(T + n log n) time because sorting the packages by their reward-to-risk ratio requires only O(n log n). The while loop in line 1 can be accomplished in O(T) time instead of O(nT) time by using Corollary 2 to construct package delivery plans from the indices in the sorted order instead of copying packages for each iteration.
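Since Algorithm 1 itself is not reproduced in this text, the following Python sketch is our reconstruction of it from Lemmas 1-2 and Corollaries 1-2: sort once by reward-to-risk ratio, then run the Bellman recursion backwards while shrinking the included prefix as the future value grows. The inclusion threshold (failure cost plus future expected reward) is our reading of (13), and all variable names and the per-leg probability model are our own assumptions:

```python
def finite_horizon_plan(packages, cost, T):
    """Reconstruction of the finite-horizon algorithm (a sketch).

    packages: list of (reward, p_leg); cost: loss incurred on agent
    failure; T: number of epochs. Returns the optimal expected total
    reward under our reading of the model.
    """
    ranked = sorted(packages, key=lambda t: t[1] * t[0] / (1 - t[1] ** 2),
                    reverse=True)
    # prefix sums: E[m] = expected within-epoch reward of the first m
    # cycles, S[m] = probability of surviving all m cycles
    E, S = [0.0], [1.0]
    for r, p in ranked:
        E.append(E[-1] + S[-1] * p * r)
        S.append(S[-1] * p * p)
    V, m = 0.0, len(ranked)   # V = expected reward of the remaining epochs
    for _ in range(T):        # Bellman recursion, from the last epoch back
        # drop packages whose ratio no longer beats cost + future value
        while m > 0:
            r, p = ranked[m - 1]
            if p * r / (1 - p * p) >= cost + V:
                break
            m -= 1
        V = E[m] - cost * (1 - S[m]) + S[m] * V
    return V
```

Because the prefix index m only ever decreases across the recursion (the earlier-epoch plans are subsets of later ones), the inner while loop does O(n) total work over all T epochs.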
We also note here that the above results (other than Corollary 2) would also hold if the set of package delivery plans changed at each epoch. This is shown by the following corollary.
Corollary 3.
Consider an instance of RSPD where each epoch has a unique set of packages available for delivery. Then, Algorithm 1 calculates the optimal package delivery plan in time .
Proof.
Note that Lemmas 1-3 and Corollary 1 still hold for heterogeneous epochs. However, we cannot use Corollary 2 to remove the dependence on from the time complexity. Analyzing Algorithm 1 with slight modifications to account for heterogeneous epochs and without using Corollary 2 gives a time complexity of . ∎
III-B Infinite Horizon Package Delivery
Up until now, we have only considered finite-duration missions that are guaranteed to end. However, there may be scenarios where the dispatcher cannot assume the mission will ever end (e.g., logistics companies). We will now define an instance of RSPD with an infinite number of epochs and provide an optimal solution to the problem.
Problem 2 (Infinite Horizon RSPD (IHRSPD)).
Consider an instance of RSPD with an infinite number of epochs. Then, the problem is to maximize the following expected reward:
(18) 
Despite being an infinite sum, we can show that every package delivery plan for an instance of IHRSPD has a finite expected reward.
Lemma 4.
Every instance of IHRSPD has a finite expected reward.
Proof.
Consider any package delivery plan whose expected reward is:
(19) 
First, note that if the plan only specifies package deliveries for a finite number of epochs, then the expected reward is trivially finite. Thus, we focus on plans that specify deliveries over an infinite number of epochs.
If for any , , then and . This implies that the infinite summation (19) does not change if we remove from when . Therefore, we can restrict attention to plans which are composed of nonempty package delivery plans for each epoch.
The dispatcher can only choose from a finite number of nonempty package delivery plans for each epoch , i.e., . By (4), the set of all possible values for the expected reward during epoch is also finite and this set does not change with each epoch, i.e., . By the same argument, the set of possible probabilities of success for epoch also has a finite size and does not change with each epoch. Suppose we knew both the package delivery plan that maximizes such that and the package delivery plan that maximizes the probability of success such that .
Now that we have shown that every instance of IHRSPD has a finite expected reward, we introduce the Markov Decision Process (MDP) formalism to prove that the optimal package delivery plan is stationary. To show this, we will define an MDP isomorphic to an instance of IHRSPD. Then, using the main result from [1], we will show that achieving the optimal expected reward for the MDP, and hence for IHRSPD, requires choosing the same package delivery plan for each epoch, i.e., a stationary package delivery plan. We now provide the following MDP formalism, which will likely extend beyond IHRSPD to many other variations of multi-epoch risk-aware task allocation problems.
Definition 3 (Markov Decision Process).
A Markov Decision Process (MDP) is specified by the set composed of the following structures. The state-space is a denumerable set, i.e., at most bijective to the natural numbers . The action set is a metric space containing all possible actions for every state . For each state , there exists a compact action set of possible actions available while in state . is a reward function defined for each pair of states and actions . Finally, the control transition probabilities are specified by a probability function that defines the probability of transitioning from state to state under a given action.
Every MDP begins in an initial state and progresses to future states in the following manner. Given any current state , a controller follows some policy that specifies a desired action based on the current epoch and the state . Then, a reward is earned and the state transitions to a new state with probability . This process repeats indefinitely, so if we consider all possible policies for a fixed set of reward and probability functions, then there exists an expected reward for every policy over an infinite horizon.
Definition 4 (Expected Total Reward for InfiniteHorizon MDP).
Consider an MDP and some policy . Then, the expected total reward for the MDP beginning at state under policy is defined as:
(20) 
We will construct an isomorphic MDP by following the definition of IHRSPD. First, the agent begins in an alive state, and the dispatcher chooses a combination of packages for the agent to deliver while alive. To capture this behavior, construct an MDP whose initial state corresponds to the agent being alive. While the agent is in the alive state, the dispatcher chooses a combination of packages from the set of all packages . This choice is captured by a policy choosing an action vector from the action space , where is the set of all n-bit strings and is the number of tasks in the instance of IHRSPD. For each action vector , the dispatcher treats each component of as a bit that determines whether package is taken for the current epoch . We can define to be the associated package delivery plan for action vector , where the order of the deliveries is given by Lemma 1. Given that the agent attempts to deliver the packages in , there exist chances for the agent to fail in between receiving rewards, i.e., before the first package delivery, between any two package deliveries, or after the last package delivery. Furthermore, given the action vector , we define the set as the set of all n-bit strings that specify the subsets of packages the agent delivered before failure. Then, for each chance of failing, we specify a state where specifies the subset of packages the agent was able to deliver before failure. The probability of delivering only the subset of packages specified by , given that the agent attempted to deliver all the packages specified by , can be captured by a probability function . For all such that , let be the last package completed by the agent before failing, as specified by . Then, where is the probability of completing cycle , and and are the probabilities of successfully traversing between and the depot, and between the depot and , respectively. If , then where is the probability of delivering the first package specified by . However, if , then where is the probability of delivering the last package and is the probability of failing between and the depot. Naturally, if , then . For the case where the agent delivers all the packages and returns to the depot, we specify the state and the probability where is the probability of finishing the last cycle and returning to the depot.
Definition 5 (MIHRSPD Transition Probability Functions).
For all and ,
For all other combinations of states ,
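To make the transition structure concrete, the following sketch simulates one epoch of the constructed MDP. The bit-string encoding and the two-leg failure model are our own rendering of the construction above (the agent would attempt the selected packages in the order given by Lemma 1; here we simply iterate in index order for brevity):

```python
import random

def simulate_epoch(action_bits: str, p_legs, rng: random.Random):
    """Simulate one epoch of the MDP sketch.

    action_bits: n-bit string; bit i == '1' means package i is attempted.
    p_legs[i]: per-leg success probability for package i's cycle (each of
    the two legs succeeds independently with this probability).
    Returns (delivered_bits, survived): the n-bit string of packages the
    agent delivered, and whether it returned to the depot alive.
    """
    delivered = ["0"] * len(action_bits)
    for i, bit in enumerate(action_bits):
        if bit != "1":
            continue
        if rng.random() >= p_legs[i]:      # failed on the outbound leg
            return "".join(delivered), False
        delivered[i] = "1"                 # package i delivered
        if rng.random() >= p_legs[i]:      # failed on the return leg
            return "".join(delivered), False
    return "".join(delivered), True
```

Averaging many such simulated epochs would empirically approximate the transition probabilities in Definition 5 for a given action vector.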
Using the same conventions, we can construct reward functions for the different outcomes in IHRSPD. If the agent fails while delivering the packages specified by , then it will receive a partial reward dependent on : the sum of all rewards for the packages it was able to deliver, minus the cost incurred for failure. Likewise, if the agent does not fail, i.e., , then it will receive the sum of all rewards for the packages specified by and return to the alive state . This is captured by the following reward functions.
Definition 6 (MIHRSPD Reward Functions).
For all and ,
For all other states ,
Given these probability and reward functions, we can properly define the following MDP which is isomorphic to IHRSPD by definition.
Definition 7 (Markov IHRSPD (MIHRSPD)).
An instance of MIHRSPD is given by the set where is an instance of IHRSPD and is an MDP defined as follows. Define the state-space . Let the action space be the set of all n-bit strings , where is the number of tasks in . Define the compact action spaces . Let the reward functions and probability transition functions be specified by Definition 6 and Definition 5, respectively. Then, the problem is to find a policy in the set of all policies that solves
(21) 
Using the main result from [1], if the MDP in Definition 7 satisfies three simple properties and every policy has a finite expected reward, then the optimal policy is stationary, i.e., the same action is chosen in every epoch.
Theorem 1 (Existence of Stationary Policies [1]).
Consider an MDP satisfying the following conditions.

For each , the mapping is continuous in .

For each , is an upper semicontinuous function of .

The number of positive recurrent classes is a continuous function of , where is the set of all stationary policies.
Then, there exists an optimal stationary policy that maximizes (20), i.e., .
A positive recurrent class, as mentioned in Requirement 3 of Theorem 1, is a set of states which will continually be revisited throughout the evolution of the MDP. Requirement 3 is satisfied if, for every stationary policy of a given MDP, the number of positive recurrent classes is a continuous function of the policy. In the case of MIHRSPD, the number of recurrent classes is constant, i.e., . We will prove this in the following lemma and apply Theorem 1 to show that the optimal policy is stationary.
Lemma 5.
By Theorem 1, the optimal policy for an instance of MIHRSPD is stationary.
Proof.
Consider an instance of MIHRSPD constructed from . One can easily verify that the reward functions in Definition 6 and the probability transition laws in Definition 5 are continuous functions over , which satisfies Requirements 1 and 2 of Theorem 1. For Requirement 3, if the agent attempts to take any positive number of packages as its stationary policy, i.e., , the agent will eventually fail to deliver all of its packages and transition to a state . After staying in that state for one epoch, the state will automatically transition to the failure state, , where the MDP will stay forever. Therefore, is both an absorbing state and a recurrent state, because a given subset of actions will always eventually transition the state to (absorbing), after which the state stays there forever (recurrent). Likewise, if the action vector is always chosen as the stationary policy, then can also be considered an absorbing and recurrent state because the probability of transitioning back to is . Then, for any given stationary policy, the number of recurrent classes is . Finally, using a similar proof as in Lemma 4, one can show that the expected reward for is finite because is isomorphic to . Therefore, MIHRSPD always satisfies the conditions of Theorem 1, which implies the optimal policy is stationary. ∎
Corollary 4.
The optimal package delivery plan for an instance of IHRSPD is stationary, i.e., .
Proof.
By Lemma 5, the optimal policy for an MDP constructed from an instance of IHRSPD is stationary. Because is by definition isomorphic to , the optimal package delivery plan for is stationary. ∎
Given that we know the optimal package delivery plan is stationary, we will now find the optimal stationary package delivery plan. To do this, we extend the notion of the reward-to-risk ratio to apply to package delivery plans using epoch risk ratios.
Definition 8 (Epoch Risk Ratio).
Given a package delivery plan for epoch during an instance of RSPD or IHRSPD, let be the probability of finishing the last cycle in . Then, we denote the epoch risk ratio for as and compute it by:
Given the package delivery plan is stationary, the following lemma characterizes the expected reward using the epoch risk ratio.
Lemma 6.
Consider an instance of IHRSPD with a stationary package delivery plan, i.e., . Then, the expected reward for IHRSPD is given by the epoch risk ratio for the package delivery plan for any epoch :
Proof.
Because the package delivery plans for each epoch are equal, , the expected reward for each epoch (4) will also be equal, i.e., . Furthermore, the last cycles in each package delivery plan will also be equal for some . By applying the same assumption to (3), we can see that . Given (18), we can factor out the expectation for every epoch and rewrite (18) as a geometric series, which yields the epoch risk ratio. More specifically, we have
∎
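The geometric-series step in the proof above can be cross-checked numerically. In this sketch, E denotes the per-epoch expected delivered reward, S the per-epoch survival probability, and c the failure cost (our own notation, since the paper's symbols are not reproduced in this text):

```python
def stationary_value(E: float, S: float, c: float) -> float:
    """Infinite-horizon value of a stationary plan (a sketch).

    Each epoch the agent starts alive with probability S**k and earns net
    expected reward E - c * (1 - S); summing over epochs k = 0, 1, 2, ...
    gives a geometric series with ratio S < 1.
    """
    return (E - c * (1 - S)) / (1 - S)

# cross-check the closed form against a long truncated sum of the series
E, S, c = 9.0, 0.81, 5.0
truncated, alive = 0.0, 1.0
for _ in range(2000):
    truncated += alive * (E - c * (1 - S))
    alive *= S
```

The truncated sum and the closed form agree to floating-point precision, which mirrors the factoring argument used in the proof.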
Lemma 7.
The epoch risk ratio for a package delivery plan during epoch is maximal when where is the package delivery with the maximal epoch risk ratio, .
Proof.
We will prove this by contradiction. Suppose the optimal package delivery plan has an epoch risk ratio greater than . However, for each individual package delivery in , the reward-to-risk ratio is less than (the largest reward-to-risk ratio for any individual task). This implies
(22) 
Corollary 5.
By an extension of Lemma 7, the optimal stationary package delivery plan that solves an instance of IHRSPD is given by , where is the package delivery with the highest reward-to-risk ratio in .
The above result shows that, in the case of the infinite horizon package delivery problem, the optimal strategy is to deliver only the most valuable package in each epoch. Thus, one can easily construct an O(n) time algorithm to solve IHRSPD by searching for the package with the greatest reward-to-risk ratio.
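Assuming each package is described by its reward and a per-leg success probability, and letting the argument `cost` denote the failure cost (our own modeling conventions, not the paper's notation), Corollary 5 yields the following one-pass sketch:

```python
def solve_ihrspd(packages, cost):
    """O(n) solution sketch for IHRSPD per Corollary 5 (our reconstruction).

    packages: list of (reward, p_leg). The optimal stationary plan delivers
    only the package with the highest reward-to-risk ratio every epoch.
    Returns that package and the resulting infinite-horizon expected reward.
    """
    best = max(packages, key=lambda t: t[1] * t[0] / (1 - t[1] ** 2))
    r, p = best
    S = p * p                          # per-epoch survival probability
    net = p * r - cost * (1 - S)       # expected net reward per epoch
    return best, net / (1 - S)         # geometric series over epochs
```

The single `max` pass is what gives the O(n) running time: no sorting is needed when only the top-ratio package is ever delivered.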
IV Conclusions and Future Work
In this work, we formulate the Risk-Aware Single-Agent Package Delivery (RSPD) problem, where a dispatcher assigns package deliveries to a single agent based on both the reward associated with each package and the probability of successfully surviving its delivery. If the agent fails mid-delivery, a cost is incurred and the agent cannot make future deliveries. Therefore, the dispatcher must weigh the reward from a delivery against the loss of all future rewards across the multiple epochs of the problem. We solve two variants of this problem: one where there is a finite number of epochs (finite horizon) and one where there is an infinite number of epochs (infinite horizon). For the finite horizon case, we provide an optimal algorithm with time complexity O(T + n log n). For the infinite horizon problem, we construct an isomorphic Markov Decision Process (MDP) to prove that the optimal package delivery plan is to deliver the same package forever. Finding the optimal solution to the infinite horizon case takes only O(n) time. Leveraging this connection to address multi-agent variants of our problem may be of interest for future work.
V Acknowledgments
We would like to acknowledge Lintao Ye for helpful discussions on greedy algorithms.
References
 [1] (2000) A note on the existence of optimal policies in total reward dynamic programs with compact action sets. Mathematics of Operations Research 25 (4), pp. 657–666. External Links: ISSN 0364-765X, 1526-5471, Link Cited by: §III-B, §III-B, Theorem 1.
 [2] (2016) Orienteering problem: a survey of recent variants, solution approaches and applications. European Journal of Operational Research 255 (2), pp. 315–332. External Links: ISSN 0377-2217, Document, Link Cited by: §I.
 [3] (2016) Riskaware planning for sensor data collection. Dissertations  All, 582. External Links: Link Cited by: §I.
 [4] (2017) The matroid team surviving orienteers problem: constrained routing of heterogeneous teams with risky traversal. External Links: Document Cited by: §I, §I.
 [5] (2010) A survey of mobile robots for distribution power line inspection. IEEE Transactions on Power Delivery 25 (1), pp. 485–493. External Links: Document Cited by: §I.
 [6] (1990) The selective travelling salesman problem. Discrete Applied Mathematics 26 (2), pp. 193–207. External Links: ISSN 0166-218X, Document, Link Cited by: §I.
 [7] (1992) The vehicle routing problem: an overview of exact and approximate algorithms. European Journal of Operational Research 59 (3), pp. 345–358. External Links: ISSN 0377-2217, Document, Link Cited by: §I.
 [8] (2019) Submodular optimization for coupled task allocation and intermittent deployment problems. IEEE Robotics and Automation Letters 4 (4), pp. 3169–3176. External Links: Document Cited by: §I.
 [9] (2020) Monitoring over the long term: intermittent deployment and sensing strategies for multirobot teams. In 2020 IEEE International Conference on Robotics and Automation (ICRA), Vol. , pp. 7733–7739. External Links: Document Cited by: §I.

 [10] (2019) On optimal policies for risk-aware sensor data collection by a mobile agent. IFAC-PapersOnLine 52 (20), pp. 321–326. Note: 8th IFAC Workshop on Distributed Estimation and Control in Networked Systems NECSYS 2019. External Links: ISSN 2405-8963, Document, Link Cited by: §I.
 [11] (2017) A multirobot control policy for information gathering in the presence of unknown hazards. In Robotics Research: The 15th International Symposium ISRR, pp. 455–472. External Links: ISBN 978-3-319-29363-9, Document, Link Cited by: §I.
 [12] (1984) Heuristic methods applied to orienteering. Journal of the Operational Research Society 35, pp. 797–809. External Links: Document, ISSN 1476-9360 Cited by: §I.
 [13] (2013) Agent-based model of maritime traffic in piracy-affected waters. Transportation Research Part C: Emerging Technologies 36, pp. 157–176. External Links: ISSN 0968-090X, Document, Link Cited by: §I.