I Introduction
This paper addresses the discovery and synthesis of efficient search trajectories based on learned domain-specific policies and probabilistic inference. We also explore the use of non-uniform state space representations to enhance the performance of policy search methods.
Searching for a lost target or a person in the wilderness can be tedious, labor-intensive, imprecise, and very expensive. In some cases, manual search and rescue operations are also unsafe for the humans involved. These reasons motivate the use of autonomous vehicles for such missions. In many search and rescue problems, timeliness is crucial. As time increases, the probability of success decreases considerably [1], and every hour the effective search radius increases by approximately km [2]. We present an active sampling algorithm that generates an efficient path in real time to search for a lost target, thus making an autonomous search and rescue mission realistic and more beneficial.
Given an approximate region of interest, the first task is to decide where to search. Within the region of interest, some areas have a higher probability of containing the target than others. Data shows that more than of searches based on a probability distribution over the region have been resolved successfully within a short duration [3]. We model the spatial field of search using classic Gaussian mixture models based on sightings of the lost target. The modeled probability distribution map over the search region is used as an input for planning the search paths. The traditional way to assign probabilities to search regions is with a distance ring model: a simple ‘bullseye’ formed by the , , , and probability circles [3]. There are more advanced methods that propose a Bayesian model that utilizes publicly available terrain data to help model lost person behaviors, enabling domain experts to encode uncertainty in their prior estimations [4]. Since modeling of the search region probability distribution is not the focus of this paper, we use a simpler approach depending only on the location and time of the target sightings. Our search algorithm would perform efficiently independent of the underlying probability distribution model.
Once the probability distribution of the search region is mapped, the task is to search this region efficiently such that the regions with a high probability of finding the target (hotspots) are searched in the early phase of the exploration. Many complete coverage approaches use line-sweep-based techniques [5, 6, 7], which are not suitable for search and rescue tasks because they are not timely. Cooper et al. discuss the relationship between area coverage and search effort density. They claim that any solution to the optimal search density problem is also a solution to the optimal coverage problem [8]. We present a proof showing that coverage of the search region to reduce probability mass is equivalent to searching for the target. We have previously demonstrated an algorithm to selectively cover a region based on an underlying reward distribution [9, 10]. This technique, however, does not scale up smoothly for larger regions because of its computational complexity.
In this paper, we present an adaptive sampling technique that generates efficient paths to search for the missing target as fast as possible by performing non-uniform sampling of the search region. The generated path minimizes the expected time to locate the missing target by visiting regions of high search probability using non-myopic path planning based on reinforcement learning. A non-myopic plan is one that accounts for possible observations that can be made in the future [11]. Fig. 1 presents an overview of the whole path generation system. Our algorithm is trained with a generic model of the probability distribution, which can be a map generated by a Gaussian mixture. The learnt parameters are then used on the generated lost-target probability distribution map to come up with an action plan (policy). For a given state, an action is chosen according to the probability distribution defined by the policy, and thus a path is generated for the searcher robot. Training the system with discounted rewards helps the planner to achieve paths that cover hotspots at the earlier stages of the search task. A major contribution of this paper is the use of non-uniform state aggregation for policy search in the context of robotics.
The key feature of our search algorithm is the ability to generate action plans for any new probability distribution using the parameters learnt on other similar-looking distributions, i.e., an action planner trained on generic search and rescue distribution models is capable of planning high-reward paths on a new probability distribution map without being trained again.
II Problem Formulation
The search region is a continuous two-dimensional area of interest with user-defined boundaries. The spatial search region is discretized into uniform grid cells, such that the robot’s position and the target’s position can be represented by two pairs of integers. Each grid cell is assigned a prior probability value of the target being present in that cell.
The aim is to find the target in as short a time as possible. Formally, we specify this objective as maximizing $\gamma^{T}$, where $T$ is the time elapsed until the target is found and $\gamma$ is a condition-dependent constant. This objective reflects the assumption that the probability of the target’s condition becoming bad is constant, and the aim is to locate the target in good condition.
We will specify the robot’s behavior using a parametrized policy. This is a conditional probability distribution $\pi_\theta(a \mid s)$ that maps a description of the current state of the search $s$ to a distribution over possible actions $a$. Our aim will be to automatically find good parameters $\theta$, after which the policy can be deployed without additional training on new problems. The maximization objective then becomes:
$$\max_{\theta} \; \mathbb{E}_{\pi_\theta}\left[\gamma^{T}\right], \tag{1}$$
where the expectation is taken with respect to trajectories generated by the policy $\pi_\theta$.
III Modeling of the Search Area
We model the spatial field of search using a classic Gaussian mixture model. The map is initialized according to these generated prior probabilities. An example map is illustrated in Fig. 2. Our search algorithm can take any map that reflects the prior belief over where the target is. In this work, we test our method on randomly generated Gaussian mixtures, but the algorithm could be trained on any type of distribution.
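As an illustration of such a prior map, the following minimal sketch builds a normalized grid from a mixture of Gaussians. The function name `gaussian_mixture_map`, the isotropic components, and the example weights, centers, and widths are all assumptions made for this sketch and are not taken from our experiments.
```python
import math


def gaussian_mixture_map(width, height, components):
    """Return a height x width grid of prior target probabilities.

    components: list of (weight, cx, cy, sigma) tuples describing
    isotropic Gaussians (a simplifying assumption of this sketch).
    """
    grid = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            for weight, cx, cy, sigma in components:
                d2 = (x - cx) ** 2 + (y - cy) ** 2
                grid[y][x] += weight * math.exp(-d2 / (2.0 * sigma ** 2))
    total = sum(sum(row) for row in grid)
    # Normalize so that the cell values sum to 1.
    return [[value / total for value in row] for row in grid]


# Hypothetical example: two hotspots on a 20 x 20 grid.
prior = gaussian_mixture_map(20, 20, [(0.6, 5, 5, 2.0), (0.4, 14, 12, 3.0)])
```
Any other non-negative map could be substituted here, since the search algorithm only requires a per-cell prior.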
IV Policy Gradient Based Search
Finding a sequence of actions that maximizes a long-term objective could be done using dynamic programming. However, in our formulation the system state is described using a map containing the per-cell probability of the target being present, and this map changes as areas are visited by the agent. The result is an extremely large state space where dynamic programming is impracticable, especially if the time to solve each particular problem is limited.
Instead, we turn to methods that directly optimize the policy parameters $\theta$ based on (simulated) experience. To apply these methods, we will first formalize the search problem as a Markov Decision Process (MDP).
The shaded region indicates the standard deviation over five trials on three different sized worlds.
IV-A Formalizing search as MDP
A Markov Decision Process is a formal definition of a decision problem that is represented as a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)$, where $\mathcal{S}$ and $\mathcal{A}$ are the state and action space, $\mathcal{P}$ models transition dynamics based on the current state and action, and $\mathcal{R}$ defines the reward for the current state-action pair. $\gamma$ is a discount factor that reduces the desirability of obtaining a reward $t$ time steps from now rather than in the present by a factor of $\gamma^{t}$. The objective is then to optimize the expected discounted cumulative reward $J = \mathbb{E}\left[\sum_{t=0}^{H} \gamma^{t} r_t\right]$, where $H$ is the optimization horizon.
In the proposed approach, we take the state to include the position of the robot as well as the map containing the per-location presence probability for the target. The actions we consider are for the robot to move to and scan the cell North, East, South or West of its current location. Transitions deterministically move the agent in the desired direction. When scanning does not reveal the target, the probability mass of the current cell is then reduced to 0 (see footnote 1).
Footnote 1: More complex models specify a probability of detection (POD) given that robot and target are in the same area [8]. For now, our work assumes a POD of 1. A more realistic POD could easily be included in our approach by updating the probability mass in the cell accordingly. Furthermore, note that as probability mass is cleared, the numbers in the map no longer sum up to 1, so the map is an unnormalized probability distribution.
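The transition and map update described above can be sketched as follows. This is a minimal illustration under stated assumptions: the names `ACTIONS` and `step` are hypothetical, the y-axis is taken to grow southward, moves off the grid are clamped at the boundary, and the probability of detection is 1.
```python
# Grid offsets (dx, dy) for the four actions; y grows southward (assumption).
ACTIONS = {"N": (0, -1), "E": (1, 0), "S": (0, 1), "W": (-1, 0)}


def step(belief, position, action):
    """Move one cell, scan it, and clear its probability mass (POD = 1).

    belief is a list of rows and is modified in place; position is (x, y).
    Returns the new position and the reward, i.e. the mass just cleared.
    Clamping at the grid boundary is an assumption of this sketch.
    """
    dx, dy = ACTIONS[action]
    x = min(max(position[0] + dx, 0), len(belief[0]) - 1)
    y = min(max(position[1] + dy, 0), len(belief) - 1)
    reward = belief[y][x]
    belief[y][x] = 0.0  # the cell was scanned and the target was not there
    return (x, y), reward


belief = [[0.1, 0.2], [0.3, 0.4]]
pos, r_first = step(belief, (0, 0), "E")   # scans the cell holding 0.2
pos, r_back = step(belief, pos, "W")       # scans the cell holding 0.1
pos, r_again = step(belief, pos, "E")      # revisits a cleared cell
```
Revisiting a cleared cell yields no reward, which is what discourages the policy from retracing its own path.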
The most intuitive definition of the reward function corresponding to Eq. (1) would give a reward of 1 for recovering the target, coupled with a discount factor $\gamma$. However, this reward function has a high variance, as with the exact same search strategy the target could be found quickly, slowly, or not at all due to chance. Instead, we reward the robot for scanning cells with a high probability of containing the target. This does not introduce bias in the policy optimization, while reducing statistical variance (proofs are given in the Appendix). A lower statistical variance typically allows optimal policies to be learned using fewer sampled trajectories.
IV-B Policy Gradient Approach
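Policy gradient methods use gradient ascent for maximizing the expected return $J$. The gradient of the expected return ($\nabla_\theta J$) guides the direction of the parameter ($\theta$) update. The policy gradient update is given by,
$$\theta_{k+1} = \theta_k + \alpha \nabla_\theta J,$$
where $\alpha$ is the learning rate. The likelihood ratio policy gradient [12] is given by,
$$\nabla_\theta J = \mathbb{E}\left[\left(\sum_{t=0}^{H} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right)\left(\sum_{j=0}^{H} \gamma^{j} r_j\right)\right]. \tag{2}$$
However, this expression depends on the correlation between actions and previous rewards. These terms are zero in expectation, but cause additional variance. Ignoring these terms yields lower-variance updates, which are used in the Policy Gradient Theorem (PGT) algorithm and the GPOMDP algorithm [13, 14, 15]. Accordingly, the policy gradient is given by,
$$\nabla_\theta J \approx \frac{1}{m}\sum_{i=1}^{m}\sum_{t=0}^{H} \nabla_\theta \log \pi_\theta\left(a_t^{(i)} \mid s_t^{(i)}\right)\left(\sum_{j=t}^{H} \gamma^{j} r_j^{(i)} - b\right). \tag{3}$$
In this equation, the gradient is based on $m$ sampled trajectories from the system, with $s_t^{(i)}$ the state at the $t$-th time step of the $i$-th sampled roll-out. Furthermore, $b$ is a variance-reducing baseline. In our experiments, we set it to the observed average reward.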
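A minimal sketch of a reward-to-go gradient estimate with a constant baseline, followed by one gradient ascent step, is given below. The function names, the roll-out data layout, and the convention of discounting each reward by its absolute time step are assumptions of this sketch rather than a description of our implementation.
```python
def policy_gradient_estimate(rollouts, gamma, baseline=0.0):
    """Average reward-to-go gradient estimate over sampled roll-outs.

    rollouts: list of trajectories, each a list of (grad_log_pi, reward)
    pairs, where grad_log_pi is the gradient of log pi(a_t | s_t) as a list.
    """
    n = len(rollouts[0][0][0])
    grad = [0.0] * n
    for trajectory in rollouts:
        # to_go[t] = sum over j >= t of gamma^j * r_j (rewards earned
        # before step t are ignored, as they do not depend on action t).
        to_go = [0.0] * (len(trajectory) + 1)
        for t in range(len(trajectory) - 1, -1, -1):
            to_go[t] = (gamma ** t) * trajectory[t][1] + to_go[t + 1]
        for t, (grad_log_pi, _) in enumerate(trajectory):
            for k in range(n):
                grad[k] += grad_log_pi[k] * (to_go[t] - baseline)
    return [g / len(rollouts) for g in grad]


def ascent_step(theta, grad, learning_rate):
    """One gradient ascent update of the policy parameters."""
    return [t + learning_rate * g for t, g in zip(theta, grad)]


# Hypothetical two-step roll-out with two parameters.
rollouts = [[([1.0, 0.0], 1.0), ([0.0, 1.0], 2.0)]]
grad = policy_gradient_estimate(rollouts, gamma=0.5)
theta = ascent_step([0.0, 0.0], grad, learning_rate=0.1)
```
In the example, the first action is credited with both rewards while the second action is credited only with the second, discounted reward.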
IV-C Policy design
We consider a policy that is a Gibbs distribution in a linear combination of features given by
$$\pi_\theta(a \mid s) = \frac{\exp\left(\theta^{\top}\phi(s,a)\right)}{\sum_{a'} \exp\left(\theta^{\top}\phi(s,a')\right)}, \tag{4}$$
where $\phi(s,a)$ is an $n$-dimensional feature vector characterizing the state-action pair ($s, a$) and $\theta$ is an $n$-dimensional parameter vector. This is a commonly used policy in reinforcement learning approaches [14]. The final feature vector $\phi(s,a)$ is formed by concatenating a vector $\delta_{a a'}\,\psi(s)$ for every action $a'$, where $\psi(s)$ is a feature representation of the state space, and $\delta_{a a'}$ is the Kronecker delta. Thus, the final feature vector has one block of entries per action, and the blocks corresponding to non-chosen actions are zero at any one time step.
We consider two types of feature representations ($\psi(s)$) for our approach. The first feature construction is to consider a vector that represents all rewards in a robot-centric frame, as illustrated in Fig. 2(a). This feature vector grows in length as the size of the search region increases, resulting in higher computation times for bigger regions. In the second design we consider a multi-resolution feature aggregation resulting in a fixed number of features irrespective of the size of the search region. In this case, features corresponding to larger cells are assigned a value equal to the average of the map values that fall in that cell. In multi-resolution aggregation, the feature cells grow in size along with the distance from the robot location, as depicted in Fig. 2(b). Thus, areas close to the robot are represented with high resolution and areas further from the robot are represented in lower resolution. The intuition behind this feature design is that the location of nearby target probabilities is important to know exactly, while the location of faraway target probabilities can be represented more coarsely. The multi-resolution feature design is also suitable for bigger worlds as it scales logarithmically with the size of the world.
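The Gibbs policy over block-structured features can be sketched as follows. The function names are hypothetical, and the gradient of the log-policy uses the standard identity for softmax policies with linear scores (chosen features minus expected features); this is an illustrative sketch, not our implementation.
```python
import math


def state_action_features(state_features, action_index, num_actions):
    """Copy the state features into the block of the given action.

    All blocks belonging to other actions stay zero (Kronecker delta).
    """
    k = len(state_features)
    phi = [0.0] * (k * num_actions)
    phi[action_index * k:(action_index + 1) * k] = state_features
    return phi


def gibbs_policy(theta, state_features, num_actions):
    """Action probabilities proportional to exp(theta . phi(s, a))."""
    scores = []
    for a in range(num_actions):
        phi = state_action_features(state_features, a, num_actions)
        scores.append(sum(t * f for t, f in zip(theta, phi)))
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


def grad_log_policy(theta, state_features, action_index, num_actions):
    """Gradient of log pi(a | s): phi(s, a) minus the expected phi."""
    probs = gibbs_policy(theta, state_features, num_actions)
    expected = [0.0] * len(theta)
    for a, p in enumerate(probs):
        phi = state_action_features(state_features, a, num_actions)
        for i, f in enumerate(phi):
            expected[i] += p * f
    chosen = state_action_features(state_features, action_index, num_actions)
    return [c - e for c, e in zip(chosen, expected)]


uniform = gibbs_policy([0.0] * 8, [1.0, 2.0], num_actions=4)
```
With all parameters at zero, every action is equally likely, so an untrained policy explores the four directions uniformly.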
Both these feature designs were tested with the policy-gradient-based search algorithm, and we found that the multi-resolution aggregated features produce higher accumulated rewards and better discounted rewards (Fig. 2(c), 2(d)). Also, the computation becomes quadratically more expensive as the size of the grid world increases (Fig. 2(e)). Based on these results, a non-uniform, multi-resolution, robot-centric feature design is more beneficial for efficient searching. These results further strengthen our belief that the immediate actions are influenced by the nearer rewards and that the farther low-resolution features enhance non-myopic planning of the whole path. Theoretically, replacing the individual cell values by their averages causes some information loss. However, to plan for faraway target probabilities, a coarse location is enough. In fact, the results in Fig. 2(c) and Fig. 2(d) show that, perhaps due to the regularizing effect of averaging, multi-resolution grids perform better than the uniform representation even after extensive training.
In the remainder of this paper, we only consider the multi-resolution robot-centric feature design in our policy gradient search algorithm. The aggregated feature design is only used to achieve better policy search; the robot action is still defined at the grid-cell level.
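One possible way to realize a robot-centric multi-resolution aggregation is sketched below. The specific layout is an assumption made for illustration only: each level contributes the eight square blocks surrounding the robot, with block side length tripling per level, and cells outside the map counted as zero. The function name `multires_features` is hypothetical.
```python
def multires_features(belief, robot, levels):
    """Robot-centric features: 8 averaged blocks per level.

    At level k the blocks have side 3**k, so the ring at level k exactly
    surrounds the square covered by all finer levels. Cells outside the
    map contribute zero (an assumption of this sketch).
    """
    height, width = len(belief), len(belief[0])
    rx, ry = robot
    features = []
    for k in range(levels):
        side = 3 ** k
        half = side // 2
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                if dx == 0 and dy == 0:
                    continue  # the centre is covered by finer levels
                cx, cy = rx + dx * side, ry + dy * side
                total = 0.0
                for y in range(cy - half, cy + half + 1):
                    for x in range(cx - half, cx + half + 1):
                        if 0 <= x < width and 0 <= y < height:
                            total += belief[y][x]
                features.append(total / (side * side))
    return features


flat = [[1.0] * 9 for _ in range(9)]
features = multires_features(flat, robot=(4, 4), levels=2)
```
The number of features is 8 per level, and the number of levels needed to span a map grows only logarithmically with its side length, which matches the scaling argument above.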
V Experimental Results and Discussion
In this section, we will introduce the experimental setup and baseline methods, before presenting and discussing the experimental results.
Figure caption: results of running the three search methods for a fixed number of time steps.
V-A Setup
We generated a generic training scenario with probability distributions for a lost target using Gaussian mixtures. We model the search space of a simulated aerial search vehicle as a grid world with each grid cell spanning . We used 20 rollouts in every iteration of the training phase. A discount factor of was used in these experiments. During the test phase, the action with maximum probability is chosen at a given state. Fig. 2 illustrates the probability distribution grid used for training our policy-based searcher. However, as mentioned in Section III, the searcher algorithm could be trained on any other type of distribution too. Two significantly different test scenarios are presented to evaluate and compare the search algorithms. The first test scenario (Fig. 3(a)) comprises two Gaussians imitating the probability distribution of a lost target. The second test scenario (Fig. 3(b)) cannot be represented as a mixture of a few Gaussians.
V-B Baselines
Traditionally, exhaustive sampling of a partially observable, obstacle-free region employs a boustrophedon path [16]. The boustrophedon or lawnmower path is the approach a farmer takes when using an ox to plow a field, making back-and-forth straight passes over the region in alternating directions until the area is fully covered. Another efficient search pattern reported in the robotic search literature is the spiral, which minimizes the time to find a single stationary lost target in several idealized scenarios [17, 18].
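The two baseline patterns can be sketched on a grid as follows. The function names, the choice of start cell, and the handling of map boundaries (spiral cells outside the grid are simply skipped, so consecutive waypoints may not be adjacent near the border) are assumptions of this sketch.
```python
def boustrophedon_path(width, height):
    """Back-and-forth sweep over the grid, row by row."""
    path = []
    for y in range(height):
        xs = range(width) if y % 2 == 0 else range(width - 1, -1, -1)
        path.extend((x, y) for x in xs)
    return path


def spiral_path(width, height, start):
    """Outward square spiral from a start cell inside the grid.

    Spiral cells that fall outside the grid are skipped (assumption).
    """
    x, y = start
    path = [(x, y)]
    dx, dy = 1, 0
    run = 1
    while len(path) < width * height:
        for _ in range(2):  # two legs share each run length
            for _ in range(run):
                x, y = x + dx, y + dy
                if 0 <= x < width and 0 <= y < height:
                    path.append((x, y))
            dx, dy = -dy, dx  # turn by 90 degrees
        run += 1
    return path


sweep = boustrophedon_path(3, 2)
spiral = spiral_path(3, 3, start=(1, 1))
```
Both functions return every grid cell exactly once, so they differ only in the order in which probability mass is collected.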
V-C Results and discussion
Spiral search can generate efficient, or even optimal, paths under suitable conditions such as the first test scenario (Fig. 4(b)). Its satisfactory performance is only assured for a restricted class of unimodal distributions (as opposed to that in Fig. 4(e)). We compare these algorithms based on the rewards collected by removing the probability mass in the search region, corresponding to the (discounted) probability of finding the target. Our goal is to reduce the probability mass as fast as possible by visiting regions with high probability mass (hotspots). We use discounted rewards as a metric to measure the timeliness of an algorithm.
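The discounted-reward metric used for this comparison can be sketched as below. The function name `discounted_reward` and the toy map are hypothetical; the first visited cell is assumed to be undiscounted.
```python
def discounted_reward(belief, path, gamma):
    """Sum of gamma^t times the probability mass cleared at step t."""
    remaining = [row[:] for row in belief]  # leave the input map untouched
    total = 0.0
    for t, (x, y) in enumerate(path):
        total += (gamma ** t) * remaining[y][x]
        remaining[y][x] = 0.0  # mass can only be collected once
    return total


toy_map = [[0.1, 0.9]]
hotspot_last = discounted_reward(toy_map, [(0, 0), (1, 0)], gamma=0.5)
hotspot_first = discounted_reward(toy_map, [(1, 0), (0, 0)], gamma=0.5)
```
Both paths clear the same total mass, but visiting the hotspot first scores higher, which is why this metric captures timeliness rather than coverage alone.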
The plots in Fig. 4(g) illustrate how our proposed method is the fastest to cover the target probability mass on both test scenarios, even though all methods eventually visit all states of potential interest. Spiral search performs on par with our algorithm in terms of total rewards accumulated. Nonetheless, the policy-based searcher outperforms both the spiral and boustrophedon search techniques in total discounted rewards on the second test scenario by a significant margin. Thus the policy-based searcher exhibits greater timeliness, which is a key requirement for any search algorithm to be applicable in search and rescue missions.
It is important to note that despite the dissimilarity between the training scenario and the test scenario, the policy-based search algorithm achieves better performance than the other methods.
VI Conclusions
We presented an optimization algorithm that results in a policy that can generate non-myopic search plans for novel environments in real time. Our policy-gradient-based search algorithm is well suited for applications in search and rescue because of its ability to come up with a search plan on the go, based on the new probability distribution of the lost target, with no time wasted on retraining. Other time-critical applications, such as aerial surveys after a calamity, water surface surveys to monitor and contain algal blooms, and searching for remains under the ocean after an accident, can use the presented algorithm to explore the region of interest.
In the near future, we would like to enhance the performance of the search algorithm by exploring different feature aggregation techniques and designing an adaptive aggregation that can combine features in real time according to the observations made during the survey.
Acknowledgment
We would like to thank the Natural Sciences and Engineering Research Council (NSERC) for their support through the Canadian Field Robotics Network (NCFRN).
Appendix
First, we want to show that giving a reward for clearing probability mass instead of locating the target does not bias the objective function.
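Proposition 1
$$\mathbb{E}_{\tau, x^*}\left[\gamma^{T}\right] = \mathbb{E}_{\tau}\left[\sum_{t=0}^{H} \gamma^{t} r_t\right],$$
where $\gamma$ is the discount factor, $x^*$ is the target’s location, and the proxy reward $r_t$ is equal to the probability of the target being at the robot’s location $x_t$ according to the map $p$ (see footnote 3). $T$ is the time until the target is found if the target is found within $H$ time steps, or $\infty$ otherwise (see footnote 4).
Footnote 3: The reward map is not normalized, yet after clearing a fraction $c$ of probability mass, the probability that the target has not been found yet is $1-c$. If the target were not found yet, the normalized probability that the target is in cell $x$ is $p(x)/(1-c)$. So the probability of finding the target in $x$ indeed equals $p(x)$.
Footnote 4: The proof here assumes a static target for notational simplicity, but can be generalized to dynamic targets by making both $x^*$ and $p$ time-step dependent, and calculating expected values over all $x^*_t$ jointly.
For any trajectory,
$$\mathbb{E}_{x^*}\left[\gamma^{T}\right] = \mathbb{E}_{x^*}\left[\sum_{t=0}^{H} \gamma^{t}\, \delta_{t,T}\right] = \sum_{t=0}^{H} \gamma^{t}\, \mathbb{E}_{x^*}\left[\delta_{t,T}\right] = \sum_{t=0}^{H} \gamma^{t} r_t,$$
where $\delta$ is the Kronecker delta. Since this equality holds for any trajectory, it must hold for a linear combination of trajectories.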
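The equality for a single trajectory can be checked numerically by enumerating all target locations. The sketch below is an illustration only: the function names and the three-cell prior are hypothetical, and time steps are counted from 0 at the first scanned cell.
```python
def expected_gamma_T(prior, path, gamma):
    """E[gamma^T] over target locations, computed by enumeration.

    A target that is never visited contributes 0 (T is infinite).
    """
    total = 0.0
    for target, probability in prior.items():
        if target in path:
            # The target is found at the first visit to its cell.
            total += probability * gamma ** path.index(target)
    return total


def proxy_return(prior, path, gamma):
    """Discounted sum of the probability mass cleared along the path."""
    remaining = dict(prior)
    total = 0.0
    for t, cell in enumerate(path):
        total += gamma ** t * remaining.get(cell, 0.0)
        remaining[cell] = 0.0
    return total


prior = {(0, 0): 0.2, (1, 0): 0.5, (2, 0): 0.3}
path = [(1, 0), (0, 0), (1, 0)]  # revisits a cell and never reaches (2, 0)
lhs = expected_gamma_T(prior, path, gamma=0.9)
rhs = proxy_return(prior, path, gamma=0.9)
```
Both quantities agree for this path even though it revisits a cell and leaves one cell unvisited, which illustrates why the proxy reward leaves the objective unbiased.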
Now, we want to show that the variance of gradient estimates using the proxy reward for clearing probability mass is lower than that of the gradient estimates using the original objectives.
Proposition 2
We use the definition of the variance and rearrange terms as in the GPOMDP method [13] to reformulate the proposition. We introduce a shorthand to make the equations more readable, and obtain
(5)  
(6)  
(7)  
(8) 
The terms in (6) and (8) are expectations of unbiased estimates of the gradient of the respective objective, and so equal the gradient of the expected value of the respective objective. By Proposition 1, these expected values are the same, so (6) and (8) cancel each other out. Writing out the squares in (5) and (7) and combining like terms yields an inequality in which we introduced the shorthand
and used the fact that , are independent of the target location . Note that is just the probability that , and since the agent cannot find the target twice, is nonzero only if . Thus, the proposition is equivalent to
(9) 
The lefthand side of this inequality can be expressed and upperbounded as follows:
(10)  
(11) 
The inequality is due to the sum of rewards always being at most 1 (as the rewards denote probabilities of finding the target, and would sum up to 1 if and only if the robot visits all cells where probability mass was initially present in this trajectory). Note that the expected value in (11) is non-negative: it is a weighted sum of squared terms, where all weights are non-negative (again, due to their interpretation as probabilities). Thus, (11) is non-positive, so (10) must be non-positive as well, confirming (9).
References
 [1] L. D. Pfau, “Wilderness search and rescue: Sources and uses of geospatial information,” Master’s thesis, Pennsylvania State University, 2013.
 [2] T. J. Setnicka, Wilderness search and rescue. Boston, US: Appalachian Mountain Club, 1980, no. 363.348 S495w.
 [3] R. J. Koester, Lost Person Behavior: A search and rescue guide on where to look—for land, air and water. Charlottesville, VA: dbS Productions LLC, 2008.
 [4] L. Lin and M. A. Goodrich, “A Bayesian approach to modeling lost person behaviors based on terrain features in wilderness search and rescue,” Computational and Mathematical Organization Theory, vol. 16, no. 3, pp. 300–323, 2010.
 [5] W. H. Huang, “Optimal line-sweep-based decompositions for coverage algorithms,” in Proc. IEEE Int. Conf. Robotics and Automation (ICRA), vol. 1, 2001, pp. 27–32.
 [6] A. Xu, C. Viriyasuthee, and I. Rekleitis, “Optimal complete terrain coverage using an unmanned aerial vehicle,” in IEEE Int. Conf. Robotics and Automation (ICRA), 2011, pp. 2513–2519.
 [7] C. Berger, M. Wzorek, J. Kvarnström, G. Conte, P. Doherty, and A. Eriksson, “Area coverage with heterogeneous UAVs using scan patterns,” in Proc. IEEE Int. Symp. Safety, Security, and Rescue Robotics (SSRR), 2016, pp. 342–349.
 [8] D. C. Cooper, J. R. Frost, and R. Q. Robe, “Compatibility of land SAR procedures with search theory,” Potomac Management Group, Alexandria, VA, Tech. Rep., 2003.
 [9] S. Manjanna, N. Kakodkar, M. Meghjani, and G. Dudek, “Efficient terrain driven coral coverage using Gaussian processes for mosaic synthesis,” in Conf. Computer and Robot Vision (CRV). IEEE, 2016, pp. 448–455.
 [10] S. Manjanna and G. Dudek, “Data-driven selective sampling for marine vehicles using multi-scale paths,” in IEEE/RSJ Int. Conf. Intelligent Robots and Systems (IROS), Vancouver, Canada, September 2017, pp. 6111–6117.
 [11] A. Singh, A. Krause, and W. J. Kaiser, “Nonmyopic adaptive informative path planning for multiple robots.” in IJCAI, vol. 3, 2009, p. 2.
 [12] R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” in Reinforcement Learning. Springer, 1992, pp. 5–32.
 [13] J. Baxter and P. L. Bartlett, “Infinite-horizon policy-gradient estimation,” Journal of Artificial Intelligence Research, vol. 15, pp. 319–350, 2001.
 [14] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour, “Policy gradient methods for reinforcement learning with function approximation,” in Advances in Neural Information Processing Systems, 2000, pp. 1057–1063.
 [15] M. P. Deisenroth, G. Neumann, J. Peters et al., “A survey on policy search for robotics,” Foundations and Trends® in Robotics, vol. 2, no. 1–2, pp. 1–142, 2013.
 [16] H. Choset and P. Pignon, Coverage Path Planning: The Boustrophedon Cellular Decomposition. London: Springer London, 1998, pp. 203–209.
 [17] S. Burlington and G. Dudek, “Spiral search as an efficient mobile robotic search technique,” in Proceedings of the 16th National Conf. on AI, Orlando Fl, 1999.
 [18] M. Meghjani, S. Manjanna, and G. Dudek, “Multi-target rendezvous search,” in IEEE/RSJ Int. Conf. Intelligent Robots and Systems (IROS), 2016, pp. 2596–2603.