1 Introduction
In mobile robot path planning, the terrain is represented as a directed graph where the vertices are robot positions, the edges correspond to possible robot moves, and every edge is assigned the corresponding traversal time. A moving strategy specifies how the robot moves from vertex to vertex, and it can be deterministic or randomized.
In this work, we concentrate on infinite-horizon path planning problems where the robot performs a recurring task such as surveillance or periodic maintenance. The standard tools for specifying infinite-horizon objectives are frequency-based objective functions parameterized by the limit frequency of visits to the vertices. Unfortunately, these functions are insufficient for expressing subtle optimization criteria used in specific areas such as robotic patrolling, and cannot faithfully capture all crucial properties of randomized strategies such as deviations/variances of relevant random variables. The latter deficiency represents a major problem often resolved indirectly by considering only deterministic strategies, even in scenarios where randomization achieves strictly better performance and is easy to implement (see Example 1).
Our contribution. We design and investigate a class of recurrent reachability objective functions based on the following parameters:

(1) the limit frequency of edges;
(2) the expected time to hit a given set of vertices from another given vertex;
(3) the expected square of the time to hit a given set of vertices from another given vertex.
Note that using (1), one can express the frequency of visits to vertices, while (2) and (3) make it possible to express the variance and standard deviation of the time to hit a given set of vertices from another given vertex. Thus, the recurrent reachability objective functions can “punish” large deviations from the expected values, allowing for balancing performance with stochastic stability.
Computing an optimal moving strategy for a given recurrent reachability objective is computationally hard. One can easily reduce the NP-hard Hamiltonian cycle problem to the problem of deciding whether the minimum of a certain recurrent reachability objective function is bounded by a given constant. This means there is no efficient strategy synthesis algorithm with optimality guarantees unless P = NP.
We design a strategy synthesis algorithm based on gradient descent applicable to arbitrary recurrent reachability objectives involving piecewise differentiable continuous functions. The algorithm efficiently computes a finite-memory randomized strategy where the memory is used to “remember” some relevant information about the history of visited vertices. Although the (sub)optimality of this strategy is not guaranteed for the reasons mentioned above, our experiments show that the algorithm can solve instances requiring non-trivial insights and produce solutions close to theoretical optima.
Thus, we obtain a general and efficient optimization framework for an expressively rich class of non-linear infinite-horizon objectives capable of solving problems beyond the reach of existing methods.
1.1 Motivating Example
In this section, we give an example illustrating the limitations of frequency-based objectives and deterministic strategies, and we show how randomization and recurrent reachability objectives help to overcome these problems.
In robotic patrolling, some vertices in the terrain graph are declared as targets, and the robot aims to discover possible intrusions at the targets. One standard measure for the protection achieved by a given moving strategy is the maximal average idleness of a target [Huang et al. 2019, Almeida et al. 2004, Portugal and Rocha 2011]. In the language of Markov chains, this corresponds to the maximal renewal time of a target, i.e., $\max_{\tau \in T} 1/f_\tau$, where $T$ is the set of all targets and $f_\tau$ is the frequency of visits to $\tau$ (recall that $1/f_\tau$ is the expected time of revisiting $\tau$).
Existing works about minimizing idleness aim at constructing a deterministic moving strategy, i.e., a cycle in the underlying graph [Huang et al. 2019, Almeida et al. 2004, Portugal and Rocha 2011]. The next example shows that using randomized strategies brings additional benefits that have not been exploited so far.
Example 1.
Consider the graph of Fig. 1a with two targets where traversing every edge takes one time unit. Let be a deterministic strategy alternately visiting and , see Fig. 1b. Then, both and are revisited in time units, and the maximal renewal time is .
At first glance, it seems the robot cannot do any better. However, consider the randomized strategy of Fig. 1c. When the robot comes to from , it returns to with probability . With the remaining probability , it continues to . A symmetric decision is taken when the robot comes to from .
As in , the frequency of visits to approaches , i.e., the renewal time of approaches . However, pushing close to results in a strategy where the robot needs a very long time to move from to (and vice versa). Such a strategy is clearly not appropriate for surveillance purposes. So, we may refine the objective and minimize the maximum of renewal times and the expected time of visiting one target from the other (note that this recurrent reachability objective is not frequency-based). A simple computation reveals that the optimal choice is then setting , yielding the maximum . This strategy does not have the above defect and outperforms the deterministic strategy .
Another way of eliminating the “defect” of for is to control the variance of the renewal time of , , which approaches as . Using recurrent reachability objectives, this can be expressed as, e.g., minimizing a weighted sum of the maximal renewal time and the maximal variance of renewal time. Thus, one may trade the value of renewal time with its stochastic stability.
1.2 Related Work
The finite-horizon path planning problem, which involves finding a feasible path between two given positions, is one of the most researched subjects in mobile robotics (see, e.g., [Choset 2005, LaValle 2006]). Recent technological advances motivate the study of infinite-horizon path planning problems where the robot performs an uninterrupted task such as persistent data-gathering [Smith et al. 2011] or patrolling [Huang et al. 2019, Almeida et al. 2004, Portugal and Rocha 2011]. The (classical) vehicle routing problem and the generalized traveling salesman problem [Toth and Vigo 2001] can also be seen as infinite-horizon path planning problems. The constructed strategies were originally meant for humans (drivers, police squads, etc.) and hence they are deterministic. The only exception is adversarial patrolling based on Stackelberg equilibria [Sinha et al. 2018, Yin et al. 2010] where randomization was soon identified as a crucial tool for decreasing the patroller’s predictability.
The existing objectives studied in infinite-horizon path planning problems are mostly based on long-run average quantities related to vertex frequencies or cycle lengths, such as mean payoff or renewal time (see, e.g., [Puterman 1994]). General-purpose specification languages for infinite-horizon objectives are mostly based on linear temporal logics (see, e.g., [Patrizi et al. 2011, Wolff et al. 2014, Ulusoy et al. 2014, Bhatia et al. 2010]). A formula of such a logic specifies desirable properties of the constructed path, and may include additional mechanisms for identifying a subset of optimal paths. The optimality criteria are typically based on distances between vertices along a path, and the constructed strategies are deterministic.
To the best of our knowledge, the recurrent reachability objective functions represent the first general-purpose specification language that makes it possible to utilize the benefits of randomized strategies and even specify tradeoffs between performance and stochastic stability. Even in a broader context of stochastic programming, the existing works about balancing quantitative features of probabilistic strategies have so far been limited to some variants of mean payoff and limit frequencies [Brázdil et al. 2017]. These results are not applicable in our setting due to a different technical setup. Furthermore, our specification language is not limited to frequency-based objectives.
2 The Model
In the rest of this paper, we use $\mathbb{N}$ and $\mathbb{N}_+$ to denote the sets of non-negative and positive integers. The set of all probability distributions over a finite set $A$ is denoted by $\mathit{Dist}(A)$. A distribution $\nu$ over $A$ is positive if $\nu(a) > 0$ for all $a \in A$, and Dirac if $\nu(a) = 1$ for some $a \in A$.
A finite-state Markov chain is a pair $M = (S, P)$ where $S$ is a finite set of states and $P : S \times S \to [0,1]$ is a stochastic matrix where the sum of every row is equal to $1$. A bottom strongly connected component (BSCC) of $M$ is a maximal $B \subseteq S$ such that for all $s, t \in B$ we have that $s$ can reach $t$ with positive probability (i.e., $P^m(s,t) > 0$ for some $m \in \mathbb{N}$), and every $t$ reachable from a state of $B$ belongs to $B$.
We assume familiarity with basic results of ergodic theory and calculus that are recalled at appropriate places.
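To illustrate the definition of a BSCC given above, the following minimal Python sketch enumerates the BSCCs of a small chain. The function names and the matrix `example` are hypothetical and are not taken from the authors' implementation; states are identified with row indices, and only transitions with positive probability count as edges.

```python
def reachable(matrix, start):
    """States reachable from `start` in one or more positive-probability steps."""
    seen, stack = set(), [start]
    while stack:
        s = stack.pop()
        for t, p in enumerate(matrix[s]):
            if p > 0 and t not in seen:
                seen.add(t)
                stack.append(t)
    return seen


def bsccs(matrix):
    """Return the set of BSCCs (as frozensets of state indices)."""
    reach = [reachable(matrix, s) for s in range(len(matrix))]
    found = set()
    for s in range(len(matrix)):
        # s lies in a BSCC iff every state reachable from s can reach s back;
        # the BSCC is then exactly the set of states reachable from s.
        if all(s in reach[t] for t in reach[s]):
            found.add(frozenset(reach[s]))
    return found


# Hypothetical chain: state 0 is transient, states 1 and 2 form the only BSCC.
example = [
    [0.0, 1.0, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 1.0, 0.0],
]
components = bsccs(example)
```

The quadratic reachability computation is chosen for readability; for large chains a linear-time strongly connected components algorithm would be the usual choice.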
2.1 Terrain model
The terrain is modeled as a finite directed graph where the vertices of correspond to robot’s positions, the edges are possible moves, and specifies the traversal time of an edge. Later, we adjoin additional parameters to edges and vertices modeling the costs, importance, etc.
We require that is strongly connected, i.e., for all there is a path from to . We write instead of .
2.2 Moving strategy
Let us fix a graph , and let be a non-empty set of memory states. Intuitively, memory states are used to “remember” some information about the history of visited vertices. Since infinite memory is not implementable, from now on we restrict ourselves to finite-memory strategies.
A robot’s configuration is determined by the currently visited vertex and the current memory state . We use to denote the set of all configurations. A configuration of the form is written as .
When the robot visits a vertex, the next move is chosen randomly according to the current configuration. Formally, a moving strategy for with memory is a function such that only if . If is a Dirac distribution for every , we say that is deterministic.
Every moving strategy determines a Markov chain where is the set of states and . The edges of are defined by iff . Note that some edges may have zero probability, i.e., , but the BSCCs of are determined by the edges with positive probability (see the above definition of Markov chain). The traversal time of an edge is the same as in , i.e., equal to .
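The construction of the Markov chain from a moving strategy can be sketched as follows. This is a minimal illustration under assumptions: the three-vertex line graph, the memory states `"A"` and `"C"`, and all function names are hypothetical. A configuration is a pair (vertex, memory state), and a strategy maps each configuration to a distribution over successor configurations.

```python
def strategy_chain(edges, sigma, tol=1e-9):
    """Return (states, matrix) of the Markov chain induced by strategy `sigma`."""
    states = sorted(sigma)
    index = {c: i for i, c in enumerate(states)}
    matrix = [[0.0] * len(states) for _ in states]
    for (v, m), dist in sigma.items():
        if abs(sum(dist.values()) - 1.0) > tol:
            raise ValueError(f"not a distribution at {(v, m)}")
        for (u, m2), p in dist.items():
            # Positive probability is allowed only along edges of the graph.
            if p > 0 and (v, u) not in edges:
                raise ValueError(f"{(v, u)} is not an edge")
            matrix[index[(v, m)]][index[(u, m2)]] = p
    return states, matrix


def is_deterministic(sigma):
    """A strategy is deterministic if every distribution is Dirac."""
    return all(max(dist.values()) == 1.0 for dist in sigma.values())


# Hypothetical line graph a - b - c; the memory stores the last endpoint visited.
edges = {("a", "b"), ("b", "a"), ("b", "c"), ("c", "b")}
sigma = {
    ("a", "A"): {("b", "A"): 1.0},
    ("a", "C"): {("b", "A"): 1.0},
    ("b", "A"): {("c", "C"): 1.0},  # came from a, continue to c
    ("b", "C"): {("a", "A"): 1.0},  # came from c, continue to a
    ("c", "A"): {("b", "C"): 1.0},
    ("c", "C"): {("b", "C"): 1.0},
}
states, matrix = strategy_chain(edges, sigma)
```

In this example the memory is essential: a deterministic strategy with a single memory state would always leave `b` in the same direction and could not alternate between `a` and `c`.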
2.3 Recurrent reachability objectives
For the rest of this section, we fix a graph and a finite set of memory states.
2.3.1 Atomic expressions
We start by introducing basic expressions used to construct recurrent reachability objectives.
A walk in is an infinite sequence such that for all . Let . We say that hits in time if , where is the least index such that . If there is no such , we put . For all and , we use to denote the expected hitting time of by a walk initiated in (if , then ), and to denote the expected square of the hitting time of by a walk initiated in . Furthermore, for every walk and every edge , we define the frequency of along as the limit percentage of time spent by executing along , i.e.,
(1) 
where is the number of occurrences of in the prefix . It follows from basic results of Markov chain theory that the above limit is defined for almost all walks (i.e., with probability one), and almost all walks that hit the same BSCC of have the same .
2.3.2 Syntax
Recurrent reachability objective functions are closed-form expressions over numerical constants and atomic expressions of the form , , , and obtained by using

addition, multiplication, min, and max that may take arbitrarily many arguments;

division, where the denominator is an expression over numerical constants and atomic expressions of the form , built using addition and multiplication;

other differentiable functions such as square root that are defined for all nonnegative arguments.
When defining the arguments of sums, products, min, and max, we may refer to special sets and consisting of active configurations and edges, respectively, whose semantics is defined in the next paragraph.
A recurrent reachability optimization problem is a problem of the form or , where is a recurrent reachability objective function.
2.3.3 Ergodicity
Infinite-horizon objective functions are typically independent of finite prefixes of runs and the initial configuration can be chosen freely. Hence, the objective value for actually depends only on the objective values attained in the “best” BSCC of . From now on, we only consider recurrent reachability objective functions satisfying this condition (this is equivalent to requiring that can be optimized by an ergodic strategy where is strongly connected).
2.3.4 Evaluation
Let be a moving strategy for with memory , and let be a BSCC of . For a given recurrent reachability function , we use to denote the value of in , defined by structural induction as follows:

atomic expressions , , , and are evaluated in the way described above ( is the probability of assigned by ). In particular, note that where can be positive only if .

The set of active configurations is equal to , and the set of active edges consists of all such that and .

The addition, multiplication, min and max are evaluated in the expected way. If the set of arguments is parametrized by or , it is constructed for the set of configurations and edges defined in the previous item.
In some cases, can be undefined (see Section 4).
Finally, we define the value of the objective or as the minimal or the maximal such that is a BSCC of where is defined (if there is no such , then the value is undefined).
3 Examples
To demonstrate the versatility of recurrent reachability objectives, we present selected examples of concrete objective functions. The list is by no means exhaustive; we show how to capture some of the existing infinite-horizon optimization criteria and how to extend them to control various forms of stochastic instability caused by randomization. Let us emphasize that our aim is to illustrate the expressive power of our optimization framework, not to design the most appropriate objectives capturing the discussed phenomena.
For this section, we fix a graph and a finite set of memory states. For a given subset , we use to denote the subset of configurations.
For simplicity, in the first two subsections we assume that the traversal time of every edge is $1$.
3.0.1 Mean Payoff
As a warm-up, consider the concept of mean payoff, i.e., the long-run average payoff per visited vertex. Let be a payoff function. The goal is to minimize the mean payoff for . This is formalized as minimize MP, where
MP  (2)  
(3) 
Here, is the set of all outgoing edges of . Hence, is the limit frequency of visits to .
Minimizing mean payoff is computationally easy. However, the costs of vertices visited by an optimal strategy can significantly differ from the mean payoff (i.e., the costs are not distributed sufficiently “smoothly” along a walk). If “smoothness” is important, it can be enforced, e.g., by the objective , where is a suitable weight and
(4) 
is the standard deviation of the costs per visited vertex. Similarly, we can formalize objectives involving multiple cost/payoff functions and enforce some form of “combined smoothness” if appropriate.
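The mean payoff and the standard deviation of the costs per visited vertex can be illustrated numerically. The sketch below assumes that the limit frequencies of visits are already known and uses the textbook frequency-weighted mean and standard deviation; it is also an assumption that the combined objective has the form "mean payoff plus weight times deviation". All names and numbers are hypothetical.

```python
import math


def mean_payoff(freq, payoff):
    """Frequency-weighted average payoff per visited vertex."""
    return sum(freq[v] * payoff[v] for v in freq)


def payoff_deviation(freq, payoff):
    """Standard deviation of the payoff per visited vertex."""
    mp = mean_payoff(freq, payoff)
    return math.sqrt(sum(freq[v] * (payoff[v] - mp) ** 2 for v in freq))


def smooth_objective(freq, payoff, weight):
    """Assumed combination: mean payoff plus weighted standard deviation."""
    return mean_payoff(freq, payoff) + weight * payoff_deviation(freq, payoff)


# Hypothetical cycle visiting two vertices equally often.
freq = {"u": 0.5, "v": 0.5}
payoff = {"u": 2.0, "v": 6.0}
mp = mean_payoff(freq, payoff)              # 4.0
dev = payoff_deviation(freq, payoff)        # 2.0
value = smooth_objective(freq, payoff, 0.5)  # 5.0
```

A larger weight makes cycles with uneven costs less attractive even when their mean payoff is lower.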
3.0.2 Renewal Time
Let be a set of targets, and consider the problem of minimizing the maximal renewal time of a target. First, for every , let
(5) 
where is defined by (3). Hence, is the percentage of visits to among all visits to configurations of , assuming that the denominator is positive. Then, the renewal time of , denoted by is given by
(6) 
and the expected square of the time needed to revisit , denoted by , is expressible as
(7) 
Minimizing the maximal renewal time is then formalized as . However, this simple objective does not take into account possible deviations of the renewal time from its mean. This can be captured, e.g., by expressing the standard deviation as
(8) 
and using for a suitable weight .
3.0.3 Patrolling
Let be a set of targets, and a function assigning to every target its importance. The damage caused by an attack at a target (such as setting a fire) is given as where is the time to discover the attack by the robot. The patrolling objective is to minimize the expected damage.
Patrolling problems are studied for adversarial and non-adversarial environments. In the first case, there is an active Attacker knowing the robot’s strategy and observing its moves. For the Attacker, an appropriate moment to initiate an attack is when the robot leaves a vertex and starts walking along some edge . The objective is to minimize the expected damage over all possible attacks, i.e.,
(9) 
Note that (9) conveniently uses the set of active edges to restrict the only to “relevant” edges used by the strategy.
In non-adversarial patrolling, the attacks are performed by “nature”. Let be a distribution over specifying the attack chance (such as the probability of spontaneous ignition). Then, minimizing the expected damage is expressible as minimize EDam, where the function EDam is defined as
(10) 
Note that if is attacked when the robot walks along , then the robot is in the middle of this edge on average. Hence, the average time to reach is .
Again, we can express the variances/deviations of the relevant random variables (incl. the variance of the expected damage of (10)). These expressions are relatively long, but their construction is straightforward.
4 The Algorithm
The algorithm is based on gradient descent: It starts with a random initial strategy and then repeatedly evaluates the objective function and modifies the current strategy in the direction of the gradient (or for minimization). After a number of iterations, the strategy attaining the best objective value is returned.
Let be a moving strategy for a graph . We show how to evaluate and differentiate for a given BSCC of . The atomic expressions are obtained as unique solutions of systems of linear equations. More concretely, for a target set , active configurations and edges , we have that and are undefined for . If and , then . If , then for every we fix a variable and an equation
Then, the tuple of all , where , is the unique solution of this system.
Similarly, if , then the tuple of all , where , is the unique solution of the system where to each , we assign a variable and an equation
where .
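The two linear systems for the expected hitting time and the expected square of the hitting time can be sketched in Python as follows. Since the displayed equations are not reproduced here, the sketch uses the standard first-step equations, which may differ in notation from the systems above. It assumes that the hitting time is measured from the first move, so the value for a target configuration is its return time, and that every state reaches the target set with probability one. The chain, the traversal times, and all names are hypothetical; exact rational arithmetic is used only to keep the example verifiable.

```python
from fractions import Fraction


def solve_linear(a, b):
    """Solve a*x = b exactly by Gauss-Jordan elimination (a must be nonsingular)."""
    n = len(a)
    rows = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        lead = rows[col][col]
        rows[col] = [v / lead for v in rows[col]]
        for r in range(n):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [x - factor * y for x, y in zip(rows[r], rows[col])]
    return [row[n] for row in rows]


def hitting_moments(prob, time, targets):
    """Expected hitting time and expected squared hitting time of `targets`."""
    states = sorted(prob)
    idx = {c: i for i, c in enumerate(states)}
    n = len(states)
    a = [[Fraction(0)] * n for _ in range(n)]
    b1 = [Fraction(0)] * n
    for c in states:
        a[idx[c]][idx[c]] += 1
        for d, p in prob[c].items():
            # E[c] = sum_d p * (t + E[d] if d is not a target else t)
            b1[idx[c]] += Fraction(p) * time[c][d]
            if d not in targets:
                a[idx[c]][idx[d]] -= Fraction(p)
    first = solve_linear(a, b1)
    b2 = [Fraction(0)] * n
    for c in states:
        for d, p in prob[c].items():
            # E[(t + T)^2] = t^2 + 2*t*E[T] + E[T^2]
            t = Fraction(time[c][d])
            b2[idx[c]] += Fraction(p) * t * t
            if d not in targets:
                b2[idx[c]] += 2 * Fraction(p) * t * first[idx[d]]
    second = solve_linear(a, b2)
    return ({c: first[idx[c]] for c in states},
            {c: second[idx[c]] for c in states})


# Hypothetical chain: from A stay with probability 1/2 or move to the target T.
prob = {"A": {"A": Fraction(1, 2), "T": Fraction(1, 2)}, "T": {"A": Fraction(1)}}
time = {"A": {"A": 1, "T": 1}, "T": {"A": 1}}
expected, expected_sq = hitting_moments(prob, time, {"T"})
variance_from_a = expected_sq["A"] - expected["A"] ** 2
```

Both systems share the same coefficient matrix, so the second moment costs only one additional solve. The variance follows from the two moments as the expected square minus the square of the expectation.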
To compute the edge frequencies, we fix a variable for every , and an equation
This system, together with the equation , has a unique solution where is the frequency of visits to . For each edge , we set and we get
For the other edges , we have that . The value of is then obtained from the atomic expressions in the straightforward way (when some atomic expression used in is undefined or the denominator of some fraction of is zero in , then is undefined).
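The computation of edge frequencies can be illustrated as follows. Instead of solving the linear system directly, this sketch approximates the frequencies of visits by an averaged fixed-point iteration, which also converges for periodic chains; this is a swapped-in technique chosen for brevity, not the method described above. Weighting each edge by its traversal time and normalizing follows the reading of the edge frequency as the limit percentage of time spent on the edge. The chain and all names are hypothetical, and the chain is assumed to be strongly connected.

```python
def edge_frequencies(prob, time, iterations=2000):
    """Approximate visit frequencies and time-weighted edge frequencies."""
    states = sorted(prob)
    x = {c: 1.0 / len(states) for c in states}
    for _ in range(iterations):
        # Averaged update x <- (x + x P) / 2 has the same fixed point as x = x P.
        nxt = {c: 0.5 * x[c] for c in states}
        for c in states:
            for d, p in prob[c].items():
                nxt[d] += 0.5 * x[c] * p
        x = nxt
    # Time spent on an edge is proportional to (visit frequency) * (probability) * (time).
    weight = {(c, d): x[c] * p * time[c][d]
              for c in states for d, p in prob[c].items() if p > 0}
    total = sum(weight.values())
    return x, {e: w / total for e, w in weight.items()}


# Hypothetical chain in which the edge from T back to A takes two time units.
prob = {"A": {"A": 0.5, "T": 0.5}, "T": {"A": 1.0}}
time = {"A": {"A": 1, "T": 1}, "T": {"A": 2}}
visits, freq = edge_frequencies(prob, time)
```

In this example the robot visits `A` twice as often as `T`, yet it spends half of the total time on the slow edge from `T` to `A`.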
Next, we need to calculate the gradient . As objectives are, by design, allowed to use only smooth operations with and over atomic expressions, the derivatives of w.r.t. these atomic expressions are well defined almost everywhere. Solutions of systems of linear equations depend smoothly on the parameters and the derivatives of our atomic expressions w.r.t. can be calculated as solutions of other linear systems. We use the PyTorch library [Paszke et al. 2019] and its automatic differentiation in our implementation.
However, a naïve update for a step size almost never yields a probability distribution (i.e., a valid strategy). The standard approach is to produce strategies from real-valued coefficients by a Softmax function. Any update of these coefficients then leads to a well-defined strategy. The drawback of Softmax is that the resulting distributions never contain zeros (i.e., the strategies always use all edges).
To reach all possible strategies, we cut the small probabilities at a certain threshold (and normalize) by a Cutoff function. However, as edges with zero probabilities are excluded from , discontinuities in the objective may occur. For instance, in the Patrolling objective (9), the term is present for an edge if and drops to zero if .
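The interplay of Softmax and Cutoff can be sketched for a single distribution as follows. This is a minimal sketch: the function names, the threshold value, and the requirement that the threshold does not exceed the largest probability are assumptions made for the illustration.

```python
import math


def softmax(coeffs):
    """Map real-valued coefficients to a strictly positive distribution."""
    top = max(coeffs)  # subtract the maximum for numerical stability
    exps = [math.exp(v - top) for v in coeffs]
    total = sum(exps)
    return [v / total for v in exps]


def cutoff(dist, threshold):
    """Set probabilities below `threshold` to zero and renormalize."""
    kept = [p if p >= threshold else 0.0 for p in dist]
    total = sum(kept)
    if total == 0.0:
        raise ValueError("threshold removes every outcome")
    return [p / total for p in kept]


positive = softmax([0.0, -5.0, -10.0])       # all entries are positive
rounded = cutoff([0.98, 0.015, 0.005], 0.01)  # the last entry becomes zero
```

Softmax alone can only approach a Dirac distribution, whereas the subsequent Cutoff makes deterministic choices reachable.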
In other words, the objective as a function of real-valued coefficients is a smooth function on an open set with possible jumps at the boundary. In order to make the boundary values accessible by the gradient descent, we relax our discontinuous to a smooth one, say . For instance, in the Patrolling objective (9), we multiply by a factor , which interpolates the values 0 and 1 continuously. Moreover, for a more efficient gradient propagation, we replace each and with their relaxed variants as in [Klaska et al. 2018].
The final algorithm is described in Procedure 1. Strategy coefficients are initialized at random. In every step, we compute the relaxed objective value of the current strategy and update the coefficients based on the gradient using the Adam optimizer [Kingma and Ba 2015]. We also add decaying Gaussian noise to the gradient. Then, we round the strategy using Cutoff and compute its true objective value . Procedure 1 can be run repeatedly to increase the chance of producing a strategy with better value.
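The overall loop (random initialization, noisy gradient step, rounding, and keeping the best rounded strategy) can be sketched on a toy problem. This is not Procedure 1: the sketch assumes a single decision with two outcomes, so Softmax reduces to the logistic function; it replaces automatic differentiation by a finite difference and Adam by a plain gradient step; and the toy objective, the hyperparameters, and all names are hypothetical.

```python
import math
import random


def toy_objective(p_go):
    """Maximal renewal time in a hypothetical two-state chain with unit times:
    in state A the robot stays with probability 1 - p_go or moves to T,
    and from T it always returns to A."""
    if p_go == 0.0:
        return math.inf
    return max(1.0 + p_go, (1.0 + p_go) / p_go)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def round_probability(p, threshold):
    """Cutoff for the two-point distribution (p, 1 - p)."""
    if p < threshold:
        return 0.0
    if 1.0 - p < threshold:
        return 1.0
    return p


def synthesize(steps=300, lr=1.0, noise=0.1, threshold=0.01, seed=0):
    rng = random.Random(seed)
    coeff = rng.gauss(0.0, 1.0)  # random initial coefficient
    best_value, best_p = math.inf, None
    eps = 1e-6
    for step in range(steps):
        value = toy_objective(sigmoid(coeff))
        # Forward finite difference stands in for automatic differentiation.
        grad = (toy_objective(sigmoid(coeff + eps)) - value) / eps
        grad += rng.gauss(0.0, noise / (1 + step))  # decaying Gaussian noise
        coeff -= lr * grad
        # Round the strategy and evaluate its true objective value.
        rounded = round_probability(sigmoid(coeff), threshold)
        true_value = toy_objective(rounded)
        if true_value < best_value:
            best_value, best_p = true_value, rounded
    return best_value, best_p


best_value, best_p = synthesize()
```

In this toy problem the optimum lies on the boundary (always move to `T`), which the logistic parametrization can only approach; the rounding step is what makes the deterministic optimum attainable.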
The algorithm is efficient because it does not involve any computationally demanding tasks, but it does not guarantee (sub)optimality of the constructed strategies. This is unavoidable due to the hardness of some of the recurrent reachability objectives.
5 Experiments
There is no previous work setting the baseline for evaluating the quality of strategies produced by our algorithm. Therefore, we selected two representative examples where the tradeoffs between performance and stability are not easy to discover, but the functionality of the synthesized strategies can still be analyzed by hand and compared against the best achievable outcomes identified by theoretical analysis. (The code for reproducing the results is available at https://gitlab.fi.muni.cz/formela/2022ijcaioptimizationframework. See also the latest version at https://gitlab.fi.muni.cz/formela/regstar.)
5.0.1 Mean Payoff
We minimize the objective defined in (2) and (4) for the graph of Fig. 2. The graph contains three cycles with the corresponding MP and DMP as in the table of Fig. 2.
For almost every , the objective’s minimizer is precisely one of the three cycles. More precisely, it is the cycle 7057 for all , the cycle 7657 for all , and the cycle 767 for all , where and . Ideally, our algorithm should find the corresponding cycle for every .
The algorithm outcomes are shown in Fig. 3. For every , we perform 100 trials (i.e., construct 100 strategies with one memory state), and report the corresponding value. The value of the best strategy achieved for a given is represented by a “circle”; the other “crosses” represent the values of non-optimal strategies. Observe that the circles agree with the ideal outcomes.
5.0.2 Renewal time
Consider the graph of Fig. 4a. We minimize the objective defined in (6) and (8). The outcomes of our algorithm for two memory states are shown in Fig. 5. For each ranging from to , we run 100 trials and report the expected renewal time and the corresponding standard deviation of the resulting 100 strategies; the best values are highlighted by solid dots. The values of the obtained strategies are concentrated near the best one, showing the stability of the optimization.
For smaller , the constructed strategies have a smaller renewal time but large deviation, and they work as shown in Fig. 4b. That is, they tend to revisit from and from . As , the maximal renewal time approaches and the standard deviation approaches .
For larger , where the standard deviation is punished more severely, the algorithm tends to decrease , . Note that for , the strategy of Fig. 4b becomes deterministic with renewal time and zero deviation. However, this point is not reached by the curve of Fig. 5. The reason is that for , the algorithm discovers and strongly prefers a completely different strategy, which goes through instead (see Fig. 4c), with maximal renewal time and zero deviation (this is the best strategy with zero deviation).
6 Conclusions
The obtained optimization framework for recurrent reachability objectives is applicable not only to graphs, but also to Markov decision processes without any additional effort. Here, the randomization introduced by the strategies is combined with the “inherent randomization” of the model. In this setting, stochastic instability cannot be removed completely, and a precise understanding of the principal limits of reducing it is a challenging direction for future research.
Acknowledgements
This work is supported by the Czech Science Foundation, Grant No. 2124711S, and by Operational Programme Research, Development and Education – Project Postdoc2MUNI No. CZ.02.2.69/0.0/0.0/18_053/0016952.
References

[Almeida et al. 2004] A. Almeida, G. Ramalho, H. Santana, P. Tedesco, T. Menezes, V. Corruble, and Y. Chevaleyre. Recent advances on multi-agent patrolling. Advances in Artificial Intelligence – SBIA, 3171:474–483, 2004.
[Bhatia et al. 2010] A. Bhatia, L.E. Kavraki, and M.Y. Vardi. Motion planning with hybrid dynamics and temporal goals. In Proceedings of 49th IEEE Conference on Decision and Control (CDC 2010), pages 1108–1115. IEEE Computer Society Press, 2010.
[Brázdil et al. 2017] T. Brázdil, K. Chatterjee, V. Forejt, and A. Kučera. Trading performance for stability in Markov decision processes. Journal of Computer and System Sciences, 84:144–170, 2017.
[Choset 2005] H.M. Choset. Principles of Robot Motion: Theory, Algorithms, and Implementation. MIT Press, 2005.
[Huang et al. 2019] L. Huang, M. Zhou, K. Hao, and E. Hou. A survey of multi-robot regular and adversarial patrolling. IEEE/CAA Journal of Automatica Sinica, 6(4):894–903, 2019.
[Kingma and Ba 2015] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proceedings of ICLR 2015, 2015.
[Klaska et al. 2018] David Klaska, Antonín Kucera, Tomás Lamser, and Vojtech Rehák. Automatic synthesis of efficient regular strategies in adversarial patrolling games. In Elisabeth André, Sven Koenig, Mehdi Dastani, and Gita Sukthankar, editors, Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS 2018, Stockholm, Sweden, July 10-15, 2018, pages 659–666. International Foundation for Autonomous Agents and Multiagent Systems Richland, SC, USA / ACM, 2018.
[LaValle 2006] S.M. LaValle. Planning Algorithms. Cambridge University Press, 2006.
[Paszke et al. 2019] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.
[Patrizi et al. 2011] F. Patrizi, N. Lipovetzky, G. De Giacomo, and H. Geffner. Computing infinite plans for LTL goals using a classical planner. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI 2011), pages 2003–2008, 2011.
[Portugal and Rocha 2011] D. Portugal and R. Rocha. A survey on multi-robot patrolling algorithms. Technological Innovation for Sustainability, 349:139–146, 2011.
[Puterman 1994] M.L. Puterman. Markov Decision Processes. Wiley, 1994.
[Sinha et al. 2018] A. Sinha, F. Fang, B. An, C. Kiekintveld, and M. Tambe. Stackelberg security games: Looking beyond a decade of success. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI 2018), pages 5494–5501, 2018.
[Smith et al. 2011] S.L. Smith, J. Tůmová, C. Belta, and D. Rus. Optimal path planning for surveillance with temporal-logic constraints. International Journal of Robotics Research, 30(14):1695–1708, 2011.
[Toth and Vigo 2001] P. Toth and D. Vigo. The Vehicle Routing Problem. SIAM Monographs on Discrete Mathematics and Applications. SIAM, 2001.
[Ulusoy et al. 2014] A. Ulusoy, S.L. Smith, X.C. Ding, C. Belta, and D. Rus. Optimality and robustness in multi-robot path planning with temporal logic constraints. International Journal of Robotics Research, 32(8):889–911, 2014.
[Wolff et al. 2014] E.M. Wolff, U. Topcu, and R.M. Murray. Optimization-based trajectory generation with linear temporal logic specification. In Proceedings of ICRA 2014, pages 5319–5325. IEEE Computer Society Press, 2014.
[Yin et al. 2010] Z. Yin, D. Korzhyk, C. Kiekintveld, V. Conitzer, and M. Tambe. Stackelberg vs. Nash in security games: Interchangeability, equivalence, and uniqueness. In Proceedings of AAMAS 2010, pages 1139–1146, 2010.