# On the Emergence of Shortest Paths by Reinforced Random Walks

The co-evolution between network structure and functional performance is a fundamental and challenging problem whose complexity emerges from the intrinsic interdependent nature of structure and function. Within this context, we investigate the interplay between the efficiency of network navigation (i.e., path lengths) and network structure (i.e., edge weights). We propose a simple and tractable model based on iterative biased random walks where edge weights increase over time as a function of the traversed path length. Under mild assumptions, we prove that biased random walks will eventually only traverse shortest paths in their journey towards the destination. We further characterize the transient regime, proving that the probability of traversing non-shortest paths decays according to a power law. We also highlight various properties of this dynamic, such as the trade-off between exploration and convergence, and the preservation of initial network plasticity. We believe the proposed model and results can be of interest to various domains where biased random walks and decentralized navigation have been applied.


## 1 Introduction

The interplay between network structure (nodes, edges, weights) and network function (high level features enabled by the network) is a fundamental and challenging problem present in a myriad of systems ranging from biology to economics and sociology. In many complex systems network structure and network function co-evolve interdependently: while network structure constrains functional performance, the drive for functional efficiency pressures the network structure to change over time. Within this tussle, network activity (i.e., basic background processes running on the network) plays a key role in tying function and structure: on the one hand, function execution often requires network activity, while on the other hand network structure often constrains network activity.

Given the complexity of co-evolution, simple and tractable models are often used to understand and reveal interesting phenomena. In this paper, we focus on network navigation, proposing and analyzing a simple model that captures the interplay between function and structure. Our case-study embodies repetition, plasticity, randomization, valuation and memory which are key ingredients for evolution: repetition and memory allow for learning; plasticity and randomization for exploring new possibilities; valuation for comparing alternatives. Moreover, in our case-study co-evolution is enabled by a single and simple network activity process: biased random walks, where time-varying edge weights play the role of memory.

Network navigation (also known as routing) refers to the problem of finding short paths in networks and has been widely studied due to its importance in various contexts. Efficient network navigation can be achieved by running centralized or distributed algorithms. Alternatively, it can also be achieved when running simple greedy algorithms over carefully crafted network topologies. But can efficient navigation emerge without computational resources and/or specifically tailored topologies?

A key contribution of our work is to answer the above question affirmatively by means of Theorem 1, which states that under mild conditions efficient network navigation always emerges through the repetition of extremely simple network activity. More precisely, a biased random walk will eventually only take paths of minimum length, independently of network structure and initial weight assignment. Beyond its long-term behavior, we also characterize the transient regime of the system, revealing interesting properties such as the power-law decay of longer paths, and the (practical) preservation of initial plasticity on edges far from those on the shortest paths. The building block for establishing the theoretical results of this paper is the theory of Pólya urns, applied here by considering a network of urns.

We believe the proposed model and its analysis could be of interest to various domains where some form of network navigation is present and where random walks are used as the underlying network activity, such as computer networking [16, 8, 14], animal movement in biology [3, 21], and memory recovery in the brain [18, 1, 19]. Moreover, our results can enrich existing theories such as the Ant Colony Optimization (ACO) meta-heuristic [6, 5], Reinforcement Learning (RL) theory [22], and Edge Reinforced Random Walks (ERRW) theory [4, 17] (see related work in Section 3).

## 2 Model

We consider an arbitrary (fixed) network $G = (V, E)$, where $V$ is a set of vertices and $E$ is a set of directed edges among the vertices. We associate a weight $w_{i,j}$ (a positive real value) to every directed edge $(i,j) \in E$. Edge weights provide a convenient and flexible abstraction for structure, especially when considering evolution. Finally, a pair of fixed nodes $s, d \in V$ are chosen to be the source and destination, respectively. But how to go from $s$ to $d$?

We adopt a very simple network activity model to carry out the function of navigation: weighted random walks (WRW). Specifically, a sequence of random walks, indexed by $n = 1, 2, \ldots$, is executed on the network one after the other. Each WRW starts at $s$ and steps from node to node until it hits $d$. At each visited node, the WRW randomly follows an outgoing edge with probability proportional to its edge weight. We assume that weights on edges remain constant during the execution of a single WRW, and that decisions taken at different nodes are independent from each other.

Once the WRW reaches the destination and stops, edge weights are updated, thus impacting the behavior of the next WRW in the sequence. In particular, edges on the path followed by the WRW are rewarded (reinforced) by increasing their weights by a positive amount which depends on the length of the path taken (expressed in number of hops). Let $f(L)$ be some positive function of the path length $L$, hereinafter called the reward function.

We consider two different ways in which edges are reinforced:

• single-reward model: each edge belonging to the path followed by the WRW is rewarded once, according to function $f$.

• multiple-reward model: each edge belonging to the path followed by the WRW is rewarded according to function $f$ for each time the edge was traversed.

Throughout the paper, we will interpret $n$, the number of random walks that have gone from $s$ to $d$, as a discrete time step. Thus, by co-evolution of the system we actually mean what happens to the network structure (i.e., weights) and navigation (i.e., path lengths) as $n \to \infty$.

Let $w_{i,j}[n]$ be the weight on edge $(i,j)$ at time $n$ (right after the execution of the $n$-th WRW but before the $(n+1)$-th WRW starts). Let $P_n$ denote the sequence of edges (i.e., the path) traversed by the $n$-th WRW and $L_n$ the path length (in number of hops). In this paper we assume that the cost to traverse any edge is equal to 1, but the results can be immediately generalized to the case in which a generic (positive) cost is associated with each edge.

After reaching the destination, the weight of any distinct edge $(i,j)$ in $P_n$ is updated according to the rule:

$$w_{i,j}[n] = w_{i,j}[n-1] + u_{i,j}(P_n) \cdot f(L_n) \tag{1}$$

where $u_{i,j}(P_n) = 1$ under the single-reward model, whereas $u_{i,j}(P_n)$ equals the number of times edge $(i,j)$ appears in $P_n$ under the multiple-reward model. We also allow for the event that the WRW does not reach the destination because it ‘gets lost’ in a part of the network from which $d$ is no longer reachable. In this case, we assume that no edge is updated by the WRW that fails to reach $d$.
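The walk-then-reinforce loop described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the dictionary-based graph representation, the function names `weighted_walk`, `reinforce` and `simulate`, the `max_hops` cap used to give up on lost walks, and the example reward $1/L$ in the usage lines are all assumptions made here.

```python
import random
from collections import Counter

def weighted_walk(weights, s, d, rng, max_hops=10_000):
    """Run one WRW from s; return the list of traversed edges, or None if d is not reached.

    `weights` maps node -> {successor: positive weight}. The `max_hops` cap is an
    assumption of this sketch: it is how we decide that a walk 'got lost'.
    """
    path, node = [], s
    while node != d:
        out = weights.get(node)
        if not out or len(path) >= max_hops:
            return None  # dead end or lost walk: no edge will be rewarded
        successors = list(out)
        # Follow an outgoing edge with probability proportional to its weight.
        nxt = rng.choices(successors, weights=[out[j] for j in successors])[0]
        path.append((node, nxt))
        node = nxt
    return path

def reinforce(weights, path, f, multiple=False):
    """Apply the update rule (1) in place for a walk that followed `path`."""
    reward = f(len(path))  # f(L_n), with L_n counted in hops
    for (i, j), count in Counter(path).items():
        u = count if multiple else 1  # u_{i,j}(P_n)
        weights[i][j] += u * reward

def simulate(weights, s, d, f, n_walks, multiple=False, seed=0):
    """Run n_walks WRWs one after the other; return the lengths of the successful ones."""
    rng = random.Random(seed)
    lengths = []
    for _ in range(n_walks):
        path = weighted_walk(weights, s, d, rng)
        if path is not None:
            reinforce(weights, path, f, multiple)
            lengths.append(len(path))
    return lengths

if __name__ == "__main__":
    # Hypothetical toy network: a direct edge s->d and a two-hop detour s->a->d.
    g = {"s": {"a": 1.0, "d": 1.0}, "a": {"d": 1.0}}
    lengths = simulate(g, "s", "d", lambda L: 1.0 / L, n_walks=5000)
    print("mean length over the last 500 walks:", sum(lengths[-500:]) / 500)
    print("final weights:", g)
```

Weights are kept fixed during a single walk and updated only once the destination is reached, mirroring the assumption stated in the model. Counting edge occurrences with `Counter` makes the single-reward and multiple-reward variants differ only in the multiplier `u`.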

Note that our model has the desirable ingredients for co-evolution: the edge set and initial weights provide plasticity and the WRW provides randomization, which allows for exploring alternative paths; edge weights provide memory and the sequence of WRWs provides repetition, which enables learning; the path length taken by a WRW provides valuation, which allows for comparing alternative paths. Moreover, note that functional performance induces structural changes through network activity, as navigation (the traversed path) changes edge weights, while network structure constrains function, as edge weights influence observed path lengths. Thus, our model captures the essence of co-evolution. But will efficient navigation emerge? In particular, which paths are taken as $n$ increases?

## 3 Related work

The problem of finding shortest paths in networks is, of course, a well understood problem in graph theory and computer science, for which efficient algorithms are available, both centralized (e.g., Dijkstra) and distributed (e.g., Bellman-Ford). Our approach falls in the second category (distributed), as it does not require knowledge of the topology. However, we stress that our goal is not to propose yet another way to compute shortest paths in networks (actually, the convergence of our process is slower than that of Bellman-Ford), but to show that shortest paths can naturally emerge from the repetition of a simple and oblivious network activity which does not require computational or memory resources on the nodes. As such, our model is more tailored to biological systems than to technological networks.

The celebrated work of Kleinberg [13] was probably the first to show that efficient navigation is indeed feasible through a simple greedy strategy based solely on local information, but under the stringent assumption that the network exhibits a very particular structure. Greedy algorithms can also lead to efficient network navigation under distributed hash tables (DHTs), but again this requires the network to exhibit a very particular topology [10].

The idea of reinforcing edges along paths followed by random walks is surely reminiscent of Ant Colony Optimization (ACO), a biologically-inspired meta-heuristic for exploring the solution space of complex optimization problems which can be reduced to finding good paths through graphs [6, 5]. Although some versions of ACO can be proved to converge to the global optimum, their analysis turns out to be complicated and mathematically non-rigorous, especially due to pheromone evaporation (i.e., weights on edges decrease in the absence of reinforcement). Moreover, like most meta-heuristics, it is very difficult to estimate the theoretical speed of convergence. In contrast to ACO, our model is simpler and has the modest goal of revealing shortest paths in a given network, instead of exploring a solution space. Moreover, we do not introduce any evaporation, and we exploit totally different techniques (the theory of Pólya urns) to establish our results, including the transient behavior (convergence) of the system.

In Reinforcement Learning (RL), the problem of finding an optimal policy through a random environment has also been tackled using Monte Carlo methods that reinforce actions based on earned rewards, such as the $\varepsilon$-soft policy algorithm [22]. Under a problem formulation with no terminal states and expected discounted rewards, it can be rigorously shown that an iterative algorithm converges to the optimal policy [23]. However, in general and more applicable scenarios, the problem of convergence to optimal policies is still an open question, with most algorithms settling for an approximate solution. Although lacking the notion of action set, our model is related to RL in the sense that it aims at finding paths accumulating the minimum cost, through an unknown environment, using a Monte Carlo method. Our convergence results (convergence in probability) and the techniques used in the analysis (Pólya urns) are fundamentally different from what is commonly found in RL theory, and could thus be useful to tackle problems in this area.

Edge Reinforced Random Walks (ERRW) is a mathematical modeling framework consisting of a weighted graph where weights evolve over time according to steps taken by a random walker [4, 17]. In ERRW, a single random walk moves around without having any destination and without being restarted. Moreover, an edge weight is updated immediately after traversal of the edge, according to functions based on local information. Mathematicians have studied theoretical aspects of ERRW such as the convergence of the network structure (relative weights) and the recurrence behavior of the walker (whether it will continue to visit every node in the long run, or get trapped in one part). Similarly to our model, a key ingredient in the analysis of ERRW is the Pólya urn model, especially on directed networks. In contrast to our model, the ERRW model was not designed to perform any particular function and thus does not have an objective. Our model is substantially different from traditional ERRW, and we believe it could suggest a concrete application as well as new directions to theoreticians working on ERRW.

Animal movement is a widely studied topic in biology to which probabilistic models have been applied, including random walk based models [3, 21]. In particular, in the context of food foraging, variations of ERRW models have been used to capture how animals search and traverse paths to food sources. A key difference in such variations is a direction vector, a piece of information external to the network (but available on all nodes) that provides hints to the random walk. Such models have been used to show the emergence of relatively short paths to food sources, as empirically observed with real (monitored) animals. In contrast, we show that shortest paths (and not just short ones) can emerge even when external information is not available.

Understanding how neurons in the brain connect and fire to yield higher level functions like memory and speech is a fundamental problem that has recently received much attention and funding [20, 9]. Within this context, random walk based models have been proposed and applied [19, 1] along with models where repeated network activity modifies the network structure [18]. In particular, the latter work considers a time-varying weighted network model under a more complex rule (than random walks) for firing neurons to show that the network structure can arrange itself to perform better function. We believe our work can provide building blocks in this direction since our simple model for a time-varying (weighted) network also self-organizes to find optimal paths.

Biased random walks have also been applied to a variety of computer networking architectures [16, 8, 14], with the goal of designing self-organizing systems to locate, store, replicate and manage data in time-varying scenarios. We believe our model and findings could be of interest in this area as well.

## 4 Main finding

Let $L_{\min}$ be the length of the shortest path in graph $G$ connecting source node $s$ to destination node $d$. Denote by $P_n$ the path taken by the $n$-th WRW, and by $P$ an arbitrary path from $s$ to $d$, of length $L_P$.

###### Theorem 1.

Given a weighted directed graph $G$, a fixed source-destination pair $s$-$d$ (such that $d$ is reachable from $s$), an initial weight assignment (such that all initial weights are positive), consider an arbitrary path $P$ from $s$ to $d$. Under both the single-reward model and the multiple-reward model, provided that the reward function $f$ is a strictly decreasing function of the path length, as the number of random walks performed on the graph tends to infinity, we have:

$$\lim_{n \to \infty} \mathbb{P}\{P_n = P\} = \begin{cases} c(P), & \text{if } L_P = L_{\min} \\ 0, & \text{if } L_P > L_{\min} \end{cases}$$

where $c(P)$ is a random variable taking values in $(0, 1]$, that depends on the specific shortest path $P$.

The above theorem essentially says that all shortest paths are taken with non-vanishing probability, while all non-shortest paths are taken with vanishing probability, as $n \to \infty$. Note however that the probability that a specific shortest path is taken is a random variable, in the sense that it depends on the ‘system run’ (system sample path).

###### Remark 1.

The asymptotic property stated in Theorem 1 is very robust, as it holds for any directed graph, any strictly decreasing function $f$, and any (valid) initial weights on the edges. Note instead that the (asymptotic) distribution of $c(P)$, for a given shortest path $P$, as well as the convergence rate to it, depends strongly on the update function $f$, on the graph structure, and on the initial conditions on the edges.

###### Remark 2.

We will see in the proof of Theorem 1 that the assumption of having a strictly decreasing function $f$ can be partially relaxed, allowing the reward function to be non-increasing for $L > L_{\min}$.

## 5 Preliminaries

### 5.1 Definitions

The following definitions for nodes and edges play a central role in our analysis.

###### Definition 1 (decision point).

A decision point is a node $i$, reachable from $s$, that has more than one outgoing edge that can reach $d$.

###### Remark 3.

Clearly, we can restrict our attention to nodes that are decision points, since all other nodes are either never reached by random walks originating in $s$, have zero or one outgoing edge (having no influence on the random walk behavior), or their outgoing edges are never reinforced since the destination cannot be reached from them.

###### Definition 2 (α-edge and β-edge).

An outgoing edge of decision point $i$ is called an $\alpha$-edge if it belongs to some shortest path from $i$ to $d$, whereas it is called a $\beta$-edge if it does not belong to any shortest path from $i$ to $d$.

Note that every outgoing edge of a decision point is either an $\alpha$-edge or a $\beta$-edge. Let $p_{i,j}[n]$ denote the probability that the random walk, at time $n$, takes a shortest path from $i$ to $d$ after traversing the $\alpha$-edge $(i,j)$. Let $q_{i,j}[n]$ denote the probability that the random walk, at time $n$, will not return back to node $i$ after traversing the $\beta$-edge $(i,j)$. Note that the above probabilities depend, in general, on the considered edge, on the network structure and on the set of weights at time $n$.

###### Definition 3 (α∗-edge and β∗-edge).

An $\alpha^*$-edge is an $\alpha$-edge such that, after traversing it, the random walk takes a shortest path to $d$ with probability 1, and thus $p_{i,j}[n] = 1$. A $\beta^*$-edge is a $\beta$-edge such that, after traversing it, the random walk does not return to node $i$ with probability 1, and thus $q_{i,j}[n] = 1$.

Note that $\alpha^*$-edges and $\beta^*$-edges can occur due solely to topological constraints. In particular, we have an $\alpha^*$-edge whenever the random walk, after traversing the edge, can reach $d$ only through paths of minimum length. In a cycle-free network, all $\beta$-edges are necessarily $\beta^*$-edges.
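Decision points and the two basic edge classes depend only on the topology, so they can be computed with two breadth-first searches. The following is a minimal sketch under assumptions made here: the graph is an adjacency dictionary without parallel edges, and the helper names `bfs_distances` and `classify_edges` are hypothetical. Starred edges are not detected, since in general they depend on what the walk can do after the edge.

```python
from collections import deque

def bfs_distances(adj, root):
    """Hop-count distance from `root` to every node reachable through `adj`."""
    dist = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def classify_edges(adj, s, d):
    """Return {decision point: {successor: 'alpha' or 'beta'}}.

    Only outgoing edges from which the destination can still be reached are
    classified, in the spirit of Definitions 1 and 2.
    """
    reverse = {}
    for u, successors in adj.items():
        for v in successors:
            reverse.setdefault(v, []).append(u)
    from_s = bfs_distances(adj, s)    # nodes reachable from the source
    to_d = bfs_distances(reverse, d)  # hop-count distance of each node to the destination
    result = {}
    for i in from_s:
        if i == d:
            continue  # walks stop at the destination
        useful = [j for j in adj.get(i, ()) if j in to_d]
        if len(useful) > 1:  # more than one outgoing edge that can reach d
            result[i] = {
                j: "alpha" if to_d[j] + 1 == to_d[i] else "beta"
                for j in useful
            }
    return result
```

An edge $(i,j)$ lies on some shortest path from $i$ to the destination exactly when the distance of $j$ to the destination is one less than that of $i$, which is the test used above. A self-loop never satisfies it and is therefore always labeled `beta`.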

### 5.2 The single decision point

As a necessary first step, we will consider the simple case in which there is a single decision point in the network. The thorough analysis of this scenario provides a basic building block towards the analysis of the general case.

We start by considering the simplest case in which there are two outgoing edges (edge 1 and edge 2) from the decision point, whose initial weights are denoted by $w_1[0]$ and $w_2[0]$, respectively. Let $L_1$ and $L_2$ denote the (deterministic) length of the path experienced by random walks when traversing edge 1 and edge 2, respectively. Correspondingly, let $\Delta_1 = f(L_1)$ and $\Delta_2 = f(L_2)$ denote the rewards given to edge 1 and edge 2, respectively.

The mathematical tool used here to analyze this system, especially its asymptotic properties, is the theory of Pólya urns [15]. This theory is concerned with the evolution of the number of balls of different colors (let $K$ be the number of colors) contained in an urn from which we repeatedly draw one ball uniformly at random. If the color of the ball withdrawn is $i$, $i = 1, \ldots, K$, then $A_{i,j}$ balls of color $j$ are added to the urn, $j = 1, \ldots, K$, in addition to the ball withdrawn, which is returned to the urn. In general, $A_{i,j}$ can be deterministic or random, positive or negative. Let $A$ be the matrix with entries $A_{i,j}$, usually referred to as the schema of the Pólya urn.

We observe that a decision point can be described by a Pólya urn, where the outgoing edges represent colors, the edge weight is the number of balls in the urn, and entries $A_{i,j}$ correspond to edge reinforcements according to taken path lengths (through function $f$). Although Pólya urn models have traditionally been developed considering an integer number of balls for each color, analogous results hold in the case of real numbers, when all reinforcements $\Delta_i$ are positive (as in our case). In the simple case with only two edges, we obtain the following schema:

$$A = \begin{pmatrix} \Delta_1 & 0 \\ 0 & \Delta_2 \end{pmatrix} \tag{2}$$

We first consider the situation in which $L_1 = L_2$ (and hence $\Delta_1 = \Delta_2$), which occurs when both edges are part of a shortest path, and thus both edges are $\alpha$-edges. A classical result in Pólya urns states that the normalized weight of edge 1 (similarly for edge 2), i.e., the weight on edge 1 divided by the sum of the weights, tends in distribution to a beta distribution:

$$\frac{w_1[n]}{w_1[n] + w_2[n]} \xrightarrow{D} \beta\left(\frac{w_1[0]}{\Delta_1}, \frac{w_2[0]}{\Delta_2}\right) \tag{3}$$

Note that in this simple case the above beta distribution completely characterizes the asymptotic probability of traversing the shortest path comprising edge 1 (or edge 2). Hence, we obtain a special case of the general result stated in Theorem 1, where the random variable $c(P)$ follows a beta distribution which depends both on the update function and the initial weights. Informally, we say that both shortest paths will always ‘survive’, as they will be asymptotically used with a (random) non-zero probability, independent of the sample path taken by the system.
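The beta limit for two equally rewarded edges can be checked empirically. The sketch below is an illustration under assumptions made here (function names, number of draws and runs, seed): it simulates the two-color urn with a diagonal schema and equal reinforcements, and collects the final normalized weight of color 1 over many independent runs. A $\beta(a, b)$ distribution has mean $a/(a+b)$, so with equal reinforcements the sample mean should be close to the initial share of edge 1.

```python
import random

def polya_fraction(w1, w2, delta1, delta2, n_draws, rng):
    """Simulate a two-colour urn with diagonal schema; return the final share of colour 1."""
    for _ in range(n_draws):
        # Draw colour 1 with probability equal to its current normalized weight.
        if rng.random() < w1 / (w1 + w2):
            w1 += delta1
        else:
            w2 += delta2
    return w1 / (w1 + w2)

def sample_fractions(w1, w2, delta, n_draws=300, n_runs=2000, seed=0):
    """Final shares of colour 1 over independent runs, with equal reinforcement `delta`."""
    rng = random.Random(seed)
    return [polya_fraction(w1, w2, delta, delta, n_draws, rng) for _ in range(n_runs)]

if __name__ == "__main__":
    shares = sample_fractions(1.0, 2.0, 0.5)
    print("sample mean of the share of edge 1:", sum(shares) / len(shares))
    print("mean of beta(w1/delta, w2/delta):  ", 1.0 / 3.0)
```

The spread of the collected shares does not shrink as the number of draws grows, which is the empirical counterpart of the limit being a random variable rather than a constant.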

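The above result can be directly generalized to the case of $K$ outgoing edges, all belonging to shortest paths. Indeed, let us denote the asymptotic normalized weight of edge $i$ by $r_i$:

$$r_i = \lim_{n \to \infty} \frac{w_i[n]}{\sum_{j=1}^{K} w_j[n]}$$

Moreover, let $\alpha_i = w_i[0] / \Delta_i$. Then it is known that the joint probability density function of the $r_i$'s is that of a Dirichlet distribution with parameters $(\alpha_1, \ldots, \alpha_K)$.

A useful property of the Dirichlet distribution is aggregation: if we replace any two edges $i, j$ with initial weights $w_i[0]$, $w_j[0]$ by a single edge with initial weight $w_i[0] + w_j[0]$, we obtain another Dirichlet distribution where the ‘combined’ edge is associated to parameter $\alpha_i + \alpha_j$, i.e., if $(r_1, \ldots, r_K) \sim \mathrm{Dir}(\alpha_1, \ldots, \alpha_K)$ then $(r_1, \ldots, r_i + r_j, \ldots, r_K) \sim \mathrm{Dir}(\alpha_1, \ldots, \alpha_i + \alpha_j, \ldots, \alpha_K)$. Note that the marginal distribution with respect to any of the edges is, as expected, a beta distribution, i.e.,

$$r_i \sim \beta\left(\alpha_i, \sum_{j=1, j \neq i}^{K} \alpha_j\right)$$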
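The aggregation property can be illustrated numerically. The sketch below relies on the standard construction of a Dirichlet vector by normalizing independent gamma variates; the helper names `sample_dirichlet` and `merge_edges` and the example parameters are assumptions made here, not part of the original analysis.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one Dirichlet(alphas) vector by normalizing independent Gamma(alpha_i, 1) variates."""
    g = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(g)
    return [x / total for x in g]

def merge_edges(r, i, j):
    """Replace edges i and j by one combined edge (placed first), keeping the others."""
    rest = [x for k, x in enumerate(r) if k not in (i, j)]
    return [r[i] + r[j]] + rest

if __name__ == "__main__":
    rng = random.Random(0)
    alphas = [2.0, 1.0, 3.0]  # hypothetical parameters alpha_i
    merged = [merge_edges(sample_dirichlet(alphas, rng), 0, 1) for _ in range(4000)]
    mean_combined = sum(m[0] for m in merged) / len(merged)
    # By aggregation the combined share is beta(alpha_0 + alpha_1, alpha_2),
    # whose mean is (alpha_0 + alpha_1) / sum(alphas).
    print(mean_combined, (alphas[0] + alphas[1]) / sum(alphas))
```

Comparing only the mean is a weak check of the distributional statement; a fuller comparison would look at higher moments or a histogram of the combined share.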

Let us now consider outgoing edges that lead to paths of different lengths, starting from the simple situation in which we have just two edges. Without loss of generality, let us assume that $L_1 < L_2$, in which case edge 1 is an $\alpha$-edge and edge 2 is a $\beta$-edge. The analysis of the corresponding Pólya urn model uses a technique known as Poissonization [15]. The basic idea is to embed the discrete-time evolution of the urn in continuous time, associating to each ball in the urn an independent exponential timer with parameter 1. When a timer ‘fires’, the associated ball is drawn, and we immediately perform the corresponding ball additions (starting a new timer for each added ball). The memoryless property of the exponential distribution guarantees that the time at which a ball is drawn is a renewal instant for the system. Moreover, competition among the timers running in parallel exactly produces the desired probability to extract a ball of a given color at the next renewal instant. This means that, if $t_n$ is the (continuous) time at which the $n$-th timer fires, at time $t_n$ the number of balls in the continuous-time system has exactly the same distribution as the number of balls in the original discrete-time system after $n$ draws. It follows that the asymptotic behavior (as $t \to \infty$) of the continuous-time system coincides with the asymptotic behavior of the discrete-time system (as $n \to \infty$), but the continuous-time system is more amenable to analysis, thanks to the independence of all Poisson processes in the urn.

The Poissonization technique leads to the following fundamental result. Let $w(t)$ be the (column) vector of edge weights at time $t$ in the continuous-time system. We have (Theorem 4.1 in [15]): $E[w(t)] = e^{A^T t} w(0)$. The above result can be extended to the case in which the entries of schema $A$ are random variables (independent of each other and from one draw to another) by simply substituting $A^T$ with $E[A^T]$:

$$E[w(t)] = e^{E[A^T] t} \, w(0) \tag{4}$$

i.e., by considering a schema in which random entries are replaced by their expectations. This extension will be particularly useful in our context.
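When the schema is diagonal, as in (2), the matrix exponential in (4) reduces to one scalar exponential per edge, so the expected weights can be evaluated directly. The sketch below is a hedged illustration: the function names and example values are assumptions made here, and the reported share is a ratio of expected weights, which is not the same as the expected normalized weight.

```python
import math

def expected_weights(w0, deltas, t):
    """E[w_i(t)] from (4) for a diagonal schema: w_i(0) * exp(delta_i * t)."""
    return [w * math.exp(delta * t) for w, delta in zip(w0, deltas)]

def share_of_expected_weight(w0, deltas, t, i):
    """Ratio E[w_i(t)] / sum_j E[w_j(t)] (a ratio of expectations, used only as a rough guide)."""
    ew = expected_weights(w0, deltas, t)
    return ew[i] / sum(ew)

if __name__ == "__main__":
    # Hypothetical two-edge urn: edge 0 earns reward 1.0, edge 1 earns reward 0.5.
    for t in (0.0, 2.0, 5.0, 10.0):
        print(t, share_of_expected_weight([1.0, 1.0], [1.0, 0.5], t, 1))
```

With these example values the share of the less rewarded edge falls off as $1 / (1 + e^{0.5 t})$, which makes the exponential separation between the two expected weights easy to see.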

## 6 Asymptotic analysis

In this section we prove Theorem 1, first restricted to directed acyclic graphs (DAGs), then extended to general topologies under the multiple-reward model, and finally under the single-reward model.

### 6.1 The DAG case

Let $G$ be a directed acyclic graph (DAG) and note that in this case edges are either $\alpha$-edges or $\beta^*$-edges. Moreover, the absence of cycles forbids traversing an edge more than once, so the single-reward model coincides with the multiple-reward model.

We first introduce the following key lemma.

###### Lemma 1.

Consider a decision point having one or more $\alpha^*$-edges and one or more $\beta^*$-edges. The normalized weight of any $\beta^*$-edge vanishes to zero as $n \to \infty$.

###### Proof.

Let $\hat{L}$ denote the length of the shortest path from the decision point to $d$. Note that this path length is realized by the random walk after following an $\alpha^*$-edge. Observe that $\alpha^*$-edges can be merged together into a single virtual $\alpha^*$-edge whose weight, denoted by $\hat{w}$, is defined as the sum of the weights of the merged $\alpha^*$-edges. Similarly, we will merge all $\beta^*$-edges into a single virtual $\beta^*$-edge of weight $\dot{w}$, defined as the sum of the weights of the merged $\beta^*$-edges.

Let $\{Z_n\}$ be the stochastic process corresponding to $\dot{w}(n) / (\hat{w}(n) + \dot{w}(n))$, i.e., $Z_n$ is the normalized weight of the virtual merged $\beta^*$-edge after the $n$-th random walk. We are going to show that $\lim_{n \to \infty} Z_n = 0$, which implies that the asymptotic probability to follow any $\beta^*$-edge goes to zero as well. The proof is divided into two parts. First, we show that $\lim_{n \to \infty} Z_n$ exists almost surely, namely, $Z_n$ converges to a given constant $z$. Second, we will show that $z$ can only be equal to 0.

For the first part, we will use Doob's Martingale Convergence Theorem [7], after proving that $Z_n$ is a super-martingale. Since $Z_n$ is a discrete-time process and $0 \le Z_n \le 1$, it suffices to prove that $E[Z_{n+1} \mid \mathcal{F}_n] \le Z_n$, where the filtration $\mathcal{F}_n$ corresponds to all available information after the $n$-th walk. Now, the normalized weight, at time $n$, of any $\beta^*$-edge is stochastically dominated by the normalized weight, at time $n$, of the same $\beta^*$-edge assuming that it belongs to a path of length $\hat{L} + 1$. This is essentially the reason why we can merge all $\beta^*$-edges into a single virtual $\beta^*$-edge belonging to a path of length $\hat{L} + 1$. Hence, $Z_n \le_{st} Z'_n$, where $Z'_n$ is the aggregate normalized weight of the virtual $\beta^*$-edge.

We proceed by considering what can happen when running the $(n+1)$-th walk. Two cases are possible: i) either the random walk does not reach the decision point, in which case $Z'_{n+1} = Z'_n$ since edge weights are not updated, or ii) it reaches the decision point having accumulated a (random) hop count $\ell_{n+1}$. In the second case, we can further condition on the value taken by $\ell_{n+1}$ and prove that $E[Z'_{n+1} \mid \mathcal{F}_n, \ell_{n+1}] \le Z_n$ for all $\ell_{n+1}$:

$$
\begin{aligned}
E[Z'_{n+1} \mid \mathcal{F}_n, \ell_{n+1}]
&= Z_n \, \frac{\dot{w}(n) + f(\ell_{n+1} + \hat{L} + 1)}{\dot{w}(n) + \hat{w}(n) + f(\ell_{n+1} + \hat{L} + 1)} + (1 - Z_n) \, \frac{\dot{w}(n)}{\dot{w}(n) + \hat{w}(n) + f(\ell_{n+1} + \hat{L})} \\
&= Z_n \left[ \frac{\dot{w}(n) + f(\ell_{n+1} + \hat{L} + 1)}{\dot{w}(n) + \hat{w}(n) + f(\ell_{n+1} + \hat{L} + 1)} + \frac{\hat{w}(n)}{\dot{w}(n)} \cdot \frac{\dot{w}(n)}{\dot{w}(n) + \hat{w}(n) + f(\ell_{n+1} + \hat{L})} \right] \\
&\le Z_n \left[ \frac{\dot{w}(n) + f(\ell_{n+1} + \hat{L} + 1) + \hat{w}(n)}{\dot{w}(n) + f(\ell_{n+1} + \hat{L} + 1) + \hat{w}(n)} \right] = Z_n
\end{aligned}
\tag{5}
$$

where the inequality holds because $f$ is assumed to be non-increasing.
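The one-step inequality in (5) is easy to probe numerically. The sketch below evaluates the first line of (5) for given weights of the two virtual edges and compares it with the current normalized weight; the function name `expected_next_share`, its argument names and the reward functions used in the usage lines are assumptions made here for illustration.

```python
def expected_next_share(w_hat, w_dot, ell, L_hat, f):
    """First line of (5): conditional expectation of the next normalized weight
    of the virtual beta-edge, given the hop count `ell` at the decision point.

    w_hat: weight of the virtual alpha-edge; w_dot: weight of the virtual beta-edge.
    """
    z = w_dot / (w_hat + w_dot)    # current normalized weight Z_n
    r_beta = f(ell + L_hat + 1)    # reward if the beta-edge is followed
    r_alpha = f(ell + L_hat)       # reward if the alpha-edge is followed
    share_if_beta = (w_dot + r_beta) / (w_dot + w_hat + r_beta)
    share_if_alpha = w_dot / (w_dot + w_hat + r_alpha)
    return z * share_if_beta + (1 - z) * share_if_alpha

if __name__ == "__main__":
    w_hat, w_dot = 2.0, 3.0
    z = w_dot / (w_hat + w_dot)
    # Strictly decreasing reward: the expected share drops below z.
    print(expected_next_share(w_hat, w_dot, ell=1, L_hat=2, f=lambda L: 1.0 / L), z)
    # Constant reward: the expected share stays equal to z (martingale case).
    print(expected_next_share(w_hat, w_dot, ell=1, L_hat=2, f=lambda L: 1.0), z)
```

The constant-reward case shows why a strictly decreasing reward is needed for the second part of the proof: with equal rewards the process is a martingale and the normalized weight has no drift towards zero.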

Finally, unconditioning with respect to $\ell_{n+1}$, whose distribution descends from $\mathcal{F}_n$, and considering also the case in which the random walk does not reach the decision point, we obtain $E[Z'_{n+1} \mid \mathcal{F}_n] \le Z_n$ and thus $E[Z_{n+1} \mid \mathcal{F}_n] \le Z_n$. So far we have proven that $Z_n$ converges to a constant $z$. To show that necessarily $z = 0$, we employ the Poissonization technique recalled in Section 5.2, noticing again that $Z_n$ is stochastically dominated by $Z'_n$. For the process $Z'_n$, we have:

$$A^T = \begin{pmatrix} f(\ell + \hat{L}) & 0 \\ 0 & f(\ell + \hat{L} + 1) \end{pmatrix}$$

where $\ell$ is the (random) hop count accumulated at the decision point. We will show later that the normalized weight of any edge in the network converges asymptotically almost surely. Hence, $\ell$ has a limit distribution, that we can use to compute expected values of the entries in the above matrix:

$$E[A^T] = \begin{pmatrix} E_\ell[f(\ell + \hat{L})] & 0 \\ 0 & E_\ell[f(\ell + \hat{L} + 1)] \end{pmatrix} = \begin{pmatrix} a & 0 \\ 0 & d \end{pmatrix}$$

obtaining that $a > d$ when $f$ is strictly decreasing. At this point, we can just apply known results of Pólya urns' asymptotic behavior (see Theorem 3.21 in [11]), and conclude that the normalized weight of the $\beta^*$-edge must converge to zero. Alternatively, we can apply (4) and observe that in this simple case

$$\begin{pmatrix} E[\hat{w}(t)] \\ E[\dot{w}(t)] \end{pmatrix} = \begin{pmatrix} e^{at} & 0 \\ 0 & e^{dt} \end{pmatrix} \begin{pmatrix} \hat{w}(0) \\ \dot{w}(0) \end{pmatrix} \tag{6}$$

Therefore the (average) weight of the $\alpha^*$-edge increases exponentially faster than the (average) weight of the $\beta^*$-edge. ∎

Lemma 1 provides the basic building block to prove Theorem 1.

###### Proof of Theorem 1 (DAG case).

We sequentially consider the decision points of the network according to the partial topological ordering given by the hop-count distance from the destination. Simply put, we start considering decision points at distance 1 from the destination, then those at distance 2, and so on, until we hit the source node $s$. We observe that Lemma 1 can be immediately applied to decision points at distance 1 from the destination. Indeed, these decision points have one (or more) $\alpha^*$-edge, with $\hat{L} = 1$, connecting them directly to $d$, and zero or more $\beta^*$-edges connecting them to nodes different from $d$. Then, Lemma 1 allows us to conclude that, asymptotically, the normalized weight of the virtual $\alpha^*$-edge will converge to 1, whereas the normalized weight of all $\beta^*$-edges will converge to zero. This fact essentially allows us to prune the $\beta^*$-edges of decision points at distance 1, and re-apply Lemma 1 to decision points at distance 2 (and so on). Note that after the pruning, an $\alpha$-edge of a decision point at distance 2 necessarily becomes an $\alpha^*$-edge. As a consequence of the progressive pruning of $\beta^*$-edges, we remove from the graph all edges which do not belong to shortest paths from a given node to $d$ (when we prune a $\beta^*$-edge, we also remove at the same time the edges that can only be traversed by following the pruned edge, and notice that in doing so we may also remove some $\alpha$-edges).

When the above iterative procedure hits the source node $s$, we are guaranteed that only shortest paths from $s$ to $d$ remain in the residual graph (and all of them). As a consequence, over the residual graph, a random walk starting from $s$ can only reach $d$ through a shortest path. Note that the normalized weight of any edge belonging to a shortest path will converge to a random variable bounded away from zero. Hence the asymptotic probability to follow any given shortest path $P$, given by the product of normalized weights of its edges, will converge as well to a random variable bounded away from zero. Conversely, any path which is not a shortest path cannot ‘survive’. Indeed, any such path must traverse at least one decision point and take at least one $\beta^*$-edge. However, the above iterative procedure will eventually prune all $\beta^*$-edges belonging to the considered non-shortest path, which therefore cannot survive. ∎
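The probability of following a given path, expressed above as a product of normalized edge weights, can be computed directly from a snapshot of the weights. The following is a small sketch; the function name `path_probability` and the nested-dictionary weight layout are assumptions made here.

```python
def path_probability(weights, path):
    """Probability that a single WRW follows `path` (a list of nodes), given fixed weights.

    `weights` maps node -> {successor: positive weight}. Decisions at different nodes
    are independent, so the probability is the product of the normalized weights.
    """
    prob = 1.0
    for i, j in zip(path, path[1:]):
        prob *= weights[i][j] / sum(weights[i].values())
    return prob

if __name__ == "__main__":
    # Hypothetical snapshot of the weights on a small network with a cycle s -> a -> s.
    w = {"s": {"a": 1.0, "d": 3.0}, "a": {"d": 2.0, "s": 2.0}}
    print(path_probability(w, ["s", "d"]))       # direct edge
    print(path_probability(w, ["s", "a", "d"]))  # two-hop detour
```

Evaluating this function on successive weight snapshots of a simulation gives a simple way to track how the probability of each candidate path evolves over time.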

### 6.2 The multiple reward model in general network

We now consider the case of an arbitrary directed graph possibly with nodes exhibiting (even multiple) self-loops. Moreover, we first focus on the multiple-reward model which is more challenging to analyze, and discuss the single-reward model in Section 6.3.

Essentially, we follow the same reasoning as in the DAG case, by first proving a generalized version of Lemma 1.

###### Lemma 2.

Consider a decision point having one or more -edges and one or more -edges. The normalized weight of any -edge vanishes to zero as .

###### Proof.

Similarly to the proof of Lemma 1, we merge all -edges into a single virtual -edge with total weight . Moreover, we merge all -edges into a single virtual -edge with weight , defined as the sum of the weights of the merged -edges. Such virtual -edge can be interpreted as the best adversary against the virtual -edge. Clearly, the best -edge is an outgoing edge that (possibly) brings the random walk back to the decision point over the shortest possible cycle, i.e., a self-loop. It is instead difficult, a priori, to establish which is the best possible value of its parameter , i.e., the probability (in general dependent on ) that makes the virtual -edge the best competitor of the virtual -edge. Therefore, we consider arbitrary values of (technically, if then the -edge cannot be a self-loop, but we optimistically assume that loops have length 1 even in this case). In the following, to ease the notation, let . Similarly to the DAG case, we optimistically assume that if the random walk reaches the destination without passing through the -edge, the overall hop count will be , where is the hop count accumulated when first entering the decision point, while denotes the number of (self) loops. Instead, if the random walk reaches the destination by eventually following the -edge, the overall hop count will be . In any real situation, the normalized cumulative weight of -edges is stochastically dominated by the weight of the virtual best adversary, having normalized weight . We have:

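$$\mathbb{E}\left[Z'_{n+1}\mid\mathcal{F}_n,\ell_{n+1}\right] = Z_n\left[\sum_{i=0}^{\infty}\left[(1-q)Z_n\right]^i\left(\frac{\hat{w}(n)}{\dot{w}(n)}\cdot\frac{\dot{w}(n)+i\Delta(i)}{\dot{w}(n)+i\Delta(i)+\hat{w}(n)+\Delta(i)} + q\,\frac{\dot{w}(n)+i\Delta'(i)}{\dot{w}(n)+i\Delta'(i)+\hat{w}(n)}\right)\right] \tag{7}$$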

where and .

Now, it turns out that the term in square brackets of the latter expression is smaller than or equal to one for any value of , , , , and non-increasing function . This property can be easily checked numerically, but a formal proof requires some effort (see App. A). As a consequence, . Finally, unconditioning with respect to , whose distribution follows from , and considering also the case in which the -th random walk does not reach the decision point, we obtain and thus . Hence, we have that converges to a constant .

To show that necessarily , we employ the Poissonization technique as in Section 5.2, noticing again that is stochastically dominated by . For the process , we have:

$$\mathbb{E}\left[A^{T}\right]=\begin{pmatrix} a & b \\ 0 & d \end{pmatrix} \tag{8}$$

The entries in the above matrix have the following meaning:

• is the average reward given to the -edge if we select the -edge;

• is the average reward given to the -edge if we select the -edge;

• is the average reward given to the -edge if we select the -edge;

Note that the average reward given to the -edge if we select the -edge is zero.

Luckily, the exponential of a 2x2 matrix in triangular form is well known [2] (see also [12] for limit theorems of triangular Pólya urn schemes). In particular, when we obtain:

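$$\begin{pmatrix}\mathbb{E}[\hat{w}(t)] \\ \mathbb{E}[\dot{w}(t)]\end{pmatrix} = \begin{pmatrix} e^{at} & \dfrac{b}{d-a}\left(e^{dt}-e^{at}\right) \\ 0 & e^{dt} \end{pmatrix}\begin{pmatrix}\hat{w}(0) \\ \dot{w}(0)\end{pmatrix} \tag{9}$$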

The special case in which will be considered later (see Section 6.3).
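As a sanity check of the closed form in (9), the following Python sketch compares it with a truncated Taylor series of the matrix exponential for a 2x2 upper-triangular matrix. The numerical values of `a`, `b`, `d` and `t` are hypothetical, and the series-based helper `expm_series` is only a convenient stand-in for a library routine.

```python
import math

def matmul2(X, Y):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def expm_series(M, terms=60):
    """Matrix exponential of a 2x2 matrix via a truncated Taylor series."""
    result = [[1.0, 0.0], [0.0, 1.0]]
    term = [[1.0, 0.0], [0.0, 1.0]]
    for k in range(1, terms):
        # term becomes M^k / k!
        term = [[x / k for x in row] for row in matmul2(term, M)]
        result = [[result[i][j] + term[i][j] for j in range(2)] for i in range(2)]
    return result

def expm_triangular(a, b, d, t):
    """Closed form of exp(t * [[a, b], [0, d]]), valid only when a != d."""
    ea, ed = math.exp(a * t), math.exp(d * t)
    return [[ea, b / (d - a) * (ed - ea)], [0.0, ed]]

a, b, d, t = 0.5, 0.3, 0.8, 2.0  # hypothetical values with a != d
closed = expm_triangular(a, b, d, t)
numeric = expm_series([[a * t, b * t], [0.0, d * t]])
max_gap = max(abs(closed[i][j] - numeric[i][j]) for i in range(2) for j in range(2))
```

The two matrices agree up to floating-point error, which is consistent with the off-diagonal term of (9) being the only coupling between the two expected weights.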

To show that necessarily , we reason by contradiction, assuming that converges to . This implies that

$$\dot{w}(t)=\frac{z}{1-z}\,\hat{w}(t)+o\left(\hat{w}(t)\right) \tag{10}$$

Moreover, we will assume that a large enough number of walks has already been performed such that, for all successive walks, the probability to follow the -edge is essentially equal to . Specifically, let be a large enough time step such that the normalized weight of the -edge is for all . We can then ‘restart’ the system from time , considering as initial weights and (the specific values are not important).

Taking expectation of (10) and plugging in the expressions of the average weights in (9), we have that the following asymptotic relation must hold (footnote 3: given two functions and , we write if ):

$$e^{dt}\,\dot{w}(n^*) \sim_e \frac{z}{1-z}\left(e^{at}\,\hat{w}(n^*)+\frac{b}{d-a}\left(e^{dt}-e^{at}\right)\dot{w}(n^*)\right)$$

Clearly, the above relation does not hold if . If , the relation is satisfied when

Interestingly, we will see that (11) holds when the reward function is constant, suggesting that in this case the -edge can indeed ‘survive’ the competition with the -edge. By contrast, we will show that , for any strictly decreasing function , proving that the normalized weight of the -edge cannot converge to any .

For simplicity, we will first consider the case in which , the hop count accumulated by the random walk while first entering the decision point, is not random but deterministic and equal to . Under the above simplification, we have:

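$$\begin{aligned}
a &= f(\ell+\hat{L}) && (12)\\
b &= (1-q)(1-z)\sum_{i=0}^{\infty}\left[z(1-q)\right]^i f(\ell+i+1+\hat{L}) && (14)\\
d &= \sum_{i=0}^{\infty}\left[z(1-q)\right]^i\Big[q\,(i+1)\,f(\ell+i+1+\hat{L}) + (1-q)(1-z)(i+1)\,f(\ell+i+1+\hat{L})\Big]
\end{aligned}$$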

In the special case in which the reward function is constant (let this constant be ), we obtain:

$$\begin{aligned}
a &= C && (15)\\
b &= \frac{C(1-q)(1-z)}{1-z+qz} && (16)\\
d &= \frac{C}{1-z+qz} && (17)
\end{aligned}$$

It is immediate to verify that (15), (16), (17) satisfy (11) for any (the case corresponds to having , which is considered separately in Section 6.3).
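The constant-reward case can be checked numerically with a small Python sketch. It evaluates the series for a, b and d by truncation and compares them with the closed forms (15)-(17). It also checks the ratio b/(d-a) against (1-z)/z; taking this to be the content of condition (11) is an assumption of this sketch, obtained by keeping only the dominant exponential terms of the asymptotic relation above when d > a. The function name `abd_from_series`, the shorthand `L` for the sum of the entry hop count and the length of the remaining path, and all numerical values are hypothetical.

```python
def abd_from_series(f, z, q, L, terms=5000):
    """Truncated evaluation of the series for a, b, d; L stands for l + L_hat."""
    x = z * (1.0 - q)           # common ratio of the geometric-like series
    a = f(L)
    b = 0.0
    d = 0.0
    power = 1.0                 # holds x**i
    for i in range(terms):
        reward = f(L + i + 1)
        b += (1 - q) * (1 - z) * power * reward
        d += power * (i + 1) * (q + (1 - q) * (1 - z)) * reward
        power *= x
    return a, b, d

C, z, q, L = 2.0, 0.6, 0.25, 4  # hypothetical values
a, b, d = abd_from_series(lambda hops: C, z, q, L)

# Closed forms (15)-(17) for a constant reward C.
a_closed = C
b_closed = C * (1 - q) * (1 - z) / (1 - z + q * z)
d_closed = C / (1 - z + q * z)

ratio = b / (d - a)             # compared below with (1 - z) / z
```

With a constant reward the ratio matches (1-z)/z for every admissible choice of the parameters tried here, which is consistent with the claim that the competing edge can survive in this case.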

To analyze what happens when is a decreasing function, we adopt an iterative approach. We consider a sequence of reward functions , indexed by , defined as follows. Let be the minimum path length experienced by random walks traversing the decision point. We define:

$$f_k(L+i)=\begin{cases} f(L+i) & \text{if } 0\le i\le k \\ f(L+k) & \text{if } i>k \end{cases} \tag{18}$$

In words, function matches the actual reward function up to hop count , while it takes a constant value (equal to ) for larger hop counts. See Figure 1.

In our proof, we will actually generalize the result in Theorem 1, allowing the reward function to be non-increasing for values larger than . To simplify the notation, let . For , let , with , and .

Let () be the entries of matrix (8) when we assume that rewards are given to edges according to function (function ), with . As a first step, we can show that (11) does not hold already for , i.e., for a reward function which is equal to for hop count , and equal to for any . Indeed, in this case we have:

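$$\begin{aligned}
a_0 &= C\\
b_0 &= (1-q)(1-z)\sum_{i=0}^{\infty}\left[z(1-q)\right]^i (C-\delta_1) = \frac{(1-q)(1-z)}{1-z+zq}\,(C-\delta_1)\\
d_0 &= (1-z+zq)\sum_{i=0}^{\infty}\left[z(1-q)\right]^i (i+1)(C-\delta_1) = \frac{1}{1-z+zq}\,(C-\delta_1)
\end{aligned}$$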

It can be easily checked that for any .

To show that (11) cannot hold for the actual reward function , it is then sufficient to prove the inductive step

$$\frac{b_k}{d_k-a_k}\le\frac{b_{k+1}}{d_{k+1}-a_{k+1}}$$

Note, indeed, that the sequence of functions tends point-wise to . Now, for any for which there is nothing to prove, since in this case . So let’s suppose that .

We have . We can write as:

$$b_k = \hat{b}+(1-q)(1-z)\sum_{i=k+1}^{\infty}\left[z(1-q)\right]^i (C-\delta_k) = \hat{b}+(1-q)(1-z)(C-\delta_k)\,\frac{\left[z(1-q)\right]^{k+1}}{1-z+zq}$$

where

$$\hat{b}=(1-q)(1-z)\sum_{i=0}^{k}\left[z(1-q)\right]^i (C-\delta_{i+1})$$

We can write as:

$$b_{k+1} = \hat{b}+(1-q)(1-z)\sum_{i=k+1}^{\infty}\left[z(1-q)\right]^i (C-\delta_{k+1}) = \hat{b}+(1-q)(1-z)(C-\delta_{k+1})\,\frac{\left[z(1-q)\right]^{k+1}}{1-z+zq}$$

Similarly, we have:

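$$\begin{aligned}
d_k &= \hat{d}+(C-\delta_k)\left[z(1-q)\right]^{k+1}\left(k+1+\frac{1}{1-z+zq}\right)\\
d_{k+1} &= \hat{d}+(C-\delta_{k+1})\left[z(1-q)\right]^{k+1}\left(k+1+\frac{1}{1-z+zq}\right)
\end{aligned}$$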

where

$$\hat{d}=\sum_{i=0}^{k}\left[z(1-q)\right]^i (i+1)(1-z+zq)(C-\delta_{i+1})$$

We will assume that both and , otherwise the result is trivial (if , then also , since , . If , the normalized ratio of the -edge can only tend to zero). Under this assumption, we can show that

$$\frac{b_k}{d_k-a_k}<\frac{b_{k+1}}{d_{k+1}-a_{k+1}} \tag{19}$$

Indeed, plugging in the expressions of , after some algebra we reduce inequality (19) to:

$$\hat{b}\left[(k+1)(1-z+zq)+1\right]+C(1-q)(1-z)>\hat{d}\,(1-q)(1-z)$$

Finally, recalling the definitions of and , we obtain that the above inequality is satisfied if

$$\sum_{i=0}^{k}\left[z(1-q)\right]^i\left[(k+1)(1-z+zq)+1\right](C-\delta_{i+1}) > \sum_{i=0}^{k}\left[z(1-q)\right]^i (i+1)(1-z+zq)(C-\delta_{i+1})$$

which is clearly true, since when varies from 0 to .
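The monotonicity used in the inductive step can be explored numerically. The Python sketch below plugs the truncated reward functions of (18) directly into the series for a, b and d and evaluates the ratio b_k/(d_k - a_k) for increasing k. Several choices are assumptions of this sketch rather than part of the proof: the reward function f(h) = 1/h, the parameter values, the shorthand `L` for the minimum path length, and the reading of condition (11) as b/(d-a) = (1-z)/z. With the truncation taken literally from (18), the function with k = 0 is constant, so the first ratio coincides with (1-z)/z.

```python
def ratio_for_truncation(f, k, z, q, L, terms=4000):
    """Return b_k / (d_k - a_k) when rewards follow the truncated function f_k."""
    def f_k(hops):
        # f_k matches f up to hop count L + k and stays constant afterwards.
        return f(min(hops, L + k))

    x = z * (1.0 - q)
    a = f_k(L)
    b = 0.0
    d = 0.0
    power = 1.0                     # holds x**i
    for i in range(terms):
        reward = f_k(L + i + 1)
        b += (1 - q) * (1 - z) * power * reward
        d += (1 - z + z * q) * power * (i + 1) * reward
        power *= x
    return b / (d - a)

z, q, L = 0.9, 0.1, 5               # hypothetical values, chosen so that d_k > a_k

def reward(hops):
    """Hypothetical strictly decreasing reward function."""
    return 1.0 / hops

ratios = [ratio_for_truncation(reward, k, z, q, L) for k in range(30)]
target = (1 - z) / z                # value the ratio takes under a constant reward
```

For these values the sequence of ratios starts at (1-z)/z and then increases with k, so the limit ratio stays strictly above (1-z)/z, in line with the conclusion that a strictly decreasing reward function prevents the competing edge from keeping a positive normalized weight.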

We now provide a sketch of the proof for the case in which the random walk arrives at the decision point having accumulated a random hop count . After long enough time, we can assume that the probability distribution of has converged to a random but fixed distribution that no longer depends on . Indeed, such distribution depends only on normalized edge weights, which in the long run converge to constant values. Let , , where is the minimum hop count that can be accumulated at the decision point. We can use to compute expected values of , , as defined in (12), (14), (14), and apply again the Poissonization technique to compute asymptotic values of edge weights.

Specifically, letting , we obtain:

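$$\begin{aligned}
\mathbb{E}[a] &= \sum_{m=0}^{\infty}p_m\,(C-\delta_m)\\
\mathbb{E}[b] &= (1-q)(1-z)\sum_{m=0}^{\infty}\sum_{i=0}^{\infty}\left[z(1-q)\right]^i (C-\delta_{m+i+1})\\
\mathbb{E}[d] &= (1-z+zq)\sum_{m=0}^{\infty}\sum_{i=0}^{\infty}\left[z(1-q)\right]^i (i+1)(C-\delta_{m+i+1})
\end{aligned}$$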

Similarly to before, we prove by contradiction that (11) cannot hold, through an iterative approach based on the sequence of reward functions . As the base step of the induction, we take the reward function equal to for any hop count larger than . Hence, we have , . It follows that is exactly the same as in (14), and is exactly the same as in (14). The only quantity that is different is , where as long as . Therefore, whenever there is a non-zero probability to reach the decision point with minimum hop count, the base induction step proven before still holds here, by redefining and as and , respectively. One can also prove that the generic iterative step still holds, by following the same lines as in the base case. Indeed, one can verify that

when . This concludes the proof of Lemma 2. ∎

###### Proof of Theorem 1 (general case).

The proof is exactly the same as in the DAG case, with the difference that we employ Lemma 2 instead of Lemma 1 to iteratively prune -edges from the decision points, leaving only paths from to of minimum length. ∎

### 6.3 The single-reward model in a general network

We conclude the asymptotic analysis considering the single-reward model in a general directed network. Given the analysis for the multiple-reward model, the single-reward model is almost immediate. Indeed, the expressions for and (respectively in (12) and (14)) are left unmodified, as well as their averages and with respect to hop count accumulated at the decision point. Instead, we have

$$\mathbb{E}[d]=(1-z+zq)\sum_{m=0}^{\infty}\sum_{i=0}^{\infty}\left[z(1-q)\right]^i (C-\delta_{m+i+1}) \tag{20}$$

which is clearly smaller than the obtained under the multiple-reward model. Hence, the basic step of the induction used to prove Lemma 2 follows immediately from the consideration that is an increasing function of . Moreover, simple algebra shows that the iterative step holds also in the case of single-reward, allowing us to extend the validity of Lemma 2, and thus Theorem 1.

Last, it is interesting to consider the case of single reward model and constant reward function, . We have in this case:

$$\begin{aligned}
a &= C && (21)\\
b &= \frac{C(1-q)(1-z)}{1-z+qz} && (22)\\
d &= C && (23)
\end{aligned}$$

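Since , the matrix exponential takes a different form with respect to (9), and now reads:

$$\begin{pmatrix}\mathbb{E}[\hat{w}(t)] \\ \mathbb{E}[\dot{w}(t)]\end{pmatrix} = e^{at}\begin{pmatrix} 1 & bt \\ 0 & 1 \end{pmatrix}\begin{pmatrix}\hat{w}(0) \\ \dot{w}(0)\end{pmatrix} \tag{24}$$

We can show by contradiction that the normalized weight of the -edge cannot tend to any . Indeed, assuming to restart the system after a long enough number of walks such that , we should have:

$$e^{at}\,\dot{w}(n^*) \sim_e \frac{z}{1-z}\left(e^{at}\,\hat{w}(n^*)+e^{at}\,b\,t\,\dot{w}(n^*)\right)$$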

which can only be satisfied if . Interestingly, equals 0 when , i.e., when the -edge becomes a -edge. This means that, asymptotically, the probability that the random walk makes any loop must vanish to zero. We conclude that, in the case of a constant single reward model, many paths can survive (including non-shortest paths), but not those containing loops. In other words, surviving edges must belong to a DAG. Simulation results, omitted here due to lack of space, confirm this prediction.


## 7 Transient analysis

Beyond the asymptotic behavior, it is interesting to consider the evolution of edge weights over time. In particular, since all non-shortest paths are taken with vanishing probability, what law governs the decay rate of such probabilities? How does the decay rate depend on system parameters, such as network topology and reward function? Such questions are directly related to analogous questions regarding how normalized edge weights evolve over time, as the probability of taking a given path is simply the product of the probabilities of taking its edges. Thus, we investigate the transient behavior of normalized edge weights.

### 7.1 Single decision point

We again start by considering the case of a single decision point with two outgoing edges (edge 1 and edge 2), whose initial weights are denoted by and , respectively. Let and be the rewards associated to edge 1 and edge 2, and the corresponding path lengths.

As discussed in Section 5.2, the dynamics of this discrete time system can be usefully embedded into continuous time using the Poissonization technique, which immediately provides the transient behavior of the system in the simple form (4). To complete the analysis, the solution in continuous time should be transformed back into discrete time . Unfortunately, this operation can be done exactly only in the trivial case of just one edge. With two (or more) edges, we can resort to an approximate (yet quite accurate) heuristic called depoissonization, which can be applied to all Pólya urn models governed by invertible ball addition matrices [15]. In this simple topology, assuming , the approximation consists in assuming that all ball extractions that have occurred by time are associated to the winning edge only (this becomes increasingly accurate as time passes), which permits deriving the following approximate relation between and , where is the average time at which the -th ball is drawn:

$$n\approx\frac{w_1[0]}{\Delta_1}\,e^{\Delta_1\bar{t}_n} \tag{25}$$

from which one obtains . Using this approximate value of into (4), we can approximate the expected values of edge weights after walks as:

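$$\begin{pmatrix}\mathbb{E}[w_1[n]] \\ \mathbb{E}[w_2[n]]\end{pmatrix} \approx \begin{pmatrix} e^{\Delta_1\bar{t}_n} & 0 \\ 0 & e^{\Delta_2\bar{t}_n} \end{pmatrix}\begin{pmatrix} w_1[0] \\ w_2[0] \end{pmatrix} = \begin{pmatrix} \Delta_1 n \\ w_2[0]\left(\dfrac{n\Delta_1}{w_1[0]}\right)^{\Delta_2/\Delta_1} \end{pmatrix} \tag{26}$$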
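To get a feel for approximation (26), the following Python sketch evaluates it and sets it next to a direct simulation of the two-edge system, in which each walk selects an edge with probability proportional to its current weight and adds that edge's reward to it. The function names, the rewards 1 and 0.5, the unit initial weights, the seed and the number of runs are all hypothetical choices made for illustration.

```python
import random

def approx_weights(n, w1_0, w2_0, delta1, delta2):
    """Approximate expected weights after n walks, following (26)."""
    return delta1 * n, w2_0 * (n * delta1 / w1_0) ** (delta2 / delta1)

def simulate_weights(n, w1_0, w2_0, delta1, delta2, rng):
    """One run of the two-edge reinforcement process over n walks."""
    w1, w2 = w1_0, w2_0
    for _ in range(n):
        # Pick edge 1 with probability equal to its normalized weight.
        if rng.random() < w1 / (w1 + w2):
            w1 += delta1
        else:
            w2 += delta2
    return w1, w2

rng = random.Random(0)              # fixed seed for reproducibility
n, runs = 2000, 100
sims = [simulate_weights(n, 1.0, 1.0, 1.0, 0.5, rng) for _ in range(runs)]
mean_w1 = sum(s[0] for s in sims) / runs
mean_w2 = sum(s[1] for s in sims) / runs
approx_w1, approx_w2 = approx_weights(n, 1.0, 1.0, 1.0, 0.5)
sim_share_2 = mean_w2 / (mean_w1 + mean_w2)   # simulated share of edge 2
```

Comparing `mean_w1` and `mean_w2` with `approx_w1` and `approx_w2` for several values of `n` gives an empirical view of how the approximation behaves as the number of walks grows.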

The above approximation is not quite accurate for small values of . In particular, the normalized weight of edge 2, according to (26), can be even larger than the initial value . For this reason, for small values of , we improve the approximation by assuming that the (average) normalized weight of edge 2 cannot exceed its initial value at time 0. Indeed, we can easily find analytically the maximum value of , denoted by , for which we bound the normalized weight of edge 2 to the value