# Learning to Route Efficiently with End-to-End Feedback: The Value of Networked Structure

We introduce efficient algorithms which achieve nearly optimal regrets for the problem of stochastic online shortest path routing with end-to-end feedback. The setting is a natural application of the combinatorial stochastic bandits problem, a special case of the linear stochastic bandits problem. We show how the difficulties posed by the large scale action set can be overcome by the networked structure of the action set. Our approach presents a novel connection between bandit learning and shortest path algorithms. Our main contribution is an adaptive exploration algorithm with nearly optimal instance-dependent regret for any directed acyclic network. We then modify it so that nearly optimal worst case regret is achieved simultaneously. Driven by the carefully designed Top-Two Comparison (TTC) technique, the algorithms are efficiently implementable. We further conduct extensive numerical experiments to show that our proposed algorithms not only achieve superior regret performances, but also reduce the runtime drastically.


## 1 Introduction

We study the problem of shortest path routing over a network, where the link delays are not known in advance. When delays are known, it is possible to compute the shortest path in polynomial time via the celebrated Dijkstra’s algorithm Dijkstra (1959) or the Bellman-Ford algorithm Bellman (1958). However, link delays are often unknown, and evolve over time according to some unknown stochastic process. Moreover, there are many real-world scenarios in which only the end-to-end delays are observable. For example, an overlay network is a communication network architecture that integrates controllable overlay nodes into an uncontrollable underlay network of legacy devices. It is generally difficult to ensure individual link delay feedback when routing in an overlay network, as the underlay nodes are not necessarily cooperative. Fig. 1 shows a very simple overlay network, where the only overlay nodes are the source node (node 1) and the destination node (node 6), while the nodes within the dotted circle are underlay nodes. Here, the Decision Maker (DM) can choose to route the packets along one of the five available paths. If it picks a path, it observes only the realized delay of the whole path, not the realized delays of the individual links on it. These uncertainties and the network architectural constraints make the problem fall into the category of stochastic online shortest path routing with end-to-end feedback Talebi et al. (2018).

Stochastic online shortest path routing with end-to-end feedback is one of the most fundamental real-time decision-making problems. In its canonical form, a DM is presented with a network of $d$ links; each link’s delay is a random variable, following an unknown stochastic process with unknown fixed mean over $T$ rounds. In each round, a packet arrives at the DM, and the DM chooses a path to route the packet from the source to the destination. The packet then incurs a delay, which is the sum of the delays realized on the associated links. Afterwards, the DM learns the end-to-end delay, i.e., the realized delay of the path, but the individual links’ delays remain concealed. This is often called the bandit-feedback setting Talebi et al. (2018); Kveton et al. (2015). The DM’s goal is to design a routing policy that minimizes the cumulative expected delay. If the DM had full knowledge of the delay distributions, it would always route the packets through the path with the shortest expected delay. With that in mind, a reasonable performance metric for evaluating the policy is the expected regret, defined to be the expected total delay of routing through the actual paths selected by the DM minus the expected total delay of routing through the path with the shortest expected delay. In order to minimize the regret, the DM needs to learn the delay distributions on-the-fly. One viable approach to estimate the path delays is to inspect the end-to-end delays experienced by packets sent on different paths. This gives rise to an exploration-exploitation dilemma. On one hand, the DM is not able to estimate the delay of an under-explored path; on the other, the DM wants to send the packet via the estimated shortest path to greedily minimize the cumulative delay incurred by the packets.

The Upper Confidence Bound (UCB) algorithm, following the Optimism-in-the-Face-of-Uncertainty (OFU) principle, is one of the most prevalent strategies to deal with the exploration-exploitation dilemma. In the ordinary stochastic multi-armed bandit (MAB) setting, the UCB algorithm proposes a very intuitive policy framework: the DM should select actions by maximizing over rewards estimated from previous data, but only after biasing each estimate according to its uncertainty. Simply put, one should choose the action that maximizes the “mean plus confidence interval.” Treating the inverse of delay as reward, a naive application of the UCB algorithm to stochastic online shortest path routing results in regret bounds and computation time that scale linearly with the number of paths. For small scale overlay networks, this achieves low regret efficiently. However, networks often have exponentially many paths, and direct implementation of the UCB algorithm is neither computationally efficient nor regret optimal. In the combinatorial semi-bandits setting, the realized delay of each individual link on the chosen path is revealed. The authors of Gai et al. (2012) take advantage of the individual feedback, and propose a solution for the problem by computing the UCB of each link. The authors of Kveton et al. (2015); Talebi et al. (2018) further design algorithms to match the regret lower bounds. Unfortunately, algorithms proposed for the semi-bandit feedback setting cannot be extended to the bandit feedback setting, as individual link feedback is not available.

When only end-to-end/bandit feedback is available, the authors of Liu and Zhao (2012) propose algorithms whose regret has optimal dependence on the total number of rounds (although the regret has sub-optimal dependence on the size of the network). But the algorithms require the DM to enumerate over the path set to select a path in each round. This degrades the practicality of the algorithms significantly, especially when deployed in large-scale networks. Existing works have also tried to investigate the problem through the more general linear stochastic bandits setting, see e.g., Dani et al. (2008); Abbasi-Yadkori et al. (2009, 2011). Nevertheless, the proposed algorithms again suffer from high computational complexity Dani et al. (2008). Even worse, existing works in the linear stochastic bandits literature ignore the network structure of the action set. Hence, only sub-optimal regret bounds are achieved.

As a matter of fact, the problem of stochastic online shortest path routing with end-to-end feedback falls into the category of combinatorial stochastic bandits, a special case of linear stochastic bandits with the action set constrained to be a subset of $\{0,1\}^d$. However, finding efficient algorithms for combinatorial/linear stochastic bandits with (nearly) optimal regret remains an open problem Bubeck (2016). All of the above mentioned findings motivate us to exploit the networked structure of the action set to design efficient algorithms for the stochastic online shortest path problem with end-to-end feedback. Specifically, we aim at answering the following question:

Can we leverage the power of the network structure to design efficient algorithms that achieve (nearly) optimal instance-dependent and worst case regret bounds simultaneously for stochastic online shortest path routing under bandit-feedback?

In this paper, we give an affirmative answer to the above question. We start with algorithms for the stochastic online shortest path routing problem with identifiable network structure, and gradually remove the extra assumptions to arrive at the most general case. Specifically, our contributions can be summarized as follows:

• Assuming network identifiability, we first develop an efficient non-adaptive exploration algorithm with nearly optimal instance-dependent regret and sub-optimal worst case regret when the minimum gap is known. (The concepts of network identifiability, instance-dependent regret, worst case regret, and minimum gap will be defined in Sections 2 and 3.)

• The main contribution is an adaptive exploration algorithm with nearly optimal instance-dependent regret without any knowledge of the minimum gap. Coupled with the novel Top-Two Comparison technique, the algorithms can be efficiently implemented. We also propose a simple modification for the algorithm to achieve nearly optimal worst case regret simultaneously.

• Complemented with an algorithm for finding a basis in general networks, we show that our results can be applied to general networks without degrading the regret performance.

• We conduct extensive numerical experiments to validate that our proposed algorithms not only achieve superior regret performances, but also reduce the runtime drastically.

The rest of the paper is organized as follows. In Section 2, we describe the model of stochastic online shortest path routing with end-to-end feedback. In Section 3, we review the concepts of efficient exploration basis and make connections to network identifiability. Assuming network identifiability in Section 4, we propose the non-adaptive Explore-then-Commit algorithm to achieve nearly optimal instance-dependent regret when the minimum gap is known. In Section 5, we present the novel Top-Two Comparison and modified Top-Two Comparison algorithms to achieve nearly optimal instance-dependent and worst case regrets without any additional knowledge. In Section 6, we further study the problem without network identifiability, and propose an efficient algorithm with nearly optimal instance-dependent regret. In Section 7, we present numerical results to demonstrate the empirical performances of the proposed algorithms. In Section 8, we review related works in the bandits literature. In Section 9, we conclude our paper.

## 2 Problem Formulation

### 2.1 Notation

Throughout the paper, all the vectors are column vectors by default unless specified otherwise. We define $[n]$ to be the set $\{1,\ldots,n\}$ for any positive integer $n$. We use $\|x\|_p$ to denote the $\ell_p$ norm of a vector $x$. To avoid clutter, we often omit the subscript when we refer to the $\ell_2$ norm. For a positive definite matrix $A$, we use $\|x\|_A = \sqrt{x^\top A x}$ to denote the matrix norm of a vector $x$. We also denote $x \wedge y$ as the minimum between $x$ and $y$. We follow the convention to describe the growth rate using the notations $O(\cdot)$, $\Omega(\cdot)$, and $\Theta(\cdot)$. If logarithmic factors are ignored, we use $\tilde{O}(\cdot)$, $\tilde{\Omega}(\cdot)$, and $\tilde{\Theta}(\cdot)$, respectively.

### 2.2 Model

Given a directed acyclic network $G$, an online stochastic shortest path problem is defined by a $d$-dimensional unknown but fixed mean link delay vector $\mu$, paths $a_k \in \mathcal{A} \subseteq \{0,1\}^d$ for $k \in [K]$, and noise terms $\eta_{t,k}$ for $t \in [T]$ and $k \in [K]$, where $k$ is the index for paths and $t$ is the index for rounds. Here, $\mathcal{A}$ is the set of all possible paths in $G$, and $a_{k,i} = 1$ for a path $a_k$ if and only if it traverses link $i$. With some abuse of notation, we use $a_k$ and $k$ interchangeably to denote path $k$, and we refer to $\mathcal{A}$ as both a set and a matrix. Routing a packet through path $I_t$ in round $t$ incurs the delay $L_{t,I_t} = \langle a_{I_t}, \mu \rangle + \eta_{t,I_t}$. Following the convention of existing bandits literature Abbasi-Yadkori et al. (2011), we assume that $\eta_t$ (shorthand for $\eta_{t,I_t}$) is conditionally $R$-sub-Gaussian, where $R$ is a fixed and known constant. Formally, this means

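$$\forall \alpha \in \mathbb{R}: \quad \mathbb{E}\left[\exp(\alpha \eta_t) \,\middle|\, a_{I_1},\ldots,a_{I_{t-1}},\eta_1,\ldots,\eta_{t-1}\right] \le \exp\left(\frac{\alpha^2 R^2}{2}\right)$$

and

$$\mathbb{E}\left[\eta_t \,\middle|\, a_{I_1},\ldots,a_{I_{t-1}},\eta_1,\ldots,\eta_{t-1}\right] = 0.$$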

In each round $t$, a DM follows a routing policy $\mathcal{P}$ to choose the path $I_t$ to route the packet, based on its past selections and previously observed feedback. Here, we consider the end-to-end (bandit) feedback setting, in which only the delay of the selected path is observable as a whole, rather than the individual (semi-bandit) feedback setting, in which the delays of all the traversed links are revealed. We measure the performance of $\mathcal{P}$ via the expected regret against the optimal policy with full knowledge of $\mu$:

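$$\mathbb{E}\left[\mathrm{Regret}_T(\mathcal{P})\right] = \mathbb{E}\left[\sum_{t=1}^{T} L_{t,I_t} - \min_{k \in [K]} \sum_{t=1}^{T} L_{t,k}\right] = \sum_{t=1}^{T} \langle a_{I_t}, \mu \rangle - T \langle a^*, \mu \rangle,$$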

where $a^*$ is the optimal path. In this paper, we require that $a^*$ is unique. For any path $a_k$, we define $\Delta_k = \langle a_k - a^*, \mu \rangle$ as the difference of expected delay, i.e., the gap, between $a_k$ and $a^*$. The maximum and minimum of $\Delta_k$ over all $k$ with $a_k \neq a^*$ are denoted as $\Delta_{\max}$ and $\Delta_{\min}$, and are referred to as the maximum and minimum gap, respectively. Without loss of generality, we assume $\mu \in [0,1]^d$ (we shall relax this in the numerical experiments in Section 7), so that each path’s expected delay is within $[0,d]$ and hence,

$$\Delta_{\max} \le d. \tag{1}$$
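To make the feedback model and the regret definition concrete, the following is a minimal sketch on a hypothetical three-link instance. The path vectors, the mean delays, the Gaussian noise (which is sub-Gaussian), and the helper name `expected_regret` are all illustrative assumptions, not taken from the paper.

```python
import numpy as np

def expected_regret(paths, mu, chosen):
    """Expected regret: sum_t <a_{I_t}, mu> - T * <a*, mu>."""
    delays = paths @ mu                      # expected delay of every path
    return float(delays[chosen].sum() - len(chosen) * delays.min())

# Hypothetical instance: rows are 0/1 link-incidence vectors of three paths.
paths = np.array([[1, 1, 0],
                  [0, 1, 1],
                  [1, 0, 1]])
mu = np.array([0.2, 0.3, 0.6])               # mean link delays, each in [0, 1]
gaps = paths @ mu - (paths @ mu).min()       # Delta_k for every path

# Bandit feedback: only the noisy end-to-end delay of the chosen path is seen.
rng = np.random.default_rng(0)
chosen = np.array([0, 1, 2, 0, 0])           # indices I_1, ..., I_5
feedback = paths[chosen] @ mu + rng.normal(0.0, 0.1, size=len(chosen))

regret = expected_regret(paths, mu, chosen)  # sums the gaps of the chosen paths
```

In this toy instance the first path is optimal, so only the two rounds that pick the other paths contribute to the regret, and the maximum gap stays below $d = 3$ as inequality (1) requires.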

As is common in stochastic bandit learning settings Auer et al. (2002); Abbasi-Yadkori et al. (2011), we distinguish between two different regret measures, namely the instance-dependent regret and the worst case regret:

• Instance-dependent regret: A regret upper bound is called instance-dependent if it is comprised of quantities that depend only on $d$, $T$, the $\Delta_k$’s, and absolute constants.

• Worst case regret: A regret upper bound is called worst case if it is comprised of quantities that depend only on $d$, $T$, and absolute constants.

It is commonly known that when the $\Delta_k$’s are allowed in the regret expressions, the regret can fall into the $O(\log T)$ regime Auer et al. (2002). But depending on the choice of $\mu$, $\Delta_{\min}$ can become extremely small for any given $d$ and $T$, and the instance-dependent regret guarantee becomes meaningless. We therefore have to turn to the worst case regret bound. We note that the regret is given by the minimum of the instance-dependent regret and the worst case regret. Hence, it is desirable to obtain computationally efficient algorithms that have good instance-dependent and worst case regrets at the same time. The instance-dependent regret lower bound, which involves the maximal length over all the paths, is not yet known, but a lower bound on its order follows from the combinatorial semi-bandits setting Kveton et al. (2015), where individual link feedback is additionally available; see Lattimore and Szepesvari (2018). The tight worst case regret lower bound is given in Cohen et al. (2017).

### 2.3 Design Challenges and Solution Strategies

Since the mean link delay vector is unknown, and we only get to know the end-to-end delay of the chosen path in each round, the DM falls into the so-called exploration-exploitation dilemma. On one hand, the DM needs to explore the network to acquire an accurate estimate of the expected delay of each path; on the other, it needs to exploit the path with the least delay to ensure low regret. As our problem resembles the stochastic multi-armed bandits problem, there are at least two natural approaches to address it:

• Optimism-in-the-Face-of-Uncertainty (OFU): Following this principle, the DM balances exploration and exploitation by optimistically choosing the action with the lowest confidence bound, i.e., the empirical mean loss with the confidence interval subtracted. In Dani et al. (2008); Abbasi-Yadkori et al. (2011), this approach has been shown to work in the general linear stochastic bandits setting, yet as pointed out in Section 1, a direct adoption of the OFU principle to our problem cannot work. First, it fails to capture the underlying network structure, and yields sub-optimal instance-dependent and worst case regret bounds Abbasi-Yadkori et al. (2011). Even worse, the practicality of the algorithm is hindered by the high computational complexity of choosing the path to route. Indeed, it has been shown in Dani et al. (2008) that the path selection problem is polynomial-time equivalent to an NP-hard negative definite linearly constrained quadratic program.

• Explore-then-Exploit: Instead of doing exploration and exploitation simultaneously, the DM can collect data to construct accurate estimates for all actions’ losses by first performing uniform exploration over all possible actions, and eliminate an action whenever it is confident that this action is sub-optimal. This procedure runs until there is only one action left. It has been shown in Auer and Ortner (2010) that the adaptive exploration approach works well for the ordinary stochastic multi-armed bandits setting. A similar approach has been applied to our problem of interest by the authors of Liu and Zhao (2012), who achieve a sub-optimal instance-dependent regret, which depends on the rank of the path matrix, with an inefficient algorithm.

As it is unclear how to get the OFU approach to work efficiently in our setting, we adopt the explore-then-exploit approach here. An immediate difficulty in implementing this approach is that the DM cannot afford to uniformly explore exponentially many paths. It is thus of great importance to devise a way to efficiently collect data in the stochastic online shortest path routing setting.

## 3 Exploration Basis

In order to execute the uniform exploration efficiently, the DM relies on a basis for the network. Intuitively, a set $\mathcal{B}$ is a basis for $\mathcal{A}$ if it “spans” the set $\mathcal{A}$, i.e., each path in $\mathcal{A}$ can be expressed as a linear combination of the paths in $\mathcal{B}$. If the DM is able to accurately estimate the delays of the basis paths, it can also construct accurate delay estimators for all the paths in $\mathcal{A}$ thanks to the linearity property. It is worth noting that the concept of an exploration basis has been raised in adversarial linear bandits before Awerbuch and Kleinberg (2004), and we review it here as it is going to be useful for our problem.

### 3.1 Barycentric Spanners and Network Identifiability

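Note that we have several requirements for $\mathcal{B}$. First of all, the paths of $\mathcal{B}$ should come from $\mathcal{A}$, i.e., $\mathcal{B} \subseteq \mathcal{A}$, so that the DM can select them. Next, the set $\mathcal{B}$ should span the original path set $\mathcal{A}$, i.e., $\mathcal{A} \subseteq \mathrm{span}(\mathcal{B})$. Finally, denote the paths in $\mathcal{B}$ as $b_1,\ldots,b_d$ and suppose any path $a \in \mathcal{A}$ can be expressed as a linear combination of paths of $\mathcal{B}$, i.e., there exists $\nu_a \in \mathbb{R}^d$ such that

$$a = B\nu_a = \sum_{i=1}^{d} \nu_{a,i} b_i. \tag{2}$$

We require that the absolute value of any $\nu_{a,i}$ is bounded by some (small) positive constant $S$, i.e.,

$$\forall a \in \mathcal{A}\ \forall i \in [d]: \quad |\nu_{a,i}| \le S. \tag{3}$$

To see the rationale behind the last requirement, we decompose the estimation error on $a$’s delay as follows:

$$\left|\langle a, \hat{\mu} - \mu \rangle\right| = \left|\left\langle \sum_{i=1}^{d} \nu_{a,i} b_i, \hat{\mu} - \mu \right\rangle\right| = \left|\sum_{i=1}^{d} \nu_{a,i} \langle b_i, \hat{\mu} - \mu \rangle\right|. \tag{4}$$

Here $\hat{\mu}$ is any estimate of $\mu$. From eq. (4), we can see that all the $\nu_{a,i}$’s should have small absolute values, as otherwise even a small estimation error can be scaled up drastically by any $\nu_{a,i}$ with a large absolute value. To this end, we use the concept of barycentric spanner introduced by the authors of Awerbuch and Kleinberg (2004):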
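As a short worked step (a sketch that uses only the triangle inequality and the coefficient bound $|\nu_{a,i}| \le S$), the decomposition of the estimation error on a path $a = \sum_{i} \nu_{a,i} b_i$ can be continued as

$$\left|\langle a, \hat{\mu} - \mu \rangle\right| \le \sum_{i=1}^{d} |\nu_{a,i}| \left|\langle b_i, \hat{\mu} - \mu \rangle\right| \le S \sum_{i=1}^{d} \left|\langle b_i, \hat{\mu} - \mu \rangle\right| \le S d \max_{i \in [d]} \left|\langle b_i, \hat{\mu} - \mu \rangle\right|.$$

So if every basis path is estimated to within $\epsilon$, every path in the span is estimated to within $Sd\epsilon$. This crude bound is only meant to illustrate why small coefficients matter; the deviation inequality used later in the analysis is sharper.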

###### Definition 1 (Barycentric spanner Awerbuch and Kleinberg (2004)).

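Let $W$ be a vector space over the real numbers, and $\mathcal{A} \subseteq W$ a subset whose linear span is a $d$-dimensional subspace of $W$. A set $\mathcal{B} = \{b_1,\ldots,b_d\} \subseteq \mathcal{A}$ is a barycentric spanner for $\mathcal{A}$ if every $a \in \mathcal{A}$ may be expressed as a linear combination of elements of $\mathcal{B}$ using coefficients in $[-1,1]$. $\mathcal{B}$ is an $S$-approximate barycentric spanner if every $a \in \mathcal{A}$ may be expressed as a linear combination of elements of $\mathcal{B}$ using coefficients in $[-S,S]$.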

The authors of Awerbuch and Kleinberg (2004) also presented a result regarding the existence and search of barycentric spanner.

###### Proposition 1 (Awerbuch and Kleinberg (2004)).

Suppose $\mathcal{A} \subseteq \mathbb{R}^d$ is a compact set not contained in any proper linear subspace. Given an oracle for optimizing linear functions over $\mathcal{A}$, for any $S > 1$ we may compute an $S$-approximate barycentric spanner for $\mathcal{A}$ in polynomial time, using $O(d^2 \log_S d)$ calls to the optimization oracle.

The authors of Awerbuch and Kleinberg (2004) also present an algorithm for finding an $S$-approximate barycentric spanner for any $S > 1$. For completeness of presentation, we include this in Appendix B. The assumption stated in Proposition 1 that the set $\mathcal{A}$ is not contained in any proper subspace is closely related to network identifiability. Informally, we say that a network $G$ with $d$ links is identifiable if its set of paths, $\mathcal{A}$, spans the space $\mathbb{R}^d$. In Theorem 3.1 of Ma et al. (2013), the authors showed that it is in general impossible for $G$ to be identifiable if all the paths in $\mathcal{A}$ originate from and end at the same pair of nodes, but Theorem 3.2 of Ma et al. (2013) also states that it is possible for a subgraph of $G$ to be identifiable. To facilitate our discussion, we call each of the links that is incident to either the source or the destination an external link, and all other links internal links. The network obtained from $G$ by removing the source and destination nodes as well as all the external links is called the internal network. In Fig. 1, the links incident to node 1 or node 6 are external links, while the rest are internal links. We can see that the internal network in Fig. 1 is identifiable, as its paths span the corresponding space. To this end, we temporarily make the following additional assumption (to be relaxed in Section 6).

###### Assumption 1.

The internal network of $G$ is identifiable, and the expected delays of all the external links are known a priori. To avoid clutter, we further assume that the expected delays of the external links are deterministically 0.

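With some abuse of notation, $d$ refers to the number of internal links whenever Assumption 1 is imposed, and it is equal to the rank of the path matrix $\mathcal{A}$. Given Proposition 1 and Assumption 1, the DM can pick a number $S > 1$ first, and then implement Algorithm 5 in Appendix B to identify in polynomial time the $S$-approximate barycentric spanner $\mathcal{B} = \{b_1,\ldots,b_d\}$, i.e., for any path $a \in \mathcal{A}$ there exists some $\nu_a \in [-S,S]^d$ such that $a = B\nu_a$. By the definition of $S$-approximate barycentric spanner, the maximal $\ell_2$ norm of $\nu_a$ over all $a \in \mathcal{A}$ is upper bounded by $S\sqrt{d}$, i.e.,

$$\max_{a \in \mathcal{A}} \|\nu_a\| \le \sqrt{\sum_{i=1}^{d} S^2} \le S\sqrt{d}. \tag{5}$$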
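The coefficient vector $\nu_a$ and the norm bound above can be checked numerically. The sketch below assumes a hypothetical basis of three 0/1 path vectors (stored as the columns of `B`) and one extra path; the helper name `spanner_coefficients` is illustrative only, and this is not the spanner-finding algorithm of Appendix B.

```python
import numpy as np

def spanner_coefficients(B, a):
    """Solve B @ nu = a, where the columns of B are the basis paths b_1..b_d."""
    return np.linalg.solve(B, a)

# Hypothetical basis: columns are b_1=(1,1,0), b_2=(0,1,1), b_3=(1,0,1).
B = np.array([[1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
a = np.array([1.0, 1.0, 1.0])        # a path vector outside the basis

nu = spanner_coefficients(B, a)      # coefficients of a in the basis
S = np.abs(nu).max()                 # smallest S that works for this path
d = B.shape[0]
norm_bound_holds = np.linalg.norm(nu) <= S * np.sqrt(d)
```

Here every coefficient equals 0.5, so this basis represents the extra path with coefficients well inside $[-1, 1]$.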

## 4 Explore-then-Commit Algorithm: A Warm-Up

In this section, we develop the Explore-then-Commit (EC) algorithm based on non-adaptive exploration to solve the problem.

### 4.1 Design Intuitions

The design of the EC algorithm follows an intuitive rationale: if the DM is able to recover the expected delay of each path of the barycentric spanner $\mathcal{B}$ accurately, it will also be able to accurately estimate the expected delay of each path, as each path is a linear combination of the elements of the barycentric spanner. Once the DM believes that the optimal path has been found with high probability, it can commit to this path and incur low regret. To begin, we assume that the DM knows the minimum gap $\Delta_{\min}$. We will later relax this assumption to obtain practical algorithms.

### 4.2 Design Details

Given a positive integer $n$, we aim at getting a good estimate of $\mu$ in the first $nd$ rounds, and then choose the estimated best path in each of the remaining $T - nd$ rounds. We thus call the first $nd$ rounds the exploration stage, and the remaining $T - nd$ rounds the committing stage. The EC algorithm divides the exploration stage into $n$ epochs of length $d$ and chooses each path in $\mathcal{B}$ once in every epoch until the end of the exploration stage. Afterwards, the EC algorithm makes use of the Ordinary Least Squares (OLS) estimator to construct an estimate for $\mu$. Specifically, the paths used in the first $n$ epochs (or $nd$ rounds) form the design matrix

$$D_n = (a_{I_1},\ldots,a_{I_{nd}})^\top,$$

and the observed losses form the response vector

$$r_n = (L_{1,I_1},\ldots,L_{nd,I_{nd}})^\top.$$

The OLS estimator then gives us

$$\hat{\mu}_n = (D_n^\top D_n)^{-1} D_n^\top r_n. \tag{6}$$

Thanks to the identifiability assumption, $D_n$ is full rank, and $\hat{\mu}_n$ is well-defined. One can easily verify that $D_n^\top D_n = n \sum_{i=1}^{d} b_i b_i^\top$. Finally, the EC algorithm applies an arbitrary shortest path algorithm to compute the path with the lowest estimated delay, and commits to this path in the committing stage.
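The exploration and committing steps can be sketched as follows. This is a minimal illustration under stated assumptions: a hypothetical basis and path set, Gaussian noise with standard deviation `R`, a least-squares solver in place of the explicit normal equations, and a brute-force `argmin` over a tiny enumerated path set standing in for the shortest path computation. The function names are illustrative.

```python
import numpy as np

def ec_estimate(B, mu, n, R, rng):
    """Explore each basis path (rows of B) once per epoch for n epochs,
    then return the OLS estimate of the mean link delay vector."""
    d = B.shape[0]
    D = np.tile(B, (n, 1))                         # (n*d) x d design matrix
    r = D @ mu + rng.normal(0.0, R, size=n * d)    # observed end-to-end delays
    mu_hat, *_ = np.linalg.lstsq(D, r, rcond=None) # least-squares solution
    return mu_hat

def commit_path(paths, mu_hat):
    """Index of the path with the lowest estimated delay (brute force)."""
    return int(np.argmin(paths @ mu_hat))

# Hypothetical instance: three basis paths plus one extra path.
B = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0]])
paths = np.vstack([B, [1.0, 1.0, 1.0]])
mu = np.array([0.2, 0.3, 0.6])

mu_hat = ec_estimate(B, mu, n=200, R=0.1, rng=np.random.default_rng(0))
best = commit_path(paths, mu_hat)
```

With full column rank, the least-squares solution coincides with the closed form in eq. (6); on a real network the `argmin` would be replaced by a shortest path routine run on the estimated link delays.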

### 4.3 Regret Analysis

To properly tune the parameter $n$, an essential tool is a deviation inequality on the estimation errors.

###### Theorem 1.

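After $m$ epochs of exploration, the probability that there exists a path $a \in \mathcal{A}$ such that the estimated mean delay of $a$ deviates from its mean delay by at least $SR\sqrt{(2\ln(2)d^2 + 4d\ln\delta^{-1})/m}$ is at most $\delta$, i.e.,

$$\Pr\left(\exists a \in \mathcal{A}: \left|\langle a, \mu \rangle - \langle a, \hat{\mu}_m \rangle\right| \ge SR\sqrt{\frac{2\ln(2)d^2 + 4d\ln\delta^{-1}}{m}}\right) \le \delta.$$

###### Proof.

The proof of Theorem 1 makes use of the convergence property of the OLS estimator and the fact that $\mathcal{B}$ is an $S$-approximate barycentric spanner. Please refer to Section A.1 for the complete proof. ∎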

We are now ready to present the regret bound of EC algorithm.

###### Theorem 2.

With the knowledge of $\Delta_{\min}$, the EC algorithm has the following regret bounds:

• Instance-dependent regret:

• Worst case regret:

###### Proof.

Please refer to Section A.2 for the complete proof. ∎

###### Remark 1.

The instance-dependent regret bound obtained in Theorem 2 is a significant improvement compared to the direct application of the OFU approach, and the worst case regret can be achieved without knowing $\Delta_{\min}$. Nevertheless, we should be aware that the choice of $n$ for the instance-dependent regret bound relies on knowing $\Delta_{\min}$, which is never the case in practice.

Though being computationally efficient, the above remark indicates that the non-adaptive EC algorithm is not sufficient to achieve optimal regret bounds.

## 5 Top-Two Comparison Algorithm: An Adaptive Exploration Approach

As we have seen from the previous discussions, the non-adaptive EC algorithm fails to make full use of the observed delays to explore adaptively, and its success relies almost solely on knowing $\Delta_{\min}$ ahead of time.

In this section, we study adaptive exploration algorithms that have been shown to achieve nearly optimal regret bounds in stochastic MAB Auer and Ortner (2010); Slivkins (2017) to obtain nearly optimal instance-dependent and worst case regret bounds. Different from those in ordinary stochastic MAB settings, the algorithm builds on top of a novel top two comparison (TTC) method to allow efficient computation. We start by attaining a nearly optimal instance-dependent regret bound, and then show how to attain a nearly optimal worst case regret bound simultaneously.

### 5.1 Design Intuitions

Adaptive exploration algorithms often serve as an alternative for UCB algorithms in stochastic multi-armed bandits Auer and Ortner (2010); Slivkins (2017). In Auer and Ortner (2010); Slivkins (2017), the DM uniformly explores all remaining actions, and periodically executes an action elimination rule to ensure with high probability that:

• The optimal action remains;

• The sub-optimal actions can be removed effectively.

until only one action is left, and commits to that action in the rest of the rounds. The adaptive exploration algorithms achieve optimal instance-dependent and worst case regret bounds for stochastic multi-armed bandits.

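We start by demonstrating how an adaptive exploration algorithm can achieve the nearly optimal instance-dependent regret bound. Similar to the EC algorithm, the adaptive exploration algorithm also splits the rounds into an exploration stage and a committing stage: in each epoch of the exploration stage, the DM selects every path in $\mathcal{B}$ once so that all of them have the same number of samples. To ease our presentation, we denote the estimated shortest path after $m$ epochs of uniform exploration as $\tilde{a}_m$, i.e.,

$$\tilde{a}_m \leftarrow \arg\min_{a \in \mathcal{A}} \langle a, \hat{\mu}_m \rangle,$$

and follow Theorem 1 to denote the confidence bound as $\tilde{\Delta}_m$, i.e.,

$$\tilde{\Delta}_m = SR\sqrt{\frac{2\ln(2)d^2 + 4d\ln\delta^{-1}}{m}}. \tag{7}$$

We denote the total length (in epochs) of the exploration stage by a random variable $N$. We then use a simple union bound to show that the probability that there exists a path $a$ such that the estimated mean delay of $a$ deviates from its mean delay by at least $\tilde{\Delta}_m$ at the end of any epoch $m$ in the exploration stage can be upper bounded as

$$\begin{aligned} \Pr\left(\exists m \in [N], a \in \mathcal{A}: \left|\langle a, \mu \rangle - \langle a, \hat{\mu}_m \rangle\right| \ge \tilde{\Delta}_m\right) &\le \sum_{m=1}^{N} \Pr\left(\exists a \in \mathcal{A}: \left|\langle a, \mu \rangle - \langle a, \hat{\mu}_m \rangle\right| \ge \tilde{\Delta}_m\right) \\ &\le \sum_{m=1}^{T/d} \delta \\ &\le \frac{T\delta}{d}, \end{aligned} \tag{8}$$

where we have used Theorem 1 and the fact that $N \le T/d$ (each epoch lasts $d$ rounds) in inequality (8). In other words, if we denote by $E$ the event that any path $a$’s estimated delay is within distance $\tilde{\Delta}_m$ from its true expected delay for all $m \in [N]$, i.e.,

$$E = \left\{\forall m \in [N]\ \forall a \in \mathcal{A}: \left|\langle a, \mu \rangle - \langle a, \hat{\mu}_m \rangle\right| \le \tilde{\Delta}_m\right\}, \tag{9}$$

then event $E$ holds with probability at least $1 - T\delta/d$ in the adaptive exploration algorithm. From inequality (1), we have $\Delta_{\max} \le d$, and the worst possible total regret (i.e., choosing the path with maximum gap in each round) an algorithm can incur is $T\Delta_{\max} \le Td$. We can therefore tune $\delta$ properly so that the regret incurred by the algorithm in case $E$ does not hold, which in expectation is at most $(T\delta/d) \cdot Td = T^2\delta$, is small. Therefore, we only need to focus on the case when $E$ holds.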

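Conditioned on $E$, we assert that the DM can detect whether any of the remaining paths $a_k$ is sub-optimal by checking whether

$$\langle a_k, \hat{\mu}_m \rangle - \langle \tilde{a}_m, \hat{\mu}_m \rangle > 2\tilde{\Delta}_m \tag{10}$$

holds at the end of each epoch $m$. Afterwards, the identified sub-optimal paths are eliminated. We use Figure 2 to illustrate the rationale behind this criterion. Note that in both Fig. 2(a) and 2(b), the horizontal right arrow is the positive number axis.

In Fig. 2(a), consider the positions of the estimates $\langle \tilde{a}_m, \hat{\mu}_m \rangle$ and $\langle a_k, \hat{\mu}_m \rangle$ on the axis. Conditioned on event $E$, $\langle \tilde{a}_m, \mu \rangle$ should lie within the interval $[\langle \tilde{a}_m, \hat{\mu}_m \rangle - \tilde{\Delta}_m, \langle \tilde{a}_m, \hat{\mu}_m \rangle + \tilde{\Delta}_m]$, while $\langle a_k, \mu \rangle$ should lie within the interval $[\langle a_k, \hat{\mu}_m \rangle - \tilde{\Delta}_m, \langle a_k, \hat{\mu}_m \rangle + \tilde{\Delta}_m]$. Now if $\langle \tilde{a}_m, \hat{\mu}_m \rangle$ and $\langle a_k, \hat{\mu}_m \rangle$ are more than $2\tilde{\Delta}_m$ away from each other, then

$$\langle \tilde{a}_m, \mu \rangle < \langle a_k, \mu \rangle. \tag{11}$$

In other words, path $a_k$ is sub-optimal, as its expected delay is strictly longer than that of $\tilde{a}_m$.

Similarly, in Fig. 2(b), consider the positions of the true expected delays $\langle a^*, \mu \rangle$ and $\langle a_k, \mu \rangle$ on the axis. Conditioned on event $E$, $\langle \tilde{a}_m, \hat{\mu}_m \rangle$ should lie to the left of $\langle a^*, \mu \rangle + \tilde{\Delta}_m$, while $\langle a_k, \hat{\mu}_m \rangle$ should lie to the right of $\langle a_k, \mu \rangle - \tilde{\Delta}_m$. Now if $\Delta_k > 4\tilde{\Delta}_m$, then

$$\langle a_k, \hat{\mu}_m \rangle - \langle \tilde{a}_m, \hat{\mu}_m \rangle > 2\tilde{\Delta}_m, \tag{12}$$

which means the sub-optimal path $a_k$ is detected according to criterion (10).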
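Both geometric arguments can be written out as one-line chains of inequalities. The following is a sketch that uses only the confidence event (every estimated path delay is within $\tilde{\Delta}_m$ of its true value) and the fact that $\tilde{a}_m$ minimizes the estimated delay. For the first argument, if the detection criterion holds for $a_k$, then

$$\langle a_k, \mu \rangle \ge \langle a_k, \hat{\mu}_m \rangle - \tilde{\Delta}_m > \langle \tilde{a}_m, \hat{\mu}_m \rangle + \tilde{\Delta}_m \ge \langle \tilde{a}_m, \mu \rangle,$$

so $a_k$ cannot be optimal. For the second argument, since $\langle \tilde{a}_m, \hat{\mu}_m \rangle \le \langle a^*, \hat{\mu}_m \rangle \le \langle a^*, \mu \rangle + \tilde{\Delta}_m$,

$$\langle a_k, \hat{\mu}_m \rangle - \langle \tilde{a}_m, \hat{\mu}_m \rangle \ge \left(\langle a_k, \mu \rangle - \tilde{\Delta}_m\right) - \left(\langle a^*, \mu \rangle + \tilde{\Delta}_m\right) = \Delta_k - 2\tilde{\Delta}_m,$$

which exceeds $2\tilde{\Delta}_m$ whenever $\Delta_k > 4\tilde{\Delta}_m$.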

We formalize these observations in the following lemma.

###### Lemma 1.

Conditioned on event $E$, if criterion (10) holds, then

1. path $a_k$ is sub-optimal;

2. any sub-optimal path $a_k$ with $\Delta_k > 4\tilde{\Delta}_m$ is detected.

###### Proof.

The proof follows from the above arguments. Please refer to Section A.3 for the complete proof. ∎

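These two nice properties of criterion (10) jointly guarantee that the optimal path remains in the set of remaining paths, and that any sub-optimal path is removed once $\tilde{\Delta}_m$ shrinks down to below $\Delta_{\min}/4$. Specifically, if $m$ arrives at a value $\bar{m}$ such that $\tilde{\Delta}_{\bar{m}} = \Delta_{\min}/4$, or

$$\bar{m} = \frac{S^2R^2\left(32\ln(2)d^2 + 64d\ln\delta^{-1}\right)}{\Delta_{\min}^2} \tag{13}$$

as follows from eq. (7), all sub-optimal paths should have been eliminated.

Roughly speaking, conditioned on $E$, the regret of the adaptive algorithm is

$$d\bar{m}\Delta_{\max} = O\left(\frac{(d^2\ln\delta^{-1} + d^3)\Delta_{\max}}{\Delta_{\min}^2}\right). \tag{14}$$

Recalling that the regret incurred when $E$ does not hold is controlled by the choice of $\delta$, the expected regret of this algorithm is upper bounded as $O\left((d^2\ln T + d^3)\Delta_{\max}/\Delta_{\min}^2\right)$, and we shall formalize this analysis in Theorem 3. Surprisingly, adaptivity saves us from a lack of knowledge of the exact value of $\Delta_{\min}$.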
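As a quick numeric sanity check of the relation between the confidence bound in eq. (7) and the epoch count in eq. (13), the sketch below evaluates both for hypothetical parameter values; the function names and the numbers are illustrative assumptions.

```python
import math

def confidence_radius(m, d, S, R, delta):
    """Confidence bound after m exploration epochs, as in eq. (7)."""
    return S * R * math.sqrt(
        (2 * math.log(2) * d**2 + 4 * d * math.log(1 / delta)) / m
    )

def epochs_to_separate(d, S, R, delta, gap_min):
    """Epoch count at which the confidence bound equals gap_min / 4, eq. (13)."""
    return (S**2 * R**2
            * (32 * math.log(2) * d**2 + 64 * d * math.log(1 / delta))
            / gap_min**2)

# Hypothetical parameters.
d, S, R, delta, gap_min = 3, 1.0, 0.1, 1e-4, 0.3
m_bar = epochs_to_separate(d, S, R, delta, gap_min)
radius_at_m_bar = confidence_radius(m_bar, d, S, R, delta)
```

Plugging the epoch count back into the confidence bound returns a quarter of the minimum gap, which is exactly the threshold at which every sub-optimal path satisfies the detection condition.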

### 5.2 Efficient Implementation

One may note that implementing the criterion (10) requires an enumeration over the set which is typically exponential in size (in terms of ). In this subsection, we further propose an polynomial time implementation, namely the Top Two Comparison (TTC) algorithm, for our problem.

Different from the adaptive exploration algorithms proposed for stochastic multi-armed bandit problems Auer and Ortner (2010); Slivkins (2017), which uniformly explores the set of remaining actions, our strategy decouples the exploration basis from path elimination by making use of the -approximate barycentric spanner In other words, the DM does not need to eliminate the sub-optimal paths one by one. As the optimal path is unique by assumption, it can instead remove all of them at the same time once the difference between the delay of the estimated shortest path and the delay of the estimated second shortest path is larger than for some epoch

To find the estimated second shortest path, we observe that it must traverse at least one link that is different from those in the estimated shortest path. The DM could start by iteratively setting the delay of links traversed by the shortest path to a large number, i.e., one at a time, while keeping the estimated delays of all other links intact, and find the delay of the shortest path with respect to the “perturbed” estimated delay vector. Finally, the minimum delay over these “perturbed” delays is the second shortest delay.
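The perturbation idea above can be sketched as follows. This is a minimal illustration, not the paper's Algorithm 2: the graph representation (nodes `0..n-1` already in topological order, links as `(tail, head)` pairs), the function names, and the constant `big` are all assumptions made for the example.

```python
from math import inf

def shortest_path(n, edges, delay, src, dst):
    """Shortest src->dst path in a DAG whose nodes 0..n-1 are topologically ordered.

    edges: list of (u, v) pairs with u < v; delay: per-link delays (same order).
    Returns (total_delay, list_of_link_indices), or (inf, []) if dst is unreachable.
    """
    dist = [inf] * n
    pred = [None] * n              # index of the link used to enter each node
    dist[src] = 0.0
    # Relaxing links in order of their tail node respects the topological order.
    for k in sorted(range(len(edges)), key=lambda k: edges[k][0]):
        u, v = edges[k]
        if dist[u] + delay[k] < dist[v]:
            dist[v] = dist[u] + delay[k]
            pred[v] = k
    if dist[dst] == inf:
        return inf, []
    path, node = [], dst
    while node != src:
        path.append(pred[node])
        node = edges[pred[node]][0]
    return dist[dst], path[::-1]

def second_shortest_delay(n, edges, delay, src, dst, big=1e9):
    """Delays of the shortest and second shortest paths via link perturbation.

    `big` is a hypothetical constant that must exceed the delay of any path.
    """
    best, path = shortest_path(n, edges, delay, src, dst)
    second = inf
    for k in path:                 # perturb one link of the shortest path at a time
        perturbed = list(delay)
        perturbed[k] = big
        cand, _ = shortest_path(n, edges, perturbed, src, dst)
        second = min(second, cand)
    return best, second

# Toy network: three paths from node 0 to node 3 with delays 6, 5 and 3.
edges = [(0, 1), (0, 2), (1, 3), (2, 3), (1, 2)]
delay = [1, 4, 5, 1, 1]
print(second_shortest_delay(4, edges, delay, 0, 3))  # (3.0, 5.0)
```

Each perturbed problem is one shortest path computation, so the number of calls is bounded by the number of links on the estimated shortest path.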

### 5.3 Design Details

We are now ready to formally present the TTC algorithm. Following the design guidelines presented in Sections 5.1 and 5.2, the TTC algorithm initializes the set of remaining paths as and divides the time horizon into epochs. In the epoch, the TTC algorithm distinguishes two cases:

1. If contains only one path, the TTC algorithm chooses this path, and sets

2. Otherwise, the TTC algorithm picks each path in once so that every path in has been selected times. It then computes the OLS estimate for and identifies the path with the least estimated delay, i.e., and the path with the second least estimated delay, i.e., via a second shortest path sub-routine. Afterwards, the TTC algorithm checks the gap between and If The set of remaining paths for the epoch is denoted as otherwise,

The pseudo-code of the TTC algorithm is shown in Algorithm 1 and the pseudo-code of the sub-routine for finding the second shortest path is shown in Algorithm 2. Please note that the algorithms are run in epochs (indexed by ), and can be represented by the incidence matrix of

### 5.4 Regret Analysis

The analysis essentially follows the intuition presented in Section 5.1, and the instance-dependent regret of the TTC algorithm is given by the following theorem.

###### Theorem 3.

For any the instance-dependent expected regret of the TTC algorithm is bounded as

$$\mathbb{E}\left[\text{Regret}_T(\text{TTC algorithm})\right] \le O\left(\frac{(d^2\ln T + d^3)\,\Delta_{\max}}{\Delta_{\min}^2}\right).$$

###### Proof.

Please refer to Section A.4 for the complete proof. ∎

We now comment on the bound provided in Theorem 3. In the worst case, i.e., when if the RHS of Theorem 3 is of order As the regret bound from adversarial linear bandits is of order this indicates that the instance-dependent regret bound becomes meaningless once becomes smaller than Even though adaptive exploration saves us from not knowing it cannot achieve a nearly optimal worst case regret bound automatically. This is because the TTC algorithm shares a similar structure with the EC algorithm, and, as we have seen in Theorem 2, tuning the parameter to achieve a sub-optimal worst case regret bound does not require any knowledge of either. Some other techniques are needed if we want to get nearly optimal instance-dependent and worst case regrets at the same time.

### 5.5 Getting Nearly Optimal Worst Case Regret

It turns out that we can get nearly optimal instance-dependent and worst case regrets at the same time with just a bit more effort. The key idea is to limit the length of the exploration stage so that once the smallest gap is believed to be smaller than with high probability, the DM switches to an efficient alternative algorithm for adversarial linear bandits to solve the problem. A candidate for the alternative algorithm can be found in Bubeck and Eldan (2015). Specifically, we set

$$\bar{n} = \sqrt{T}\,S^2R^2\left(2\ln(2)\,d + 8\ln T\right)/d^2,$$

and modify the TTC algorithm as follows:

1. For each epoch the DM runs the TTC algorithm;

2. If the set contains only one path, the DM selects this path in the rest of the rounds;

3. Else if the set contains more than one path, the DM finds that holds with probability at least and thus terminates the TTC algorithm, and runs the efficient algorithm for adversarial linear bandits in Bubeck and Eldan (2015) over the network to solve the problem.

We call this the Modified Top Two Comparison (MTTC) algorithm, and its pseudo-code is shown in Algorithm 3.

We are now ready to state the regret bound of the MTTC algorithm.

###### Theorem 4.

For any the MTTC algorithm has the following regret bounds:

• Instance-dependent regret:

$$O\left(\frac{(d^2\ln T + d^3)\,\Delta_{\max}}{\Delta_{\min}^2}\right).$$

• Worst case regret:

$$\tilde{O}\left(d\sqrt{T}\right).$$

###### Proof.

Please refer to Section A.5 for the complete proof. ∎

## 6 General Networks

The success of the TTC algorithm and the MTTC algorithm in achieving nearly optimal regrets relies on the identifiability assumption, i.e., Assumption 1, which might be violated in practice. For example, if the network scale grows large, it is very likely that even the internal network of is not fully identifiable. Also, if the external links are shared among many entities, it is hard to obtain the expected delays of all the external links. For a general network, one possible way to find a -approximate barycentric spanner is to project into some sub-space so that it is still full rank in that sub-space. But it is unclear how to implement the projection without enumerating all the paths in which is computationally inefficient. Therefore, we need a new technique for our problem. In this section, we show how to implement the MTTC algorithm for general networks. We start by proposing an algorithm for finding a basis of when does not span . We note that any basis of is automatically a -approximate barycentric spanner of with some (possibly unknown at first) positive number We then state the difference in estimating between identifiable and general networks, and present a general version of the OLS estimator with a provable deviation property. Throughout this section, we shall assume that the rank of is

### 6.1 Additional Notation

In this section, we make heavy use of matrix notation. For any matrix we use to denote its element at the row and column, and to denote its row and column vectors, respectively, and and to denote the matrices obtained by keeping only the to rows and to columns, respectively. Moreover, and are the matrices obtained by removing the row and of respectively. is the -by- matrix obtained by removing the row and column of simultaneously.

### 6.2 Efficient Algorithm for Finding the Basis

As a first step, we present a greedy algorithm that finds the basis of even when the network is unidentifiable. Inspired by the algorithm for finding the -approximate barycentric spanner for identifiable networks, i.e., Algorithm 5 in Appendix B, the high-level idea of the algorithm can be described as follows:

1. Initialize a matrix to the -by- identity matrix;

2. Greedily replace as many columns of as possible by paths in while keeping full rank;

3. All the columns in that are obtained from constitute

Since steps (1) and (3) can be easily implemented, we further elaborate on an iterative algorithm for step (2). For ease of presentation, we use to denote the resulting matrix after the iteration with At the beginning of the iteration, suppose can be written as

$$C_u = (C'_u, C''_u), \tag{15}$$

where are the columns obtained from while are the columns inherited from , the algorithm then finds a column such that replacing with an element in can result in a full rank matrix, and sets

$$C_{u+1} = (a, C_u(:,-j)), \tag{16}$$

where is the column index of This algorithm terminates once such cannot be found in after some iterations

To efficiently implement the above iterative algorithm, i.e., to find such in each iteration if it exists, we note that the matrix is full rank if and only if the determinant of is nonzero, i.e.,

$$\operatorname{rank}(C_{u+1}) = d \iff \det C_{u+1} \neq 0. \tag{17}$$

For now, suppose we are given a full rank matrix if the column of is replaced by an to form

$$C_u^j = \left(C_u(1, j-1),\, a,\, C_u(j+1, d)\right),$$

the determinant of can be written as a linear function of i.e.,

$$\det C_u^j = \sum_{i=1}^{d}\left[(-1)^{i+j}\det\left(C_u(-i,-j)\right)\right]a_i \tag{18}$$

by the Laplace expansion, and the value of can be computed efficiently using the LU decomposition. Now to find an index and that satisfies we can equivalently solve the following optimization problem

$$\max_{a\in\mathcal{A}} \left|\det C_u^j\right|, \tag{19}$$

for all If there exists some such that the solution satisfies we can then replace the column of by to form according to eq. (16).
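The linearity in eq. (18) can be checked numerically on a small example. The sketch below is only illustrative: it uses exact rational Gaussian elimination in place of an LU routine, indices are 0-based, and the helper names (`det`, `minor`, `cofactor_vector`, `replace_column`) are hypothetical.

```python
from fractions import Fraction

def det(M):
    """Determinant by Gaussian elimination with exact rational arithmetic."""
    A = [[Fraction(x) for x in row] for row in M]
    n, sign, prod = len(A), 1, Fraction(1)
    for c in range(n):
        pivot = next((r for r in range(c, n) if A[r][c] != 0), None)
        if pivot is None:
            return Fraction(0)         # a zero column below the diagonal: singular
        if pivot != c:
            A[c], A[pivot] = A[pivot], A[c]
            sign = -sign               # a row swap flips the sign
        prod *= A[c][c]
        for r in range(c + 1, n):
            f = A[r][c] / A[c][c]
            A[r] = [x - f * y for x, y in zip(A[r], A[c])]
    return sign * prod

def minor(M, i, j):
    """M with row i and column j removed."""
    return [row[:j] + row[j + 1:] for r, row in enumerate(M) if r != i]

def cofactor_vector(C, j):
    """Coefficients (-1)^(i+j) det(C(-i,-j)) multiplying a_i in eq. (18)."""
    return [(-1) ** (i + j) * det(minor(C, i, j)) for i in range(len(C))]

def replace_column(C, j, a):
    """Copy of C whose column j is replaced by the vector a."""
    return [row[:j] + [a[i]] + row[j + 1:] for i, row in enumerate(C)]

C = [[2, 1, 0], [0, 1, 1], [1, 0, 1]]
a = [1, 0, 1]
c = cofactor_vector(C, 0)
lhs = det(replace_column(C, 0, a))             # determinant after the replacement
rhs = sum(ci * ai for ci, ai in zip(c, a))     # linear form in a
print(lhs == rhs)  # True
```

Because the coefficients depend only on the current matrix and the column index, they can be computed once per column and reused for every candidate path.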

For a given defining a vector with each entry defined by eq. (18), i.e.,

$$\forall i\in[d]\quad c_{j,i} = (-1)^{i+j}\det\left(C_u(-i,-j)\right), \tag{20}$$

the optimal solution of (19) can be obtained by first solving the following two sub-problems

$$\max_{a\in\mathcal{A}} \langle c_j, a\rangle, \qquad \min_{a\in\mathcal{A}} \langle -c_j, a\rangle, \tag{21}$$

and then picking the solution with larger absolute value. To solve the first sub-problem, we can use the following steps:

1. Assign delay to link of for all

2. Compute the longest path. This requires a call to an appropriate efficient longest path algorithm for directed acyclic graphs, e.g., topological sorting Cormen et al. (2009).
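The two steps above can be sketched as a single pass over a topologically ordered DAG. This is a minimal illustration for the first sub-problem only; the graph representation (nodes `0..n-1` in topological order, links as `(tail, head)` pairs), the function name `longest_path`, and the toy weights are assumptions made for the example.

```python
from math import inf

def longest_path(n, edges, weight, src, dst):
    """Maximize the sum of link weights over src->dst paths in a DAG.

    Nodes 0..n-1 are assumed to be in topological order (u < v for every link).
    weight[k] plays the role of the delay assigned to link k in step 1.
    Returns (value, a), where a is the 0/1 link-incidence vector of the path,
    or (-inf, None) if dst is unreachable from src.
    """
    best = [-inf] * n
    pred = [None] * n              # index of the link used to enter each node
    best[src] = 0
    # Process links by tail node so best[u] is final before it is used.
    for k in sorted(range(len(edges)), key=lambda k: edges[k][0]):
        u, v = edges[k]
        if best[u] > -inf and best[u] + weight[k] > best[v]:
            best[v] = best[u] + weight[k]
            pred[v] = k
    if best[dst] == -inf:
        return -inf, None
    a, node = [0] * len(edges), dst
    while node != src:
        a[pred[node]] = 1
        node = edges[pred[node]][0]
    return best[dst], a

# Toy network with three paths from node 0 to node 3, of total weight 3, 2 and 1.
edges = [(0, 1), (0, 2), (1, 3), (2, 3), (1, 2)]
c = [2, -1, 1, 3, -4]
print(longest_path(4, edges, c, 0, 3))  # (3, [1, 0, 1, 0, 0])
```

Negative weights cause no difficulty here because the graph is acyclic, which is why a single topological pass suffices.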

The formal description of this algorithm for basis identification is shown in Algorithm 4.

We are now ready to prove the correctness of the algorithm, i.e., if the rank of is then Algorithm 4 returns a basis such that the rank of is

###### Lemma 2.

Algorithm 4 terminates in polynomial time. Upon termination, the matrix returned by Algorithm 4 is a basis of i.e., has linearly independent columns and for every there exists a vector such that

###### Proof.

Please refer to Section A.6 for the complete proof. ∎

###### Remark 2.

Although does not span we still develop an efficient algorithm for computing the basis of With some abuse of notation, we note that any basis of is automatically a -approximate barycentric spanner of with some positive number i.e.,

$$S = \max_{j\in[d_0],\, a\in\mathcal{A}} |\nu_{a,j}|. \tag{22}$$

However, since does not span the space as required by Proposition 1, we cannot set arbitrarily first with the hope that we can find the corresponding -approximate barycentric spanner using Algorithm 5 in Section B.

### 6.3 OLS Estimator for General Networks

With the new basis at hand, we can almost follow what we have developed in Section 5, i.e., eq. (6), to estimate But a more careful inspection suggests a crucial difference between the identifiable network setting and the general network setting: the -by- matrix is singular, i.e., for all As a result, we cannot compute the OLS estimate of in the same way as in eq. (6).

To allow the DM to implement the MTTC algorithm for general networks, we need to resolve the issues raised by the singularity of To this end, we use a slightly different version of the OLS estimator Rigollet and Hutter (2017), i.e., the OLS estimator of after epochs of exploration is

$$\hat{\mu}_m = \left(D_m^\top D_m\right)^{\dagger} D_m r_m, \tag{23}$$

where denotes the Moore-Penrose pseudo-inverse of We are now ready to state a new deviation inequality on the estimation errors. Here, with some abuse of notation, we recall from inequality (3) that is the upper bound on the absolute value of for all and

###### Theorem 5.

For a given positive integer the probability that there exists a path such that the estimated mean delay of deviates from its mean delay by at least is at most after epochs of exploration, i.e.,

$$\Pr\left(|\langle a,\mu\rangle - \langle a,\hat{\mu}\rangle| \ge SR\sqrt{\frac{32\ln(6)\,d_0^2 + 32\,d_0\ln\delta^{-1}}{m}}\right) \le \delta.$$

###### Proof.

Please refer to Section A.7 for the complete proof. ∎

### 6.4 Upper Bounding S and Obtaining Low Regrets

By design of the MTTC algorithm, we only need to change the following parameters according to Theorem 5:

$$\text{The length of each epoch} = d_0, \tag{24}$$

$$\tilde{\Delta}_m = SR\sqrt{\frac{32\ln(6)\,d_0^2 + 96\,d_0\ln T}{m}}, \quad \forall m = 1, 2, \ldots, \tag{25}$$