1 Introduction
We study the problem of shortest path routing over a network, where the link delays are not known in advance. When delays are known, it is possible to compute the shortest path in polynomial time via the celebrated Dijkstra’s algorithm Dijkstra (1959) or the Bellman-Ford algorithm Bellman (1958). However, link delays are often unknown, and evolve over time according to some unknown stochastic process. Moreover, there are many real-world scenarios in which only the end-to-end delays are observable. For example, an overlay network is a communication network architecture that integrates controllable overlay nodes into an uncontrollable underlay network of legacy devices. It is generally difficult to ensure individual link delay feedback when routing in an overlay network, as the underlay nodes are not necessarily cooperative. Fig. 1 shows a very simple overlay network, where the only overlay nodes are the source node (node 1) and the destination node (node 6), while the nodes within the dotted circle are underlay nodes. Here, the Decision Maker (DM) can choose to route the packets along one of the five paths available. If it picks a path, it can only observe the realized delay of the whole path, not the realized delays of the individual links on that path. These uncertainties and the network architectural constraints make the problem fall into the category of stochastic online shortest path routing with end-to-end feedback Talebi et al. (2018).
Stochastic online shortest path routing with end-to-end feedback is one of the most fundamental real-time decision-making problems. In its canonical form, a DM is presented with a network with links; each link’s delay is a random variable following an unknown stochastic process with unknown fixed mean over rounds. In each round, a packet arrives at the DM, which chooses a path to route the packet from the source to the destination. The packet then incurs a delay, which is the sum of the delays realized on the associated links. Afterwards, the DM learns the end-to-end delay, i.e., the realized delay of the path, but the individual links’ delays remain concealed. This is often called the bandit-feedback setting Talebi et al. (2018); Kveton et al. (2015). The DM’s goal is to design a routing policy that minimizes the cumulative expected delay. If the DM had full knowledge of the delay distributions, it would always route the packets through the path with the shortest expected delay. With that in mind, a reasonable performance metric for evaluating the policy is the expected regret, defined as the expected total delay of routing through the paths actually selected by the DM minus the expected total delay of routing through the path with the shortest expected delay. In order to minimize the regret, the DM needs to learn the delay distributions on-the-fly. One viable approach to estimating the path delays is to inspect the end-to-end delays experienced by packets sent on different paths. This gives rise to an exploration-exploitation dilemma. On one hand, the DM is not able to estimate the delay of an under-explored path; on the other, the DM wants to send the packet via the estimated shortest path to greedily minimize the cumulative delay incurred by the packets.
The Upper Confidence Bound (UCB) algorithm, following the Optimism-in-the-Face-of-Uncertainty (OFU) principle, is one of the most prevalent strategies for dealing with the exploration-exploitation dilemma. In the ordinary stochastic multi-armed bandit (MAB) setting, the UCB algorithm proposes a very intuitive policy framework: the DM should select actions by maximizing over rewards estimated from previous data, but only after biasing each estimate according to its uncertainty. Simply put, one should choose the action that maximizes the “mean plus confidence interval.” Treating the inverse of delay as reward, a naive application of the UCB algorithm to stochastic online shortest path routing can result in regret bounds and computation time that scale linearly with the number of paths. For small-scale overlay networks, this achieves low regret efficiently. However, networks often have exponentially many paths, and a direct implementation of the UCB algorithm is neither computationally efficient nor regret optimal. In the combinatorial semi-bandits setting, the realized delay of each individual link on the chosen path is revealed. The authors of Gai et al. (2012) take advantage of the individual feedback, and propose a solution for the problem by computing the UCB of each link. The authors of Kveton et al. (2015); Talebi et al. (2018) further design algorithms to match the regret lower bounds. Unfortunately, algorithms proposed for the semi-bandit feedback setting cannot be extended to the bandit feedback setting, as individual link feedback is not available.
When only end-to-end/bandit feedback is available, the authors of Liu and Zhao (2012) propose algorithms with regret that has optimal dependence on the total number of rounds (although the regret has suboptimal dependence on the size of the network). But the algorithms require the DM to enumerate over the path set to select a path in each round. This degrades the practicality of the algorithms significantly, especially when deployed in large-scale networks. Existing works have also tried to investigate the problem through the more general linear stochastic bandits setting, see e.g., Dani et al. (2008); Abbasi-Yadkori et al. (2009, 2011). Nevertheless, the proposed algorithms again suffer from high computational complexity Dani et al. (2008). Even worse, existing works in the linear stochastic bandits literature ignore the network structure of the action set. Hence, only suboptimal regret bounds are achieved.
As a matter of fact, the problem of stochastic online shortest path routing with end-to-end feedback falls into the category of combinatorial stochastic bandits, a special case of linear stochastic bandits with action set constrained to be a subset of However, finding efficient algorithms for combinatorial/linear stochastic bandits with (nearly) optimal regret remains an open problem Bubeck (2016). All of the above-mentioned findings motivate us to exploit the networked structure of the action set to design efficient algorithms for the stochastic online shortest path problem with end-to-end feedback. Specifically, we aim at answering the following question:
Can we leverage the power of the network structure to design efficient algorithms that achieve (nearly) optimal instance-dependent and worst case regret bounds simultaneously for stochastic online shortest path routing under bandit feedback?
In this paper, we give an affirmative answer to the above question. We start with algorithms for the stochastic online shortest path routing problem with identifiable network structure, and gradually remove the extra assumptions to arrive at the most general case. Specifically, our contributions can be summarized as follows:

Assuming network identifiability, we first develop an efficient non-adaptive exploration algorithm with nearly optimal instance-dependent regret and suboptimal worst case regret when the minimum gap is known. (The concepts of network identifiability, instance-dependent regret, worst case regret, and minimum gap are defined in Sections 2 and 3.)

The main contribution is an adaptive exploration algorithm with nearly optimal instance-dependent regret without any knowledge of the minimum gap. Coupled with the novel Top-Two Comparison technique, the algorithm can be efficiently implemented. We also propose a simple modification of the algorithm to achieve nearly optimal worst case regret simultaneously.

Complemented with an algorithm for finding a basis in general networks, we show that our results can be applied to general networks without degrading the regret performance.

We conduct extensive numerical experiments to validate that our proposed algorithms not only achieve superior regret performances, but also reduce the runtime drastically.
The rest of the paper is organized as follows. In Section 2, we describe the model of stochastic online shortest path routing with end-to-end feedback. In Section 3, we review the concepts of efficient exploration basis and make connections to network identifiability. Assuming network identifiability, in Section 4 we propose the non-adaptive Explore-then-Commit algorithm to achieve nearly optimal instance-dependent regret when the minimum gap is known. In Section 5, we present the novel Top-Two Comparison and modified Top-Two Comparison algorithms to achieve nearly optimal instance-dependent and worst case regrets without any additional knowledge. In Section 6, we further study the problem without network identifiability, and propose an efficient algorithm with nearly optimal instance-dependent regret. In Section 7, we present numerical results to demonstrate the empirical performance of the proposed algorithms. In Section 8, we review related works in the bandits literature. In Section 9, we conclude our paper.
2 Problem Formulation
2.1 Notation
Throughout the paper, all the vectors are column vectors by default unless specified otherwise. We define
to be the set for any positive integer We use to denote the norm of a vector To avoid clutter, we often omit the subscript when we refer to the norm. For a positive definite matrix , we use to denote the matrix norm of a vector We also denote as the minimum between We follow the convention to describe the growth rate using the notations and If logarithmic factors are ignored, we use and respectively.
2.2 Model
Given a directed acyclic network , an online stochastic shortest path problem is defined by a dimensional unknown but fixed mean link delay vector , paths for , and noise terms for where is the index for paths and is the index for rounds. Here, is the set of all possible paths in and for a path if and only if it traverses link With some abuse of notation, we use and interchangeably to denote path and we refer to as both a set and a matrix. Routing a packet through path in round incurs the delay Following the convention of the existing bandits literature Abbasi-Yadkori et al. (2011), we assume that is conditionally sub-Gaussian, where is a fixed and known constant. Formally, this means
and
In each round a DM follows a routing policy to choose the path to route the packet based on its past selections and previously observed feedback. Here, we consider the end-to-end (bandit) feedback setting, in which only the delay of the selected path is observable as a whole, rather than the individual (semi-bandit) feedback setting, in which the delays of all the traversed links are revealed. We measure the performance of via expected regret against the optimal policy with full knowledge of
where is the optimal path. In this paper, we require that is unique. For any path we define as the difference of expected delay, i.e., the gap, between and The maximum and minimum of over all with are denoted as and and are referred to as the maximum and minimum gap, respectively.
Without loss of generality, we assume (we shall relax this in the numerical experiments in Section 7) so that each path’s expected delay is within and hence,
(1) 
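The gap and regret definitions above can be illustrated numerically. The following is a minimal sketch, assuming paths are stored as rows of a 0/1 link-incidence matrix; the names `PATHS`, `THETA`, `path_gaps`, and `expected_regret`, as well as the toy numbers, are hypothetical and do not come from the paper.

```python
import numpy as np

def path_gaps(paths, theta):
    """Return (expected path delays, gaps to the best path, index of the best path)."""
    delays = paths @ theta                 # expected delay = sum of mean link delays
    best = int(np.argmin(delays))
    return delays, delays - delays[best], best

def expected_regret(paths, theta, chosen):
    """Expected regret of a sequence of chosen path indices (one index per round)."""
    _, gaps, _ = path_gaps(paths, theta)
    return float(gaps[np.asarray(chosen)].sum())

# Hypothetical network with 4 links and 3 source-destination paths.
PATHS = np.array([[1, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 0, 1, 1]], dtype=float)
THETA = np.array([0.1, 0.2, 0.3, 0.2])     # hypothetical mean link delays

delays, gaps, best = path_gaps(PATHS, THETA)
regret = expected_regret(PATHS, THETA, chosen=[0, 1, 2, 0])
```

In this toy instance the minimum and maximum gaps are the smallest and largest nonzero entries of `gaps`, and the regret is simply the sum of the gaps of the paths that were played.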
As is common in stochastic bandit learning settings Auer et al. (2002); Abbasi-Yadkori et al. (2011), we distinguish between two different regret measures, namely the instance-dependent regret and the worst case regret:

Instance-dependent regret: A regret upper bound is called instance-dependent if it is comprised of quantities that only depend on ’s, and absolute constants.

Worst case regret: A regret upper bound is called worst case if it is comprised of quantities that only depend on to and absolute constants.
It is commonly known that when ’s are allowed in the regret expressions, the regret can fall into the regime Auer et al. (2002). But depending on the choice of can become extremely small for any given and and the instance-dependent regret guarantee becomes meaningless. We therefore have to turn to the worst-case regret bound. We note that the regret is given by the minimum of the instance-dependent regret and the worst case regret. Hence, it is desirable to obtain computationally efficient algorithms that have good instance-dependent and worst case regrets at the same time. Denoting as the maximal length of all the paths, i.e., the instance-dependent regret lower bound is not yet known, but from the combinatorial semi-bandits setting Kveton et al. (2015), where individual feedback is additionally available, we know that it is of order at least Lattimore and Szepesvari (2018). The tight worst case regret lower bound is Cohen et al. (2017).
2.3 Design Challenges and Solution Strategies
Since the mean link delay vector is unknown, and we only get to know the end-to-end delay of the chosen path in each round, the DM faces the so-called exploration-exploitation dilemma. On one hand, the DM needs to explore the network to acquire an accurate estimate of the expected delay of each path; on the other, it needs to exploit the path with the least delay to ensure low regret. As our problem resembles the stochastic multi-armed bandits problem, there are at least two natural approaches to address it:

Optimism-in-the-Face-of-Uncertainty (OFU): Following this principle, the DM balances exploration and exploitation by optimistically choosing the action with the lowest confidence bound, i.e., the empirical mean loss with the confidence interval subtracted. In Dani et al. (2008); Abbasi-Yadkori et al. (2011), this approach has been shown to work in the general linear stochastic bandits setting, yet as pointed out in Section 1, a direct adoption of the OFU principle does not work well for our problem. First, it fails to capture the underlying network structure, and yields suboptimal instance-dependent and worst case regret bounds Abbasi-Yadkori et al. (2011). Even worse, the practicality of the algorithm is hindered by the high computational complexity of choosing the path to route. Indeed, it has been shown in Dani et al. (2008) that the path selection step is polynomial-time equivalent to an NP-hard negative definite linearly constrained quadratic program.

Explore-then-Exploit: Instead of doing exploration and exploitation simultaneously, the DM can collect data to construct accurate estimates for all actions’ losses by first performing uniform exploration over all possible actions, and eliminate an action whenever it is confident that this action is suboptimal. This procedure runs until there is only one action left. It has been shown in Auer and Ortner (2010) that this adaptive exploration approach works well for the ordinary stochastic multi-armed bandits setting. A similar approach has been applied to our problem of interest by the authors of Liu and Zhao (2012), and they achieve a suboptimal instance-dependent regret with an inefficient algorithm. Here is the rank of the path matrix
As it is unclear how to get the OFU approach to work efficiently in our setting, we adopt the explore-then-exploit approach here. An immediate difficulty in implementing this approach is that the DM cannot afford to uniformly explore exponentially many paths. It is thus of great importance to devise a way to efficiently collect data in the stochastic online shortest path routing setting.
3 Exploration Basis
In order to execute the uniform exploration efficiently, the DM relies on a basis for the network. Intuitively, a set is a basis for if it “spans” the set i.e., each path in can be expressed as a linear combination of the paths in If the DM is able to accurately estimate the delays of the basis paths, it can also construct accurate delay estimators for all the paths in thanks to the linearity property. It is worth noting that the concept of an exploration basis has been raised in adversarial linear bandits before Awerbuch and Kleinberg (2004), and we review it here as it is going to be useful for our problem.
3.1 Barycentric Spanners and Network Identifiability
Note that we have several requirements for First of all, the paths of should come from i.e., so that the DM can select them. Next, the set should span the original path set i.e., Finally, denote the paths in as and suppose any path can be expressed as a linear combination of paths of i.e., there exist such that
(2) 
We require that the absolute value of any is bounded by some (small) positive constant i.e.,
(3) 
To see the rationale behind the last requirement, we decompose the estimation error on ’s delay as follows:
(4) 
Here is any estimate of From eq. (4), we can see that all the ’s should have small absolute values, as otherwise even a small estimation error can be scaled up drastically by any with a large absolute value. To this end, we recall the concept of barycentric spanner introduced by the authors of Awerbuch and Kleinberg (2004):
Definition 1 (Barycentric spanner Awerbuch and Kleinberg (2004)).
Let be a vector space over the real numbers, and a subset whose linear span is a dimensional subspace of A set is a barycentric spanner for if every may be expressed as a linear combination of elements of using coefficients in is the approximate barycentric spanner if every may be expressed as a linear combination of elements of using coefficients in
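To make Definition 1 concrete, the following minimal sketch checks numerically whether a candidate set is an approximate barycentric spanner of a finite set of vectors. The function names, the parameter `c` (the coefficient bound), and the toy vectors are hypothetical; the coefficients are obtained by least squares, which is one possible way to solve the linear system and is not taken from the paper.

```python
import numpy as np

def spanner_coefficients(basis, vectors):
    """Solve basis.T @ lam = v in the least-squares sense for every row v of `vectors`."""
    lam, *_ = np.linalg.lstsq(basis.T, vectors.T, rcond=None)
    return lam.T                              # row k holds the coefficients of vectors[k]

def is_approx_barycentric_spanner(basis, vectors, c=1.0, tol=1e-9):
    """True if every vector is a combination of basis rows with coefficients in [-c, c]."""
    lam = spanner_coefficients(basis, vectors)
    exact = np.allclose(lam @ basis, vectors, atol=1e-8)   # the combination must be exact
    return bool(exact and np.max(np.abs(lam)) <= c + tol)

# Hypothetical toy set of vectors in the plane.
VECTORS = np.array([[1, 0], [0, 1], [1, 1], [2, 1]], dtype=float)
BASIS_A = np.array([[1, 0], [0, 1]], dtype=float)   # needs coefficient 2 for (2, 1)
BASIS_B = np.array([[2, 1], [0, 1]], dtype=float)   # all coefficients lie in [-1, 1]
```

Here `BASIS_A` is only a 2-approximate barycentric spanner of `VECTORS`, whereas `BASIS_B` is a barycentric spanner, which illustrates why the choice of basis matters for keeping the coefficients small.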
The authors of Awerbuch and Kleinberg (2004) also presented a result regarding the existence and search of barycentric spanner.
Proposition 1 (Awerbuch and Kleinberg (2004)).
Suppose is a compact set not contained in any proper linear subspace. Given an oracle for optimizing linear functions over for any we may compute a approximate barycentric spanner for in polynomial time, using calls to the optimization oracle.
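The procedure behind Proposition 1 can be sketched for a small, explicitly listed set of vectors. The sketch below follows the determinant-swapping idea of Awerbuch and Kleinberg (2004), but, as a simplifying assumption, it replaces the linear optimization oracle by brute-force enumeration over the rows of `vectors`; for a network one would instead call a shortest path routine. The function name and the default `c=2.0` are hypothetical, and the input is assumed to span the whole space.

```python
import numpy as np

def approx_barycentric_spanner(vectors, c=2.0):
    """Return a c-approximate barycentric spanner (as rows) of a finite spanning set, c > 1."""
    vectors = np.asarray(vectors, dtype=float)
    d = vectors.shape[1]
    basis = np.eye(d)                          # start from the standard basis

    def best_swap(i):
        """Largest |det| obtainable by replacing row i, and the vector achieving it."""
        best_val, best_vec = -1.0, None
        for v in vectors:                      # brute force stands in for the oracle
            trial = basis.copy()
            trial[i] = v
            val = abs(np.linalg.det(trial))
            if val > best_val:
                best_val, best_vec = val, v
        return best_val, best_vec

    for i in range(d):                         # phase 1: move every row into the set
        basis[i] = best_swap(i)[1]

    improved = True
    while improved:                            # phase 2: swap while |det| grows by a factor c
        improved = False
        current = abs(np.linalg.det(basis))
        for i in range(d):
            val, vec = best_swap(i)
            if val > c * current:
                basis[i] = vec
                improved = True
                break
    return basis
```

By Cramer’s rule, the coefficient of the i-th spanner element in the expansion of any vector equals the ratio of the determinant with row i replaced by that vector to the determinant of the spanner, so the termination condition directly bounds every coefficient by `c` in absolute value.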
The authors of Awerbuch and Kleinberg (2004) also present an algorithm for finding a approximate barycentric spanner for any . For completeness of presentation, we include this in Appendix B. The assumption stated in Proposition 1 that the set is not contained in any proper subspace is closely related to network identifiability. Informally, we say that a network with links is identifiable if its set of paths, spans the space . In Theorem 3.1 of Ma et al. (2013), the authors showed that it is in general impossible for to be identifiable if all the paths in originate from and end at the same pair of nodes, but Theorem 3.2 of Ma et al. (2013) also states that it is possible for a subgraph of to be identifiable. To facilitate our discussion, we call each link that is incident to either the source or the destination an external link, and all other links internal links. A network with both the source and destination nodes as well as all the external links of removed is called the internal network. In Fig. 1, links and are external links, while the rest are internal links. We can see that the internal network with node is identifiable as the paths and span the space To this end, we temporarily make the following additional assumption (to be relaxed in Section 6).
Assumption 1.
The internal network of is identifiable, and the expected delays of all the external links are known a priori. To avoid clutter, we further assume that the delays of the external links are deterministically 0.
With some abuse of notation, refers to the number of internal links whenever Assumption 1 is imposed, and it is equal to Given Proposition 1 and Assumption 1, the DM can pick a positive number first, and then implement Algorithm 5 in Appendix B to identify in polynomial time the approximate barycentric spanner i.e., for any path there exists some such that By the definition of approximate barycentric spanner, the maximal norm of over all is upper bounded by i.e.,
(5) 
4 Explore-then-Commit Algorithm: A Warm-Up
In this section, we develop the Explore-then-Commit (EC) algorithm based on non-adaptive exploration to solve the problem.
4.1 Design Intuitions
The design of the EC algorithm follows an intuitive rationale: if the DM is able to recover the expected delay of each path of the accurately, it will also be able to accurately estimate the expected delay of each path, as the delay of each path is a linear combination of the elements in the barycentric spanner. Once the DM believes that the optimal path has been found with high probability, it can choose to commit to this path, and incur low regret. To begin, we assume that the DM knows the minimum gap We will later relax this assumption to obtain practical algorithms.
4.2 Design Details
Given a positive integer we aim at getting a good estimate of in the first rounds, and then choose the estimated best path in each of the remaining rounds. We thus call the first rounds the exploration stage, and the remaining rounds the committing stage. The EC algorithm divides the exploration stage into epochs of length and chooses each path in once in every epoch until the end of the exploration stage. Afterwards, the EC algorithm makes use of the Ordinary Least Squares (OLS) estimator to construct an estimate for Specifically, the paths used in the first epochs (or rounds) form the design matrix and the observed losses form the response vector
The OLS estimator then gives us
(6) 
Thanks to the identifiability assumption, is full rank, and is well-defined. One can easily verify Finally, the EC algorithm applies an arbitrary shortest path algorithm to compute the path with the lowest estimated delay, and commits to this path in the committing stage.
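The exploration and committing stages described above can be sketched as follows. This is an illustrative simulation under stated assumptions, not the paper’s implementation: the toy network `EDGES` (with zero-delay external links marked by `None`, in the spirit of Assumption 1), the Gaussian noise model, the number of epochs, and all function names are hypothetical, and the basis paths are supplied directly rather than computed as a barycentric spanner.

```python
import numpy as np

# Hypothetical toy DAG: node 0 is the source, node 4 the destination.
# Each edge is (tail, head, internal link index), or None for a zero-delay external link.
EDGES = [(0, 1, None), (0, 2, None), (1, 2, 0), (2, 3, 1), (1, 3, 2), (3, 4, None)]
NUM_NODES, NUM_LINKS = 5, 3

def shortest_path(theta):
    """Dynamic program over the DAG; node labels are assumed to be a topological order."""
    dist = [np.inf] * NUM_NODES
    links = [None] * NUM_NODES
    dist[0], links[0] = 0.0, []
    for tail, head, link in sorted(EDGES, key=lambda e: e[0]):
        if links[tail] is None:               # tail not reachable from the source
            continue
        cost = dist[tail] + (0.0 if link is None else theta[link])
        if cost < dist[head]:
            dist[head] = cost
            links[head] = links[tail] + ([] if link is None else [link])
    path = np.zeros(NUM_LINKS)
    path[links[-1]] = 1.0                     # indicator vector of traversed internal links
    return path, dist[-1]

def explore_then_commit(basis, theta, epochs, rounds, sigma=0.1, seed=0):
    """Play each basis path once per epoch, fit OLS, then commit to the estimated best path."""
    rng = np.random.default_rng(seed)
    design = np.tile(basis, (epochs, 1))      # design matrix of the exploration stage
    losses = design @ theta + sigma * rng.standard_normal(len(design))
    theta_hat, *_ = np.linalg.lstsq(design, losses, rcond=None)   # OLS estimate
    committed, _ = shortest_path(theta_hat)
    best_delay = shortest_path(theta)[1]
    explore_regret = float((design @ theta - best_delay).sum())
    commit_regret = (rounds - len(design)) * float(committed @ theta - best_delay)
    return theta_hat, committed, explore_regret + commit_regret
```

For instance, with the three source-destination paths of the toy network used as the basis and true mean delays `(0.2, 0.1, 0.5)`, 200 epochs of exploration are enough in this simulation for the committed path to coincide with the true shortest path, so all regret is incurred during exploration.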
4.3 Regret Analysis
To properly tune the parameter an essential tool is a deviation inequality on the estimation errors.
Theorem 1.
After epochs of explorations, the probability that there exists a path such that the estimated mean delay of deviates from its mean delay by at least is at most i.e.,
Proof.
We are now ready to present the regret bound of the EC algorithm.
Theorem 2.
With the knowledge of EC algorithm has the following regret bounds:

Instance-dependent regret:

Worst case regret:
Proof.
Please refer to Section A.2 for the complete proof. ∎
Remark 1.
The instance-dependent regret bound obtained in Theorem 2 is a significant improvement compared to the direct application of the OFU approach, and the worst case regret can be achieved without knowing Nevertheless, we should be aware that the choice of for the instance-dependent regret bound relies on knowing which is never the case in practice.
Though the non-adaptive EC algorithm is computationally efficient, the above remark indicates that it is not sufficient to achieve optimal regret bounds.
5 Top-Two Comparison Algorithm: An Adaptive Exploration Approach
As we have seen from the previous discussions, the non-adaptive EC algorithm fails to make full use of the observed delays to explore adaptively, and its success relies almost solely on knowing ahead of time.
In this section, we study adaptive exploration algorithms, which have been shown to achieve nearly optimal regret bounds in stochastic MAB Auer and Ortner (2010); Slivkins (2017), to obtain nearly optimal instance-dependent and worst case regret bounds. Different from those in ordinary stochastic MAB settings, the algorithm builds on top of a novel Top-Two Comparison (TTC) method to allow efficient computation. We start by attaining a nearly optimal instance-dependent regret bound, and then show how to attain a nearly optimal worst case regret bound simultaneously.
5.1 Design Intuitions
Adaptive exploration algorithms often serve as an alternative to UCB algorithms in stochastic multi-armed bandits Auer and Ortner (2010); Slivkins (2017). In Auer and Ortner (2010); Slivkins (2017), the DM uniformly explores all remaining actions, and periodically executes an action elimination rule to ensure with high probability that:

The optimal action remains;

The suboptimal actions can be removed effectively.
This continues until only one action is left, and the DM commits to that action in the rest of the rounds. The adaptive exploration algorithms achieve optimal instance-dependent and worst case regret bounds for stochastic multi-armed bandits.
We start by demonstrating how an adaptive exploration algorithm can achieve the nearly optimal instance-dependent regret bound. Similar to the EC algorithm, the adaptive exploration algorithm also splits the rounds into an exploration stage and a committing stage: in each epoch of the exploration stage, the DM selects every path in once so that all of them have samples. To ease our presentation, we denote the estimated shortest path after epochs of uniform exploration as i.e.,
and follow Theorem 1 to denote the confidence bound as i.e.,
(7) 
We denote the total length of exploration stage by a random variable We then use a simple union bound to show the probability that there exists a path such that the estimated mean delay of deviates from its mean delay by at least at the end of any epoch in the committing stage can be upper bounded as
(8)  
where we have used Theorem 1 and the fact that in inequality (8). In other words, if we denote the event as follows: any path ’s estimated delay is within distance from its true expected delay for all i.e.,
(9) 
then event holds with probability at least in the adaptive exploration algorithm. From inequality (1), we have and the worst possible total regret (i.e., choosing the path with maximum gap in each round) an algorithm can incur is We can tune properly, i.e., setting so that the regret incurred by the algorithm in case does not hold is at most Therefore, we only need to focus on the case when holds.
Conditioned on we assert that the DM could detect if any of the remaining paths is suboptimal by checking whether
(10) 
holds at the end of each epoch Afterwards, the identified suboptimal paths are eliminated. We use Figure 2 to illustrate the rationale behind this criterion. Note that in both Fig. 2(a) and 2(b), the horizontal right arrow is the positive number axis.
In Fig. 2(a), suppose and lie at and respectively. Conditioned on event should lie within the interval while should lie within the interval Now if and are more than away from each other, then
(11) 
In other words, path is suboptimal as its expected delay is at least longer than
Similarly in Fig. 2(b), suppose and lie at and respectively. Conditioned on event should lie to the left of while should lie to the right of Now if then
(12) 
which means the suboptimal path is detected according to criterion (10).
We formalize these observations in the following lemma.
Lemma 1.
Conditioned on event if criterion (10) holds, then

path is suboptimal;

any suboptimal path with is detected.
Proof.
The proof follows from the above arguments. Please refer to Section A.3 for the complete proof. ∎
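The two properties in Lemma 1 can be checked empirically on a toy instance. The sketch below is only an illustration under loudly stated assumptions: the exact form of criterion (10) is not legible in this version of the text, so the rule used here (flag a path when its estimated delay exceeds the smallest estimated delay by more than twice the confidence radius) and the detection threshold of four times the radius are assumptions chosen to match the argument around Fig. 2. All names are hypothetical.

```python
import numpy as np

def flagged_suboptimal(est_delays, radius):
    """Indices whose estimated delay exceeds the smallest estimate by more than 2 * radius."""
    est = np.asarray(est_delays, dtype=float)
    return set(np.flatnonzero(est - est.min() > 2.0 * radius).tolist())

def check_elimination_rule(true_delays, radius, trials=1000, seed=0):
    """Check both properties whenever every estimate is within `radius` of its true value."""
    rng = np.random.default_rng(seed)
    true = np.asarray(true_delays, dtype=float)
    gaps = true - true.min()
    must_flag = set(np.flatnonzero(gaps > 4.0 * radius).tolist())   # assumed threshold
    for _ in range(trials):
        est = true + rng.uniform(-radius, radius, size=true.shape)  # the "good" event
        flagged = flagged_suboptimal(est, radius)
        keeps_optimal = all(gaps[i] > 0.0 for i in flagged)
        if not keeps_optimal or not must_flag <= flagged:
            return False
    return True
```

The reasoning mirrors the text: on the good event, a flagged path has a true delay larger than that of the estimated shortest path, hence it cannot be optimal; conversely, a path whose gap exceeds four radii still shows an estimated difference of more than two radii and is therefore flagged.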
These two nice properties of criterion (10) jointly guarantee that the optimal path remains in , and any suboptimal path is removed once shrinks down to below Specifically, if arrives at a value that or
(13) 
follows from eq. (7), all suboptimal paths should have been eliminated.
Roughly speaking, conditioned on the regret of the adaptive algorithm is
(14) 
Recalling that the regret conditioned on is at most setting the expected regret of this algorithm is upper bounded as and we shall formalize this analysis in Theorem 3. Surprisingly, adaptivity saves us from a lack of knowledge on the exact value of
5.2 Efficient Implementation
One may note that implementing the criterion (10) requires an enumeration over the set which is typically exponential in size (in terms of ). In this subsection, we further propose a polynomial-time implementation, namely the Top-Two Comparison (TTC) algorithm, for our problem.
Different from the adaptive exploration algorithms proposed for stochastic multi-armed bandit problems Auer and Ortner (2010); Slivkins (2017), which uniformly explore the set of remaining actions, our strategy decouples the exploration basis from path elimination by making use of the approximate barycentric spanner In other words, the DM does not need to eliminate the suboptimal paths one by one. As the optimal path is unique by assumption, it can instead remove all of them at the same time once the difference between the delay of the estimated shortest path and the delay of the estimated second shortest path is larger than for some epoch
To find the estimated second shortest path, we make the observation that the estimated second shortest path must avoid at least one link that is traversed by the estimated shortest path. The DM can therefore iterate over the links traversed by the shortest path, setting the delay of one link at a time to a large number, i.e., while keeping the estimated delays of all other links intact, and find the delay of the shortest path with respect to each “perturbed” estimated delay vector. Finally, the minimum over these “perturbed” delays is the second shortest delay.
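The second shortest path computation described above can be sketched for a directed acyclic network as follows. This is a minimal sketch under assumptions: node labels are taken to be a topological order, every link carries its own (estimated) delay, and the names `dag_shortest_path`, `top_two_delays`, and the constant `big` are hypothetical rather than taken from the paper’s Algorithm 2.

```python
import numpy as np

def dag_shortest_path(edges, delays, source, target):
    """Return (delay, list of edge indices) of a shortest source-target path in a DAG."""
    num_nodes = 1 + max(max(u, v) for u, v in edges)
    dist = [np.inf] * num_nodes
    used = [None] * num_nodes
    dist[source], used[source] = 0.0, []
    # Relax edges in order of their tail node (valid for topologically ordered labels).
    for idx in sorted(range(len(edges)), key=lambda k: edges[k][0]):
        tail, head = edges[idx]
        if dist[tail] + delays[idx] < dist[head]:
            dist[head] = dist[tail] + delays[idx]
            used[head] = used[tail] + [idx]
    return dist[target], used[target]

def top_two_delays(edges, delays, source, target, big=1e9):
    """Delays of the shortest and the second shortest path via link perturbation."""
    delays = np.asarray(delays, dtype=float)
    best, best_edges = dag_shortest_path(edges, delays, source, target)
    second = np.inf
    for idx in best_edges:                    # perturb one link of the shortest path at a time
        perturbed = delays.copy()
        perturbed[idx] = big                  # effectively forbids this link
        second = min(second, dag_shortest_path(edges, perturbed, source, target)[0])
    return best, second

# Hypothetical 4-node example with source 0 and destination 3.
EXAMPLE_EDGES = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]
EXAMPLE_DELAYS = [1.0, 4.0, 1.0, 5.0, 1.0]
best, second = top_two_delays(EXAMPLE_EDGES, EXAMPLE_DELAYS, source=0, target=3)
```

Since each perturbation requires one shortest path computation and a path has at most as many links as the network, the whole subroutine costs a polynomial number of shortest path calls; `big` must be chosen larger than any achievable path delay for the perturbation to act as a removal.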
5.3 Design Details
We are now ready to formally present the TTC algorithm. Following the design guidelines presented in Sections 5.1 and 5.2, the TTC algorithm initializes the set of remaining paths as and divides the time horizon into epochs. In the epoch, the TTC algorithm distinguishes two cases:

If contains only one path, TTC algorithm chooses this path, and sets

Otherwise, the TTC algorithm picks each path in once so that every path in has been selected times. It then computes the OLS estimate for and identifies the path with the least estimated delay, i.e., and the path with the second least estimated delay, i.e., via a second shortest path subroutine. Afterwards, the TTC algorithm checks the gap between and If The set of remaining paths for the epoch is denoted as otherwise,
The pseudocode of TTC algorithm is shown in Algorithm 1 and the pseudocode of the subroutine for finding second shortest path is shown in Algorithm 2. Please note that the algorithms are run in epochs (indexed by ), and can be represented by the incidence matrix of
5.4 Regret Analysis
The analysis essentially follows the intuition presented in Section 5.1, and the instance-dependent regret of the TTC algorithm is given by the following theorem.
Theorem 3.
For any the instance-dependent expected regret of the TTC algorithm is bounded as
Proof.
Please refer to Section A.4 for the complete proof. ∎
We now comment on the bound provided in Theorem 3. In the worst case, i.e., when if the RHS of Theorem 3 is of order As the regret bound from adversarial linear bandits is of order this indicates that the instance-dependent regret bound becomes meaningless once becomes smaller than Even though adaptive exploration saves us from not knowing it cannot achieve a nearly optimal worst case regret bound automatically. This is because the TTC algorithm shares a similar structure with the EC algorithm and, as we have seen in Theorem 2, tuning the parameter to achieve a suboptimal worst case regret bound does not require any knowledge of either. Some other techniques are needed if we want to get nearly optimal instance-dependent and worst case regrets at the same time.
5.5 Getting Nearly Optimal Worst Case Regret
It turns out that we can get nearly optimal instance-dependent and worst case regrets at the same time with just a bit more effort. The key idea is to limit the length of the exploration stage so that once the smallest gap is believed to be smaller than with high probability, the DM switches to an efficient alternative algorithm for adversarial linear bandits to solve the problem. A candidate for the alternative algorithm can be found in Bubeck and Eldan (2015). Specifically, we set
and modify the TTC algorithm as follows:

For each epoch the DM runs the TTC algorithm;

If the set contains only one path, the DM selects this path in the rest of the rounds;

Else if the set contains more than one path, the DM finds that holds with probability at least and thus terminates the TTC algorithm, and runs the efficient algorithm for adversarial linear bandits in Bubeck and Eldan (2015) over the network to solve the problem.
We name this as the Modified Top Two Comparison (MTTC) algorithm, and its pseudocode is shown in Algorithm 3.
We are now ready to state the regret bound of the MTTC algorithm.
Theorem 4.
For any the MTTC algorithm has the following regret bounds:

Instance-dependent regret:

Worst case regret:
Proof.
Please refer to Section A.5 for the complete proof. ∎
6 General Networks
The success of the TTC algorithm and the MTTC algorithm in achieving nearly optimal regrets relies on the identifiability assumption, i.e., Assumption 1, which might be violated in practice. For example, if the network scale grows large, it is very likely that even the internal network of is not fully identifiable. Also, if the external links are shared among many entities, it is hard to obtain the expected delays of all the external links. For a general network, one possible way to find a approximate barycentric spanner is to project into some subspace so that it is still full rank in that subspace. But it is unclear how to implement the projection without enumerating all the paths in which is computationally inefficient. Therefore, we are in need of a new technique for our problem. In this section, we show how to implement the MTTC algorithm for general networks. We start by proposing an algorithm for finding a basis of when does not span . We note that any basis of is automatically a approximate barycentric spanner of with some (possibly unknown at first) positive number We then state the difference in estimating between identifiable and general networks, and present a general version of the OLS estimator with a provable deviation property. Throughout this section, we shall assume that the rank of is
6.1 Additional Notation
In this section, we make heavy use of matrix notation. For any matrix we use to denote its element at the row and column, and to denote its row and column vectors, respectively, and and to denote the matrices obtained by keeping only the to rows and to columns, respectively. Moreover, and are the matrices obtained by removing the row and of respectively. is the by matrix obtained by removing the row and column of simultaneously.
6.2 Efficient Algorithm for Finding the Basis
As a first step, we present a greedy algorithm that finds a basis of even when the network is unidentifiable. Inspired by the algorithm for finding the approximate barycentric spanner for identifiable networks, i.e., Algorithm 5 in Appendix B, the high-level idea of the algorithm can be described as follows:

Initialize a matrix to the by identity matrix;

Greedily replace as many columns of as possible by paths in while keeping full rank.

All the columns in that are obtained from constitute
Since steps (1) and (3) can be easily implemented, we further elaborate on an iterative algorithm for step (2). For ease of presentation, we use to denote the resulting matrix after the iteration with At the beginning of the iteration, suppose can be written as
(15) 
where are the columns obtained from while are the columns inherited from . The algorithm then finds a column such that replacing with an element in results in a full rank matrix, and sets
(16) 
where is the column index of This algorithm terminates once such cannot be found in after some iterations.
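To make the greedy replacement concrete, the following is a minimal brute-force sketch of steps (1) to (3). It assumes the candidate paths are given explicitly as rows of a 0/1 link-incidence array, which is feasible only for toy networks; the efficient version developed below avoids this enumeration. The function name `greedy_path_basis` and the tolerance `tol` are illustrative choices, not part of the paper.

```python
import numpy as np


def greedy_path_basis(paths: np.ndarray, tol: float = 1e-9) -> np.ndarray:
    """Greedily swap identity columns for path vectors while keeping full rank.

    paths: array of shape (num_paths, d), one link-incidence vector per row.
    Returns the selected path vectors as rows; they form a basis of span(paths).
    """
    d = paths.shape[1]
    B = np.eye(d)                       # step (1): start from the identity
    from_paths = np.zeros(d, dtype=bool)
    progress = True
    while progress:                     # step (2): repeat until no swap is possible
        progress = False
        for i in range(d):
            if from_paths[i]:
                continue                # column i already holds a path
            for x in paths:
                trial = B.copy()
                trial[:, i] = x
                if abs(np.linalg.det(trial)) > tol:
                    B = trial           # the swap keeps the matrix full rank
                    from_paths[i] = True
                    progress = True
                    break
    return B[:, from_paths].T           # step (3): keep only the path columns


# Toy instance with d = 4 links and four paths of rank 3,
# since the first two rows sum to the same vector as the last two.
toy_paths = np.array([
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
], dtype=float)
basis = greedy_path_basis(toy_paths)
```

On this toy instance the returned basis has three rows, matching the rank of the path set, even though the paths do not span the full four-dimensional space.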
To efficiently implement the above iterative algorithm, i.e., to find such in each iteration if it exists, we note that the matrix is full rank if and only if the determinant of is nonzero, i.e.,
(17) 
For now, suppose we are given a full rank matrix if the column of is replaced by an to form
the determinant of can be written as a linear function of i.e.,
(18) 
by the Laplace expansion, and the value of can be computed efficiently using the LU decomposition. Now to find an index and that satisfies we can equivalently solve the following optimization problem
(19) 
for all If there exists some such that the solution satisfies we can then replace the column of by to form according to eq. (16).
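The linearity of the determinant in the replaced column can be checked numerically. The sketch below uses the standard cofactor identity: the coefficients of the Laplace expansion along column i are row i of the adjugate, which equals det(B) times row i of the inverse when B is full rank. The helper name `replacement_coefficients` and the example matrix are assumptions made for illustration.

```python
import numpy as np


def replacement_coefficients(B: np.ndarray, i: int) -> np.ndarray:
    """Return c such that det(B with column i replaced by x) equals c @ x.

    Uses adj(B) = det(B) * inv(B); row i of adj(B) holds the cofactors
    of the entries in column i. Requires B to be full rank.
    """
    return np.linalg.det(B) * np.linalg.inv(B)[i, :]


B_example = np.array([
    [2.0, 0.0, 1.0],
    [1.0, 3.0, 0.0],
    [0.0, 1.0, 1.0],
])
x_example = np.array([1.0, 1.0, 0.0])

# Compare the linear form against a direct determinant for every column.
linear_values = []
direct_values = []
for col in range(3):
    coeffs = replacement_coefficients(B_example, col)
    replaced = B_example.copy()
    replaced[:, col] = x_example
    linear_values.append(coeffs @ x_example)
    direct_values.append(np.linalg.det(replaced))
```

Because the coefficients depend only on the current matrix and the column index, searching for a path with a nonzero determinant reduces to optimizing a linear function over paths, which is what makes a shortest or longest path oracle applicable.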
For a given defining a vector with each entry defined by eq. (18), i.e.,
(20) 
the optimal solution of (19) can be obtained by first solving the following two subproblems
(21) 
and then picking the solution with larger absolute value. To solve the first subproblem, we can use the following steps:

Assign delay to link of for all

Compute the longest path. This requires a call to an appropriate efficient longest path algorithm for directed acyclic graphs, e.g., topological sorting Cormen et al. (2009).
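The two steps above can be sketched as follows. This is a generic longest path routine for directed acyclic graphs based on a topological order (Kahn's algorithm), not the paper's Algorithm 4; the function name, the edge-dictionary input format, and the toy weights are assumptions. Link weights may be negative, and the other subproblem can presumably be handled by negating all weights.

```python
from collections import defaultdict, deque
from typing import Dict, Hashable, List, Tuple

Edge = Tuple[Hashable, Hashable]


def longest_path_dag(
    weights: Dict[Edge, float], source: Hashable, target: Hashable
) -> Tuple[float, List[Edge]]:
    """Longest source-to-target path in a DAG given as {(u, v): weight}."""
    succ = defaultdict(list)
    indeg = defaultdict(int)
    nodes = set()
    for (u, v) in weights:
        succ[u].append(v)
        indeg[v] += 1
        nodes.update((u, v))

    # Kahn's algorithm: repeatedly remove nodes with no incoming links.
    queue = deque(n for n in nodes if indeg[n] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    if len(order) != len(nodes):
        raise ValueError("graph contains a cycle")

    # Relax links in topological order, keeping the largest value per node.
    best = {source: 0.0}
    pred = {}
    for u in order:
        if u not in best:
            continue  # u is not reachable from the source
        for v in succ[u]:
            cand = best[u] + weights[(u, v)]
            if v not in best or cand > best[v]:
                best[v] = cand
                pred[v] = u
    if target not in best:
        raise ValueError("target is not reachable from source")

    # Walk the predecessor pointers back from the target.
    path = []
    node = target
    while node != source:
        path.append((pred[node], node))
        node = pred[node]
    path.reverse()
    return best[target], path


toy_weights = {(1, 2): 1.0, (1, 3): -2.0, (2, 4): 0.5, (3, 4): 4.0, (2, 3): 1.0}
value, links = longest_path_dag(toy_weights, source=1, target=4)
```

In the toy graph the longest path uses links (1, 2), (2, 3) and (3, 4) with total weight 6.0. The running time is linear in the number of nodes and links, so each call is cheap compared with enumerating paths.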
The formal description of this algorithm for basis identification is shown in Algorithm 4.
We are now ready to prove the correctness of the algorithm, i.e., if the rank of is then Algorithm 4 returns a basis such that the rank of is
Lemma 2.
Proof.
Please refer to Section A.6 for the complete proof. ∎
Remark 2.
Although does not span we still develop an efficient algorithm for computing the basis of With some abuse of notation, we note that any basis of is automatically a approximate barycentric spanner of with some positive number i.e.,
(22) 
However, since does not span the space as required by Proposition 1, we cannot set arbitrarily first with the hope that we can find the corresponding approximate barycentric spanner using Algorithm 5 in Section B.
6.3 OLS Estimator for General Networks
With the new basis at hand, we can almost follow what we have developed in Section 5, i.e., eq. (6), to estimate But a more careful inspection reveals a crucial difference between the identifiable network setting and the general network setting: the by matrix is singular, i.e., for all As a result, we cannot compute the OLS estimate of in the same way as in eq. (6).
To allow the DM to implement the MTTC algorithm for general networks, we need to resolve the issues raised by the singularity of To this end, we use a slightly different version of the OLS estimator Rigollet and Hutter (2017), i.e., the OLS estimator of after epochs of exploration is
(23) 
where denotes the Moore-Penrose pseudoinverse of We are now ready to state a new deviation inequality on the estimation errors. Here, with some abuse of notation, we recall from inequality (3) that is the upper bound on the absolute value of for all and
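The role of the pseudoinverse can be illustrated with a generic least-squares sketch. It is not a transcription of eq. (23): the design matrix `X_basis` (one explored basis path per row), the noiseless observations, and the link delays `theta_true` are toy assumptions. The point is that the link-level estimate is not unique when the Gram matrix is singular, yet the estimated delay of any path in the span of the basis is still well defined.

```python
import numpy as np


def pinv_ols(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Minimum-norm least-squares estimate using the pseudoinverse of X^T X."""
    gram = X.T @ X                      # singular when rank(X) < number of links
    return np.linalg.pinv(gram) @ (X.T @ y)


# Toy network with 4 links; the three basis paths have rank 3 < 4.
X_basis = np.array([
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 1],
], dtype=float)
theta_true = np.array([1.0, 2.0, 3.0, 4.0])   # hypothetical mean link delays
y_obs = X_basis @ theta_true                  # noiseless end-to-end delays

theta_hat = pinv_ols(X_basis, y_obs)

# A fourth path lying in the span of the basis paths.
held_out_path = np.array([0.0, 1.0, 1.0, 0.0])
predicted_delay = held_out_path @ theta_hat
true_delay = held_out_path @ theta_true
```

Here `theta_hat` differs from `theta_true` because one direction of the link delays is unobservable, but `predicted_delay` agrees with `true_delay`, which is all the routing decision needs.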
Theorem 5.
For a given positive integer the probability that there exists a path such that the estimated mean delay of deviates from its mean delay by at least is at most after epochs of exploration, i.e.,
Proof.
Please refer to Section A.7 for the complete proof. ∎
6.4 Upper Bounding and Obtaining Low Regrets
By design of the MTTC algorithm, we only need to change the following parameters according to Theorem 5:
(24)  
(25)  