# Online Learning of Network Bottlenecks via Minimax Paths

In this paper, we study bottleneck identification in networks via extracting minimax paths. Many real-world networks have stochastic weights for which full knowledge is not available in advance. Therefore, we model this task as a combinatorial semi-bandit problem to which we apply a combinatorial version of Thompson Sampling and establish an upper bound on the corresponding Bayesian regret. Due to the computational intractability of the problem, we then devise an alternative problem formulation which approximates the original objective. Finally, we experimentally evaluate the performance of Thompson Sampling with the approximate formulation on real-world directed and undirected networks.


## 1 Introduction

Bottleneck identification constitutes an important task in network analysis, with applications including transportation planning and management (berman1987optimal), routing in computer networks (shacham1992multicast) and various bicriterion path problems (hansen80). The path-specific bottleneck on a path between a source and a target node in a network is defined as the edge with a maximal cost or weight according to some criterion such as transfer time, load, commute time, distance, etc. Then, the goal of bottleneck identification and avoidance is to find a path whose bottleneck is minimal. Thus, one may model bottleneck identification as the problem of computing the minimax edge over the given network/graph, to obtain an edge with a minimal largest gap between the source and target nodes. Equivalently, it can be formulated as a widest path problem or maximum capacity path problem (pollack1960letter) where the edge weights have been negated.

The aforementioned formulations assume that the network or the graph is fully specified, i.e., that all the edge weights are fully known. However, in practice, the edge weights might not be known in advance or they might include some inherent uncertainty. To tackle such situations, in this paper, we develop an online learning framework to learn the edge weight distributions of the underlying network while solving the bottleneck identification problem, for different problem instances. For this purpose, we view this as a multi-armed bandit (MAB) problem and focus on Thompson Sampling (TS) (thompson1933), a method that suits probabilistic online learning well.

Thompson Sampling is an early Bayesian method for addressing the trade-off between exploration and exploitation in sequential decision making problems. It has only recently been thoroughly evaluated through experimental studies (ChapelleL11; GraepelCBH10) and theoretical analyses (KaufmannKM12; agrawal2012analysis; russo2014learning), where it has been shown to be asymptotically optimal in the sense that it matches well-known lower bounds of these types of problems (lai1985asymptotically).

Among many other problem settings, Thompson Sampling has been adapted to online versions of combinatorial optimization problems with retained theoretical guarantees (wang2018thompson), where one application is to find shortest paths in graphs (liu2012adaptive; gai2012combinatorial; zou2014online; ijcai2020-0284).

Another commonly used method for these problems is Upper Confidence Bound (UCB) (Auer02), which utilizes optimism to balance exploration and exploitation. UCB has been adapted to combinatorial settings (chen2013combinatorial), and also exists in Bayesian variants (kaufmann2012bayesian). Recently, a variant of UCB has been studied for bottleneck avoidance problems in a combinatorial pure exploration setting (du2021combinatorial). They consider a different problem setting and method than ours, though their bottleneck reward function is similar to the one we use in our approximation method.

In this paper, we model the online bottleneck identification task as a stochastic combinatorial semi-bandit problem for which we develop a combinatorial variant of Thompson Sampling. We then derive an upper bound on the corresponding Bayesian regret that is tight up to a polylogarithmic factor, consistent with the existing lower bounds for combinatorial semi-bandit problems. We face the issue of computational intractability with the exact problem formulation. We thus propose an approximation scheme, along with a theoretical analysis of its properties. Finally, we experimentally investigate the performance of the proposed method on directed and undirected real-world networks from transport and collaboration domains.

## 2 Bottleneck Identification Model

In this section, we first introduce the bottleneck identification problem over a fixed network and then describe a probabilistic model to be used in stochastic and uncertain situations.

### Bottleneck identification over a network

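We model a network by a graph $G(V, E, w)$, where $V$ denotes the set of vertices (nodes) and each $e = (u, v) \in E$ indicates an edge between the vertices $u$ and $v$, where $u, v \in V$ and $u \neq v$. Moreover, $w : E \to \mathbb{R}$ is a weight function defined for each edge of the graph, where for convenience, we use $w_e$ to denote the weight of edge $e$. If $G$ is directed, the pair $(u, v)$ is ordered, otherwise, it is not (i.e., $(u, v) \equiv (v, u)$ for undirected graphs). A path $p$ from vertex $u$ (source) to vertex $v$ (target) over $G$ is a sequence of vertices $\langle v_1, v_2, \ldots, v_{n_p} \rangle$ where $v_1 = u$, $v_{n_p} = v$ and $(v_i, v_{i+1}) \in E$ for all $i \in [1, n_p - 1]$. It can also be seen as a sequence of edges $\langle e_1, e_2, \ldots, e_{n_p - 1} \rangle$.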

As mentioned, a bottleneck on a path can be described as an edge with a maximal weight on that path. To find the smallest feasible bottleneck between the source node $u$ and the target node $v$, we consider all the paths between them. For each path, we pick an edge with a maximal weight, to obtain all path-specific bottlenecks. We then identify the smallest path-specific bottleneck edge in order to find the best feasible bottleneck, i.e., such that larger bottlenecks are avoided.

Therefore, given graph $G$, the bottleneck edge between $u$ and $v$ can be identified via extracting the minimax edge between them. With $P_{u,v}$ denoting the set of all possible paths from $u$ to $v$ over $G$, the bottleneck can be computed by

$$b(u, v; G) = \min_{p \in P_{u,v}} \max_{e \in p} w_e. \tag{1}$$

The quantity in Eq. 1 satisfies the (ultra) metric properties under some basic assumptions on the edge weights such as symmetry and nonnegativity. Hence, it is sometimes used as a proper distance measure to extract manifolds and elongated clusters in a non-parametric way (Chehreghani20Minimx; KimC07).

However, in our setting, such conditions do not need to be fulfilled by the edge weights. In general, we tolerate positive as well as negative edge weights, and we assume the graph might be directed, i.e., the edge weights are not necessarily symmetric. Therefore, despite the absence of (ultra) metric properties, the concept of minimax edges is still relevant for bottleneck identification.

To compute the minimax edge, one does not need to investigate all possible paths between the source and target nodes, which might be computationally infeasible. As studied in (Hu61), minimax edges and paths over an arbitrary undirected graph are equal to the minimax edges over any minimum spanning tree (MST) computed over that graph. This equivalence simplifies the calculation of minimax edges, as there is only one path between every two vertices over an MST, whose maximal edge weight yields the minimax edge, i.e., the desired bottleneck.
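The MST equivalence can be illustrated with a minimal sketch for undirected graphs. It processes edges in ascending weight order, as Kruskal's MST algorithm does, and reports the weight of the edge that first joins the source and target into one component; that weight equals the largest weight on their MST path. The function name, the integer node labels and the `(u, v, weight)` edge format are assumptions made for this illustration, not part of the paper.

```python
def minimax_bottleneck_undirected(num_nodes, edges, source, target):
    """Return the minimax bottleneck weight between source and target.

    edges: iterable of (u, v, weight) with nodes labelled 0..num_nodes-1.
    Returns None if source == target or if no path connects them.
    """
    if source == target:
        return None  # no edge has to be traversed

    parent = list(range(num_nodes))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Kruskal order: cheapest edges first (negative weights are fine).
    for u, v, weight in sorted(edges, key=lambda e: e[2]):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v
        if find(source) == find(target):
            # This edge is the heaviest one on the MST path.
            return weight
    return None
```

For example, with edges `[(0, 1, 5), (1, 3, 1), (0, 2, 2), (2, 3, 3)]` the two paths from node 0 to node 3 have bottlenecks 5 and 3, so the function returns 3.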

For directed graphs, an MST might not represent the minimax edges in a straightforward manner. Hence, we instead rely on a modification (berman1987optimal) of Dijkstra’s algorithm (dijkstra1959note) to extract minimax paths rather than the shortest paths.
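A minimal sketch of such a Dijkstra-style search is given below, assuming an adjacency-list representation. The only change from the shortest-path version is that a path's label is the maximum edge weight seen so far rather than the sum. Since taking a maximum never decreases the label, the usual greedy argument still applies, also with negative weights. The function name and data layout are hypothetical and not taken from (berman1987optimal).

```python
import heapq


def minimax_path(adj, source, target):
    """Find a path from source to target whose largest edge weight is minimal.

    adj: dict mapping node -> list of (neighbor, weight) for directed edges.
    Returns (bottleneck, path_as_node_list), or (None, []) if unreachable.
    """
    best = {source: float("-inf")}  # smallest known bottleneck per node
    prev = {}                       # predecessor on the best known path
    heap = [(float("-inf"), source)]
    done = set()

    while heap:
        bottleneck, node = heapq.heappop(heap)
        if node in done:
            continue  # stale heap entry
        done.add(node)
        if node == target:
            break
        for neighbor, weight in adj.get(node, []):
            # Path label is the maximum weight along the path, not the sum.
            candidate = max(bottleneck, weight)
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                prev[neighbor] = node
                heapq.heappush(heap, (candidate, neighbor))

    if target not in done:
        return None, []

    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    path.reverse()
    return best[target], path
```

Note that when `source == target` the sketch returns a bottleneck of negative infinity with a single-node path, since no edge is traversed.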

### Probabilistic model for bottleneck identification

As mentioned, we study bottleneck identification in uncertain and stochastic settings. Therefore, instead of considering the weights $w_e$ for $e \in E$ to be fixed, we view them as stochastic with fixed, albeit unknown, distribution parameters. Additionally, we assume that the weight of each edge follows a Gaussian distribution with known and bounded variance. The Gaussian edge weight assumption is common for many important problem settings, like minimization of travel time (seshadri2010algorithm) or energy consumption (ijcai2020-0284) in road networks. Furthermore, we assume that all edge weights are mutually independent. Hence,

$$w_e \sim \mathcal{N}(\theta^*_e, \sigma_e^2),$$

where $\theta^*_e$ denotes the unknown mean of edge $e$, and $\sigma_e^2$ is the known variance. Without loss of generality, we assume the bounded variance $\sigma_e^2 \leq 1$. However, we emphasize that we do not assume that $w_e$ and $\theta^*_e$ are bounded or non-negative.

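It is convenient to be able to make use of prior knowledge in online learning problems where the action space is large, which motivates a Bayesian approach where we assume that the unknown mean $\theta^*_e$ is sampled from a known prior distribution:

$$\theta^*_e \sim \mathcal{N}(\mu_{e,0}, \varsigma^2_{e,0})$$

We use a Gaussian prior for $\theta^*_e$ since it is conjugate to the Gaussian likelihood and allows for efficient recursive updates of posterior parameters upon a new weight observation $w_{e,t}$ at time $t$:

$$\varsigma^2_{e,t+1} \leftarrow \left(\frac{1}{\varsigma^2_{e,t}} + \frac{1}{\sigma_e^2}\right)^{-1} \tag{2}$$

$$\mu_{e,t+1} \leftarrow \varsigma^2_{e,t+1}\left(\frac{\mu_{e,t}}{\varsigma^2_{e,t}} + \frac{w_{e,t}}{\sigma_e^2}\right) \tag{3}$$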
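The recursive update in Eq. 2 and Eq. 3 can be sketched for a single edge as follows. The function and argument names are chosen for this illustration only.

```python
def update_posterior(mu, var, observation, noise_var):
    """One conjugate Gaussian update for a single edge.

    mu, var: current posterior mean and variance of the unknown edge mean.
    observation: newly observed edge weight.
    noise_var: known observation variance of the edge.
    Returns the updated (mean, variance).
    """
    # Eq. 2: precisions (inverse variances) add up.
    new_var = 1.0 / (1.0 / var + 1.0 / noise_var)
    # Eq. 3: precision-weighted combination of old mean and observation.
    new_mu = new_var * (mu / var + observation / noise_var)
    return new_mu, new_var
```

Starting from a prior with mean 0 and variance 1, a single observation of 2 with unit noise variance gives a posterior mean of 1 and a posterior variance of 0.5, i.e., the estimate moves halfway towards the observation.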

Since our long-term objective is to find a path which minimizes the expected maximum edge weight along that path, we need a framework to sequentially select paths to update these parameters and learn enough information about the edge weight distributions.

## 3 Online Bottleneck Learning Framework

Consider a stochastic combinatorial semi-bandit problem (cesa2012combinatorial) with time horizon $T$, formulated as a problem of cost minimization rather than reward maximization. There is a set of base arms $\mathcal{A}$ (where we let $d := |\mathcal{A}|$) from which we may at each time step $t \in [T]$ select a subset (or super arm) $a_t \subseteq \mathcal{A}$. The selection is further restricted such that $a_t \in \mathcal{I} \subseteq 2^{\mathcal{A}}$, where $\mathcal{I}$ is called the set of feasible super arms.

Upon selection of $a_t$, the environment reveals a feedback $X_{i,t}$ drawn from some fixed and unknown distribution for each base arm $i \in a_t$ (i.e., semi-bandit feedback). Furthermore, we receive a super arm cost from the environment, $c(a_t) := \max_{i \in a_t} X_{i,t}$, i.e., the maximum of all base arm feedback for the selected super arm and the current time step. The objective is to select super arms $a_t$ to minimize $\mathbb{E}\left[\sum_{t \in [T]} c(a_t)\right]$. This objective is typically reformulated as an equivalent regret minimization problem, where the (expected) regret is defined as

$$\mathrm{Regret}(T) := \left(\sum_{t \in [T]} \mathbb{E}[c(a_t)]\right) - T \cdot \min_{a \in \mathcal{I}} \mathbb{E}[c(a)] \tag{4}$$

To connect this to the probabilistic bottleneck identification model introduced in the previous section, we let each edge $e \in E$ in the graph $G$ correspond to exactly one base arm $i \in \mathcal{A}$. For the online minimax path problem, the feasible set of super arms $\mathcal{I}$ is then the set of all admissible paths in the graph, where the paths are directed or undirected depending on the type of graph. The feedback of each base arm $i$ is simply the Gaussian weight of the matching edge $e$, with known variance $\sigma_e^2$ and unknown mean $\theta^*_e$.

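We denote the expected cost of a super arm $a \in \mathcal{I}$ by $f_{\theta}(a) := \mathbb{E}\left[\max_{i \in a} Y_i\right]$, where $\theta$ is a mean vector and $Y_i \sim \mathcal{N}(\theta_i, \sigma_i^2)$ for $i \in a$, such that $\mathbb{E}[c(a)] = f_{\theta^*}(a)$. For Bayesian bandit settings and algorithms, it is common to consider the notion of Bayesian regret, with an additional expectation over problem instances drawn from the prior distribution:

$$\mathrm{BayesRegret}(T) := \mathbb{E}[\mathrm{Regret}(T)]$$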

### Thompson Sampling with exact objective

It is not sufficient to find the super arm which minimizes the estimated expected cost in each time step $t$, since a strategy which is greedy with respect to possibly imperfect current cost estimates may converge to a sub-optimal super arm. Thompson Sampling is one of several methods developed to address the trade-off between exploration and exploitation in stochastic online learning problems. It has been shown to exhibit good performance in many formulations, e.g., linear contextual bandits and combinatorial semi-bandits.

The steps performed in each time step $t$ by Thompson Sampling, adapted to our setting, are displayed in Algorithm 1. First, a mean vector $\tilde{\theta}$ is sampled from the current posterior distribution (or from the prior in the first time step). Then, a super arm $a_t$ is selected which minimizes the expected cost $f_{\tilde{\theta}}(a)$ with respect to the sampled mean vector. These first two steps are equivalent to selecting the super arm according to the posterior probability of it being optimal. In combinatorial semi-bandit problems, the method of finding the best super arm according to the sampled parameters is often called an oracle.

When the super arm $a_t$ is played, the environment reveals the feedback of base arm $i$ if and only if $i \in a_t$, which is a property called semi-bandit feedback. Finally, these observations are used to update the posterior distribution parameters.

### Regret analysis of TS for minimax paths

We use the technique for analyzing the Bayesian regret of Thompson Sampling for general bandit problems introduced by (russo2014learning) and further elaborated by (slivkins2019), carefully adapting it to our problem setting. This technique was originally devised to enable convenient conversion of existing UCB regret analyses to Thompson Sampling, but can also be applied to novel TS applications. In the rest of this section, we outline the most important steps of the proof of Theorem 1, leaving technical details to the supplementary material.

###### Theorem 1.

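The Bayesian regret of Algorithm 1 is $O(d\sqrt{T \log T})$.

We initially define a sequence of upper and lower confidence bounds, for each time step $t \in [T]$:

$$U_t(a) := f_{\hat{\theta}_{t-1}}(a) + \max_{i \in a} \sqrt{\frac{32 \log T}{N_{t-1}(i)}}, \qquad L_t(a) := f_{\hat{\theta}_{t-1}}(a) - \max_{i \in a} \sqrt{\frac{32 \log T}{N_{t-1}(i)}},$$

where $\hat{\theta}_{i,t-1}$ is the average feedback of base arm $i \in \mathcal{A}$ until time $t-1$ and $N_{t-1}(i)$ is the number of times base arm $i$ has been played as part of a super arm until time $t-1$.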

###### Lemma 2.

For Algorithm 1, we have that $\mathrm{BayesRegret}(T) = \sum_{t \in [T]} \mathbb{E}[U_t(a_t) - L_t(a_t)] + \sum_{t \in [T]} \mathbb{E}[f_{\theta^*}(a_t) - U_t(a_t)] + \sum_{t \in [T]} \mathbb{E}[L_t(a^*) - f_{\theta^*}(a^*)]$.

This Bayesian regret decomposition is a direct application of Proposition 1 of (russo2014learning). It utilizes the fact that given the history of selected arms and received feedback until time $t-1$, the played super arm $a_t$ and the best possible super arm $a^*$ are identically distributed under Thompson Sampling. Furthermore, also given the history, $U_t(a)$ and $L_t(a)$ are deterministic functions of the super arm $a$, enabling the decomposition of the regret into terms of the expected confidence width, the expected overestimation of the super arm with least mean cost, and the expected underestimation of the selected super arm. By showing that $f_{\theta^*}(a) \in [L_t(a), U_t(a)]$ with high probability, we can bound the last two of these terms.

###### Lemma 3.

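For any $t \in [T]$, we have that $\mathbb{E}[f_{\theta^*}(a_t) - U_t(a_t)] \leq \frac{4d}{T}$ and $\mathbb{E}[L_t(a^*) - f_{\theta^*}(a^*)] \leq \frac{4d}{T}$.

Both terms are bounded in the same way, for which we need a few intermediary results. Focusing on the overestimation of the optimal super arm, we can see that:

$$\mathbb{E}[L_t(a^*) - f_{\theta^*}(a^*)] = \mathbb{E}\left[f_{\hat{\theta}_{t-1}}(a^*) - f_{\theta^*}(a^*) - \max_{i \in a^*} \sqrt{\frac{32 \log T}{N_{t-1}(i)}}\right]$$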

First, in Lemma 4 the difference expressed in the first two terms between the true mean cost of a super arm and the corresponding estimated mean is bounded by the maximum of the differences of the true and estimated means of each individual base arm feedback, such that:

###### Lemma 4.

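For any super arm $a \in \mathcal{I}$ and time step $t \in [T]$, we have that $|f_{\theta^*}(a) - f_{\hat{\theta}_{t-1}}(a)| \leq 2 \max_{i \in a} |\theta^*_i - \hat{\theta}_{i,t-1}|$.

This is achieved by decomposing the absolute value into a sum of the positive and negative portions of the difference, then bounding each individually. Focusing on the positive portion by assuming that $f_{\hat{\theta}_{t-1}}(a) \geq f_{\theta^*}(a)$, and letting $\delta_{i,t-1} := \hat{\theta}_{i,t-1} - \theta^*_i$, $Y_i \sim \mathcal{N}(\theta^*_i, \sigma_i^2)$, $Z_i \sim \mathcal{N}(\hat{\theta}_{i,t-1}, \sigma_i^2)$ and $Q_i := Z_i - \delta_{i,t-1}$, for $i \in a$, we can see that:

$$
\begin{aligned}
f_{\hat{\theta}_{t-1}}(a) - f_{\theta^*}(a)
&= \mathbb{E}\left[\max_{i \in a} Z_i\right] - \mathbb{E}\left[\max_{i \in a} Y_i\right] \\
&= \mathbb{E}\left[\max_{i \in a} (Q_i + \delta_{i,t-1})\right] - \mathbb{E}\left[\max_{i \in a} Y_i\right] \\
&\leq \mathbb{E}\left[\max_{i \in a} Q_i\right] + \max_{i \in a} \delta_{i,t-1} - \mathbb{E}\left[\max_{i \in a} Y_i\right] \\
&= \max_{i \in a} \delta_{i,t-1}
\end{aligned}
$$

The last equality holds because $Q_i$ and $Y_i$ have the same distribution. The negative portion is bounded in the same way, directly leading to the result of Lemma 4. With this result, we can proceed with Lemma 3, where we let $\delta_{i,t-1} := \hat{\theta}_{i,t-1} - \theta^*_i$:

$$
\begin{aligned}
\mathbb{E}\left[2 \max_{i \in a^*} |\delta_{i,t-1}| - \max_{i \in a^*} \sqrt{\frac{32 \log T}{N_{t-1}(i)}}\right]
&\leq \mathbb{E}\left[2 \max_{i \in a^*} \left[|\delta_{i,t-1}| - \sqrt{\frac{8 \log T}{N_{t-1}(i)}}\right]_+\right] \\
&\leq 2 \sum_{i \in \mathcal{A}} \mathbb{E}\left[\left[|\delta_{i,t-1}| - \sqrt{\frac{8 \log T}{N_{t-1}(i)}}\right]_+\right] \\
&= 2 \sum_{i \in \mathcal{A}} \Pr\left\{|\delta_{i,t-1}| > \sqrt{\frac{8 \log T}{N_{t-1}(i)}}\right\} \cdot \qquad (5) \\
&\qquad \mathbb{E}\left[|\delta_{i,t-1}| - \sqrt{\frac{8 \log T}{N_{t-1}(i)}} \;\middle|\; |\delta_{i,t-1}| > \sqrt{\frac{8 \log T}{N_{t-1}(i)}}\right] \qquad (6)
\end{aligned}
$$

The probability in Eq. 5 is of the event that the difference between the estimated and true means of an arm exceeds the confidence radius $\sqrt{8 \log T / N_{t-1}(i)}$, while Eq. 6 is the expected difference conditional on that event. We bound Eq. 5 with Lemma 5 and Eq. 6 with Lemma 6.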

###### Lemma 5.

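For any $t \in [T]$ and $i \in \mathcal{A}$, we have $\Pr\left\{|\delta_{i,t-1}| > \sqrt{\frac{8 \log T}{N_{t-1}(i)}}\right\} \leq \frac{2}{T}$.

It is now sufficient to show that the difference $|\delta_{i,t-1}|$ is small for all base arms with high probability, which we accomplish using a standard concentration analysis through application of Hoeffding's inequality and union bounds.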

###### Lemma 6.

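For any $t \in [T]$ and $i \in \mathcal{A}$, we have

$$\mathbb{E}\left[|\delta_{i,t-1}| - \sqrt{\frac{8 \log T}{N_{t-1}(i)}} \;\middle|\; |\delta_{i,t-1}| > \sqrt{\frac{8 \log T}{N_{t-1}(i)}}\right] \leq 1$$

Though the rewards are unbounded, this expectation can be bounded by utilizing the fact that the mean of a truncated Gaussian distribution is increasing in the mean of the distribution before truncation, by Theorem 2 of (horrace2015moments). We can see that:

$$
\begin{aligned}
\mathbb{E}\left[|\delta_{i,t-1}| - \sqrt{\frac{8 \log T}{N_{t-1}(i)}} \;\middle|\; |\delta_{i,t-1}| > \sqrt{\frac{8 \log T}{N_{t-1}(i)}}\right]
&= \mathbb{E}\left[\delta_{i,t-1} - \sqrt{\frac{8 \log T}{N_{t-1}(i)}} \;\middle|\; \delta_{i,t-1} - \sqrt{\frac{8 \log T}{N_{t-1}(i)}} > 0\right] \\
&\leq \mathbb{E}\left[\delta_{i,t-1} \;\middle|\; \delta_{i,t-1} > 0\right]
\end{aligned}
$$

We know that $\delta_{i,t-1}$ is zero-mean Gaussian with variance at most one, hence $\mathbb{E}[\delta_{i,t-1} \mid \delta_{i,t-1} > 0] \leq 1$.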
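As a brief aside, the last step can be checked with the half-normal mean formula. Writing $s^2 \leq 1$ for the variance of the zero-mean Gaussian difference (the symbol $s$ is introduced here only for this check), we get

$$\mathbb{E}[\delta \mid \delta > 0] = s\sqrt{\frac{2}{\pi}} \approx 0.798\, s \leq 1 \quad \text{for } \delta \sim \mathcal{N}(0, s^2),\; s \leq 1.$$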

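With the result from Lemma 3, the last two terms of the regret decomposition in Lemma 2 are each bounded by $4d$, a constant in $T$. Focusing on the remaining term, we just need to show that $\sum_{t \in [T]} \mathbb{E}[U_t(a_t) - L_t(a_t)] = O(d\sqrt{T \log T})$ to prove Theorem 1:

$$
\begin{aligned}
\sum_{t \in [T]} \mathbb{E}[U_t(a_t) - L_t(a_t)]
&= \sqrt{128 \log T} \sum_{t \in [T]} \mathbb{E}\left[\max_{i \in a_t} \frac{1}{\sqrt{N_{t-1}(i)}}\right] \\
&\leq \sqrt{128 \log T} \sum_{t \in [T]} \mathbb{E}\left[\sum_{i \in a_t} \frac{1}{\sqrt{N_{t-1}(i)}}\right] \\
&= \sqrt{128 \log T} \sum_{i \in \mathcal{A}} \mathbb{E}\left[\sum_{t : i \in a_t} \frac{1}{\sqrt{N_{t-1}(i)}}\right] \\
&\leq \sqrt{128 \log T} \sum_{i \in \mathcal{A}} \mathbb{E}\left[2\sqrt{N_T(i)}\right] \\
&\leq \sqrt{128 \log T} \cdot \mathbb{E}\left[2\sqrt{d \sum_{i \in \mathcal{A}} N_T(i)}\right] \\
&\leq \sqrt{128 \log T} \cdot \mathbb{E}\left[2\sqrt{d^2 T}\right] \\
&= 2d\sqrt{128\, T \log T} \\
&= O(d\sqrt{T \log T})
\end{aligned}
$$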
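Two steps in this chain rest on standard facts, spelled out here as a short aside with $d$ denoting the number of base arms. For any integer $n \geq 1$, comparing the sum with an integral gives

$$\sum_{j=1}^{n} \frac{1}{\sqrt{j}} \leq \int_0^n \frac{dx}{\sqrt{x}} = 2\sqrt{n},$$

and by the Cauchy-Schwarz inequality, together with the observation that at most $d$ base arms are played in each of the $T$ time steps,

$$\sum_{i \in \mathcal{A}} \sqrt{N_T(i)} \leq \sqrt{d \sum_{i \in \mathcal{A}} N_T(i)} \leq \sqrt{d \cdot dT} = d\sqrt{T}.$$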

We note that the final upper bound is tight up to a polylogarithmic factor, according to existing lower bounds for combinatorial semi-bandit problems (kveton2015tight).

### Thompson Sampling with approximate objective

Unfortunately, exact expressions for computing the expected maximum of Gaussian random variables only exist when the variables are few. In other words, we cannot compute $f_{\theta}(a)$ exactly for a super arm $a$ containing many base arms, necessitating some form of approximation approach. While it is possible to approximate $f_{\theta}(a)$ through e.g., Monte Carlo simulations, we want to be able to perform the cost minimization step using an efficient oracle. Therefore, we propose an approximation method outlined in Algorithm 2, where the minimization step has been modified from Algorithm 1 with an alternative super arm cost function $\tilde{f}_{\theta}(a) := \max_{i \in a} \theta_i$.

Switching objectives from finding the super arm which minimizes the expected maximum base arm feedback to instead minimizing the maximum expected feedback has the benefit of allowing us to utilize the efficient deterministic minimax path algorithms introduced earlier for both directed and undirected graphs.
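The following is a minimal sketch of how Thompson Sampling with the approximate objective could be simulated on a small directed graph. It is an illustration under stated assumptions rather than the authors' implementation of Algorithm 2: the edge-list format, the function names, the simulated Gaussian environment and the requirement that the target is reachable from the source are all choices made here.

```python
import heapq
import random


def minimax_oracle(edges, means, source, target):
    """Edge indices of a path minimizing the largest mean, or None."""
    adj = {}
    for idx, (u, v) in enumerate(edges):
        adj.setdefault(u, []).append((v, idx))
    best = {source: float("-inf")}
    prev = {}
    heap = [(float("-inf"), source)]
    done = set()
    while heap:
        label, node = heapq.heappop(heap)
        if node in done:
            continue
        done.add(node)
        if node == target:
            break
        for nxt, idx in adj.get(node, []):
            cand = max(label, means[idx])  # bottleneck label, not a sum
            if cand < best.get(nxt, float("inf")):
                best[nxt] = cand
                prev[nxt] = (node, idx)
                heapq.heappush(heap, (cand, nxt))
    if target not in done:
        return None
    path, node = [], target
    while node != source:
        node, idx = prev[node]
        path.append(idx)
    return path[::-1]


def thompson_sampling(edges, true_means, noise_sd, prior_mu, prior_var,
                      source, target, horizon, rng):
    """Simulate TS with the minimax oracle; assumes target is reachable."""
    mu, var = list(prior_mu), list(prior_var)
    history = []
    for _ in range(horizon):
        # 1. Sample a mean vector from the current posterior.
        sampled = [rng.gauss(m, v ** 0.5) for m, v in zip(mu, var)]
        # 2. Oracle: minimize the maximum sampled mean along a path.
        path = minimax_oracle(edges, sampled, source, target)
        # 3. Semi-bandit feedback: observe every edge on the played path.
        for idx in path:
            w = rng.gauss(true_means[idx], noise_sd)
            new_var = 1.0 / (1.0 / var[idx] + 1.0 / noise_sd ** 2)
            mu[idx] = new_var * (mu[idx] / var[idx] + w / noise_sd ** 2)
            var[idx] = new_var
        history.append(path)
    return history, mu, var
```

A call such as `thompson_sampling([(0, 1), (1, 3), (0, 2), (2, 3)], [5.0, 1.0, 2.0, 3.0], 0.5, [0.0] * 4, [1.0] * 4, 0, 3, 50, random.Random(0))` returns the sequence of played paths together with the final posterior means and variances. Because the oracle only needs a mean vector, the same function can be reused with other ways of producing that vector.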

It is possible to use alternative notions of regret to evaluate combinatorial bandit algorithms with approximate oracles (chen2013combinatorial; chen2016combinatorial). For our experimental evaluation of Algorithm 2, we introduce the following definition of approximate regret:

$$\mathrm{ApproxRegret}(T) := \left(\sum_{t \in [T]} \tilde{f}_{\theta^*}(a_t)\right) - T \cdot \min_{a \in \mathcal{I}} \tilde{f}_{\theta^*}(a)$$
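As a small illustration of this definition, the approximate regret of a sequence of played paths can be computed as below when the feasible paths can be enumerated. That enumeration is only practical for tiny graphs and is assumed here for clarity; in larger graphs the minimum would come from a minimax path computation on the true means. All names are hypothetical.

```python
def approx_regret(chosen_paths, true_means, feasible_paths):
    """Approximate regret of a sequence of played paths.

    chosen_paths: list of paths, each a list of edge indices.
    true_means: true mean weight per edge index.
    feasible_paths: all admissible paths (enumerated; small graphs only).
    """
    def cost(path):
        # Approximate cost: the largest true mean along the path.
        return max(true_means[i] for i in path)

    best = min(cost(p) for p in feasible_paths)
    return sum(cost(p) for p in chosen_paths) - len(chosen_paths) * best
```

With true means `[5, 1, 2, 3]`, feasible paths `[[0, 1], [2, 3]]` and played paths `[[0, 1], [2, 3], [2, 3]]`, the costs are 5, 3 and 3 against an optimum of 3, so the approximate regret is 2.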

An alternative Bayesian bandit algorithm which can be used with the alternative objective is BayesUCB (kaufmann2012bayesian), which we use as a baseline for our experiments. Like Thompson Sampling, BayesUCB has been adapted to combinatorial semi-bandit settings (nuara2018combinatorial; ijcai2020-0284). Whereas Thompson Sampling in Algorithm 2 encourages exploration by applying the oracle to parameters sampled from the posterior distribution, with BayesUCB the oracle is instead applied to optimistic estimates based on the posterior distribution. In practice, this is accomplished for our cost minimization problem by using lower quantiles of the posterior distribution of each base arm. This principle of selecting plausibly optimal arms is called optimism in the face of uncertainty and is the underlying idea of all bandit algorithms based on UCB.
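A minimal sketch of such lower-quantile estimates for Gaussian posteriors is shown below. The quantile level $1/(t+1)$ is an assumption made for this illustration (it mirrors the common BayesUCB schedule while staying strictly inside $(0, 1)$); the text does not state which schedule is used in the experiments. The resulting vector would be passed to the minimax oracle in place of a posterior sample.

```python
from statistics import NormalDist


def bayes_ucb_indices(mu, var, t):
    """Optimistic (lower-quantile) cost estimates for each base arm.

    mu, var: posterior means and (positive) variances per base arm.
    t: current time step, t >= 1.
    The level 1 / (t + 1) is an illustrative choice, not from the paper.
    """
    level = 1.0 / (t + 1)
    return [NormalDist(m, v ** 0.5).inv_cdf(level) for m, v in zip(mu, var)]
```

At $t = 1$ the level is $0.5$, so the estimates equal the posterior means; for later time steps they fall below the means, more so for arms with a wide posterior.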

To connect the different objectives in Algorithm 1 and Algorithm 2, we note that $\tilde{f}_{\theta}(a) \leq f_{\theta}(a)$ by Jensen's inequality, and that the approximation objective consequently will underestimate super arm costs. However, we establish an upper bound on this difference through Theorem 7.

###### Theorem 7.

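Given the optimal super arm $a^*$ for Algorithm 1 and the optimal super arm $\tilde{a}^*$ for Algorithm 2, we have that $f_{\theta^*}(\tilde{a}^*) - f_{\theta^*}(a^*) \leq \sqrt{2 \log d}$.

For any super arm $a \in \mathcal{I}$, let $Y_i$ for $i \in a$ be Gaussian random variables with $Y_i \sim \mathcal{N}(\theta^*_i, \sigma_i^2)$. Furthermore, let $W_i := Y_i - \theta^*_i$, such that $W_i \sim \mathcal{N}(0, \sigma_i^2)$. Then, the following holds.

$$
\begin{aligned}
\mathbb{E}\left[\max_{i \in a} Y_i\right]
&= \mathbb{E}\left[\max_{i \in a} (W_i + \theta^*_i)\right] \\
&\leq \mathbb{E}\left[\max_{i \in a} W_i\right] + \max_{i \in a} \mathbb{E}[Y_i] \\
&\leq \sqrt{2 \log d} + \max_{i \in a} \mathbb{E}[Y_i],
\end{aligned}
$$

where the last inequality is due to Lemma 9 in (orabona2015optimal) and since $\sigma_i^2 \leq 1$ for all $i \in a$. We also note that, by Jensen's inequality, we have $\max_{i \in a} \mathbb{E}[Y_i] \leq \mathbb{E}[\max_{i \in a} Y_i]$. Moreover, by definition we know that $\mathbb{E}[\max_{i \in a^*} Y_i] \leq \mathbb{E}[\max_{i \in \tilde{a}^*} Y_i]$ and $\max_{i \in \tilde{a}^*} \mathbb{E}[Y_i] \leq \max_{i \in a^*} \mathbb{E}[Y_i]$. Consequently, we have,

$$\max_{i \in \tilde{a}^*} \mathbb{E}[Y_i] \leq \max_{i \in a^*} \mathbb{E}[Y_i] \leq \mathbb{E}\left[\max_{i \in a^*} Y_i\right] \leq \mathbb{E}\left[\max_{i \in \tilde{a}^*} Y_i\right] \leq \sqrt{2 \log d} + \max_{i \in \tilde{a}^*} \mathbb{E}[Y_i]$$

Hence, we can conclude that

$$f_{\theta^*}(\tilde{a}^*) - f_{\theta^*}(a^*) \leq \sqrt{2 \log d}.$$

In other words, Theorem 7 holds and the optimal solutions of the exact Algorithm 1 and the approximate Algorithm 2 differ by at most $\sqrt{2 \log d}$. This bound is independent of the mean vector $\theta^*$, depending only on the number of base arms $d$ and that the variance is bounded.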
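The inequality chain used in the proof can be checked numerically with a small Monte Carlo sketch. It estimates the expected maximum of independent Gaussians and compares it with the maximum of the means and with that maximum plus $\sqrt{2 \log d}$. The function names and the choice to take $d$ as the number of sampled variables are assumptions made for this illustration.

```python
import math
import random


def expected_max_mc(means, sds, num_samples, rng):
    """Monte Carlo estimate of E[max_i Y_i] for independent Gaussians."""
    total = 0.0
    for _ in range(num_samples):
        total += max(rng.gauss(m, s) for m, s in zip(means, sds))
    return total / num_samples


def approximation_gap_bound(num_arms):
    """The sqrt(2 log d) term from Theorem 7 (variances at most one)."""
    return math.sqrt(2.0 * math.log(num_arms))
```

For instance, with five arms of mean zero and unit variance, `expected_max_mc([0.0] * 5, [1.0] * 5, 20000, random.Random(1))` should land between `0` and `approximation_gap_bound(5)`, in line with the lower bound from Jensen's inequality and the upper bound from the theorem.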

## 4 Experimental Results

In this section, we conduct bottleneck identification experiments using Algorithm 2 for two real-world applications: (i) road (transport) networks, and (ii) collaboration (social) networks. These experiments are performed with an extended version of the simulation framework in (tstutorial) and evaluated using our approximate definition of regret.

In the supplementary material, we provide more details on the experimental results. In addition, there, we compare Algorithm 1 to Algorithm 2 through a toy example.

### Road networks

Identification of bottlenecks in a road network is a vital tool for traffic planners to analyze the network and prevent congestion. In this application, our goal is to find the bottleneck between a source and a target, i.e., a road segment which is necessary to pass and also has minimal traffic flow. In the road network model, we let the nodes represent intersections and the directed edges represent road segments, with travel time divided by distance (seconds per meter) as edge weights. The bottleneck between a pair of intersections is the minimum bottleneck over all paths connecting them, where the bottleneck for each of these paths is the largest weight over all road segments along it. Note that in order for the bottleneck between a pair of intersections to have a meaning, there needs to exist at least one path connecting them.

We collect road networks of four cities, shown in Table 1, from OpenStreetMap (OpenStreetMap), where the average travel time as well as the distance is provided for each (directed) edge. We simulate an environment with the stochastic edge weights sampled from $\mathcal{N}(\theta^*_e, \sigma_e^2)$, where $\sigma_e$ is the observation noise. For the experiments, the environment samples the true unknown mean $\theta^*_e$ from the known prior $\mathcal{N}(\mu_{e,0}, \varsigma^2_{e,0})$, where $\mu_{e,0}$ is the average travel time divided by distance recorded by OpenStreetMap (OSM).

We consider a greedy agent (GR) as a baseline, which always chooses the path with the lowest current estimate of expected cost. We evaluate the Thompson Sampling agent (TS) and the BayesUCB agent (B-UCB) in relation to the baseline. We run the simulations with all three agents (GR, TS, B-UCB) for each road network and report cumulative regret at a specific horizon $T$, averaged over five repetitions. The horizon is chosen such that the instant regret is almost stabilized for all three agents.

Figure 1(a) illustrates the average cumulative regret for the Manhattan road network over time, where at the horizon the TS agent yields the lowest cumulative regret. B-UCB follows TS and achieves a better result than the greedy baseline. Figure 1(b) shows average instant regret, which is high for the TS and B-UCB agents in the beginning; however, further on, one can see that first TS and then B-UCB start saturating, by achieving sufficient exploration. Figure 1(c) visualizes the Manhattan road network, where the paths explored by the TS agent are shown in red. The road segments that are explored more by the TS agent are displayed as more opaque.

Figure 2 shows the average cumulative regret for the three other road networks, respectively the cities of Salzburg, Eindhoven and Oslo. The results imply that TS yields the smallest cumulative regret, and then the cumulative regret of B-UCB is smaller than GR. Instant regret plots and road networks for these cities are provided in the supplementary material.

### Collaboration network

We consider a collaboration network from computational geometry (Geom) (jones2002computational) as an application of our approach to social networks. More specifically, we use the version provided by (handcock2003statnet) and distributed among the Pajek datasets at (pajek) where certain author duplicates, occurring in minor or major name variations, have been merged. The (handcock2003statnet) version is based on the BibTeX bibliography (beebe2002), to which the database (jones2002computational) has been exported. The network has 9072 vertices representing the authors and 22577 edges with the edge weights representing the number of mutual works between a pair of authors.

We simulate an environment where each edge weight is sampled as $w_e \sim \mathcal{N}(\theta^*_e, \sigma_e^2)$, within which $\theta^*_e$ is regarded as the true (negative) mean number of shared publications between a pair of authors linked by the edge $e$, and $\sigma_e$ is the observation noise. Furthermore, in this experiment, while the true negative mean number of mutual publications is assumed (by the agent) to be distributed according to the prior $\mathcal{N}(\mu_{e,0}, \varsigma^2_{e,0})$, we instead generate the mean $\theta^*_e$ from a wider prior, simulating a scenario where the prior belief of the agent is too high. The assumed mean $\mu_{e,0}$ of the prior is however consistent with the distribution from which $\theta^*_e$ is sampled, and is directly determined by the pairwise negative number of mutual collaborations from the dataset (handcock2003statnet).

Figure 3 shows the cumulative and instant regret, averaged over five runs for the three agents with horizon $T$, again chosen such that the instant regret is stabilized for all agents. One can see in Figure 3(a) that the TS agent reaches the lowest cumulative regret. Figure 3(b) suggests that as we go further in time, the instant regret of the TS and B-UCB agents approaches zero, where this happens earlier for the TS agent.

## 5 Conclusion

We developed an online learning framework for bottleneck identification in networks via minimax paths. In particular, we modeled this task as a combinatorial semi-bandit problem for which we proposed a combinatorial version of Thompson Sampling. We then established an upper bound on the Bayesian regret of the Thompson Sampling method. To deal with the computational intractability of the problem, we devised an alternative problem formulation which approximates the original objective. Finally, we investigated the framework on several directed and undirected real-world networks from transport and collaboration domains. Our experimental results demonstrate its effectiveness compared to alternatives such as greedy and B-UCB methods.

## Acknowledgements

This work is funded by the Strategic Vehicle Research and Innovation Programme (FFI) of Sweden, through the project EENE. We want to thank Emilio Jorge, Emil Carlsson and Tobias Johansson for insightful discussions around the proofs.

## Supplementary Material

### A Regret analysis

Here, we include detailed proofs for the theorems and lemmas in the main paper. We use the technique for analyzing the Bayesian regret of Thompson Sampling for general bandit problems outlined by (russo2014learning) and further detailed by (slivkins2019), carefully adapting it to our problem setting.

###### Theorem 1.

The Bayesian regret of Algorithm 1 is $O(d\sqrt{T \log T})$.

###### Proof.

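By Lemma 2 combined with Lemma 3, we have

$$
\begin{aligned}
\mathrm{BayesRegret}(T)
&\leq 8d + \sum_{t \in [T]} \mathbb{E}[U_t(a_t) - L_t(a_t)] \\
&= 8d + 2 \sum_{t \in [T]} \mathbb{E}\left[\max_{i \in a_t} \sqrt{\frac{32 \log T}{N_{t-1}(i)}}\right] \\
&= 8d + \sqrt{128 \log T} \sum_{t \in [T]} \mathbb{E}\left[\max_{i \in a_t} \frac{1}{\sqrt{N_{t-1}(i)}}\right] \\
&\leq 8d + \sqrt{128 \log T} \sum_{t \in [T]} \mathbb{E}\left[\sum_{i \in a_t} \frac{1}{\sqrt{N_{t-1}(i)}}\right] \\
&= 8d + \sqrt{128 \log T} \sum_{i \in \mathcal{A}} \mathbb{E}\left[\sum_{t : i \in a_t} \frac{1}{\sqrt{N_{t-1}(i)}}\right] \\
&= 8d + \sqrt{128 \log T} \sum_{i \in \mathcal{A}} \mathbb{E}\left[\sum_{j=1}^{N_T(i)+1} \frac{1}{\sqrt{j}}\right] \\
&\leq 8d + 2\sqrt{128 \log T} \sum_{i \in \mathcal{A}} \mathbb{E}\left[\sqrt{N_T(i)}\right] && \text{(see proof of Lemma 1 in (russo2014learning))} \\
&\leq 8d + 2\sqrt{128 \log T} \cdot \mathbb{E}\left[\sqrt{d \sum_{i \in \mathcal{A}} N_T(i)}\right] && \text{(arithmetic mean vs. quadratic mean inequality)} \\
&\leq 8d + 2\sqrt{128 \log T} \cdot \mathbb{E}\left[\sqrt{d^2 T}\right] \\
&= 8d + 2d\sqrt{128\, T \log T} \\
&= O(d\sqrt{T \log T})
\end{aligned}
$$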

###### Lemma 2.

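For Algorithm 1, we have that $\mathrm{BayesRegret}(T) = \sum_{t \in [T]} \mathbb{E}[U_t(a_t) - L_t(a_t)] + \sum_{t \in [T]} \mathbb{E}[f_{\theta^*}(a_t) - U_t(a_t)] + \sum_{t \in [T]} \mathbb{E}[L_t(a^*) - f_{\theta^*}(a^*)]$.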

###### Proof.

By Proposition 1 in (russo2014learning), we can decompose the Bayesian regret of the algorithm in the following way:

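$$
\begin{aligned}
\mathrm{BayesRegret}(T)
&= \sum_{t \in [T]} \mathbb{E}[f_{\theta^*}(a_t) - L_t(a_t)] + \sum_{t \in [T]} \mathbb{E}[L_t(a^*) - f_{\theta^*}(a^*)] \\
&= \sum_{t \in [T]} \mathbb{E}[U_t(a_t) - L_t(a_t)] + \sum_{t \in [T]} \mathbb{E}[f_{\theta^*}(a_t) - U_t(a_t)] + \sum_{t \in [T]} \mathbb{E}[L_t(a^*) - f_{\theta^*}(a^*)]
\end{aligned}
$$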

###### Lemma 3.

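For any $t \in [T]$, we have that $\mathbb{E}[f_{\theta^*}(a_t) - U_t(a_t)] \leq \frac{4d}{T}$ and $\mathbb{E}[L_t(a^*) - f_{\theta^*}(a^*)] \leq \frac{4d}{T}$.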

###### Proof.
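$$
\begin{aligned}
\mathbb{E}[L_t(a^*) - f_{\theta^*}(a^*)]
&= \mathbb{E}\left[f_{\hat{\theta}_{t-1}}(a^*) - f_{\theta^*}(a^*) - \max_{i \in a^*} \sqrt{\frac{32 \log T}{N_{t-1}(i)}}\right] \\
&\leq \mathbb{E}\left[2 \max_{i \in a^*} |\hat{\theta}_{i,t-1} - \theta^*_i| - \max_{i \in a^*} \sqrt{\frac{32 \log T}{N_{t-1}(i)}}\right] && \text{(by Lemma 4)} \\
&= \mathbb{E}\left[2 |\hat{\theta}_{j,t-1} - \theta^*_j| - \max_{i \in a^*} \sqrt{\frac{32 \log T}{N_{t-1}(i)}}\right] && \text{(let } j = \arg\max_{i \in a^*} |\hat{\theta}_{i,t-1} - \theta^*_i| \text{)} \\
&\leq \mathbb{E}\left[2 |\hat{\theta}_{j,t-1} - \theta^*_j| - \sqrt{\frac{32 \log T}{N_{t-1}(j)}}\right] \\
&\leq \mathbb{E}\left[2 \left[|\hat{\theta}_{j,t-1} - \theta^*_j| - \sqrt{\frac{8 \log T}{N_{t-1}(j)}}\right]_+\right] \\
&\leq \mathbb{E}\left[2 \sum_{i \in \mathcal{A}} \left[|\hat{\theta}_{i,t-1} - \theta^*_i| - \sqrt{\frac{8 \log T}{N_{t-1}(i)}}\right]_+\right] \\
&= 2 \sum_{i \in \mathcal{A}} \mathbb{E}\left[\left[|\hat{\theta}_{i,t-1} - \theta^*_i| - \sqrt{\frac{8 \log T}{N_{t-1}(i)}}\right]_+\right] \\
&= 2 \sum_{i \in \mathcal{A}} \mathbb{E}\left[|\hat{\theta}_{i,t-1} - \theta^*_i| - \sqrt{\frac{8 \log T}{N_{t-1}(i)}} \;\middle|\; |\hat{\theta}_{i,t-1} - \theta^*_i| > \sqrt{\frac{8 \log T}{N_{t-1}(i)}}\right] \cdot \Pr\left\{|\hat{\theta}_{i,t-1} - \theta^*_i| > \sqrt{\frac{8 \log T}{N_{t-1}(i)}}\right\} \\
&\leq 2 \sum_{i \in \mathcal{A}} \Pr\left\{|\hat{\theta}_{i,t-1} - \theta^*_i| > \sqrt{\frac{8 \log T}{N_{t-1}(i)}}\right\} && \text{(by Lemma 6)} \\
&\leq 2 \sum_{i \in \mathcal{A}} \frac{2}{T} && \text{(by Lemma 5)} \\
&\leq \frac{4d}{T}
\end{aligned}
$$

The proof for $\mathbb{E}[f_{\theta^*}(a_t) - U_t(a_t)] \leq \frac{4d}{T}$ is done in the same way. ∎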

###### Lemma 4.

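For any super arm $a \in \mathcal{I}$ and time step $t \in [T]$, we have that $|f_{\theta^*}(a) - f_{\hat{\theta}_{t-1}}(a)| \leq 2 \max_{i \in a} |\theta^*_i - \hat{\theta}_{i,t-1}|$.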

###### Proof.

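Let $Y_i$ and $Z_i$ for $i \in a$ be Gaussian random variables with $Y_i \sim \mathcal{N}(\theta^*_i, \sigma_i^2)$ and $Z_i \sim \mathcal{N}(\hat{\theta}_{i,t-1}, \sigma_i^2)$. Define $[x]_+ := \max(0, x)$ and let $\delta_{i,t-1} := \hat{\theta}_{i,t-1} - \theta^*_i$, $O_i := Y_i + \delta_{i,t-1}$ and $Q_i := Z_i - \delta_{i,t-1}$, so that $O_i$ has the same distribution as $Z_i$ and $Q_i$ has the same distribution as $Y_i$.

$$
\begin{aligned}
|f_{\theta^*}(a) - f_{\hat{\theta}_{t-1}}(a)|
&= \left|\mathbb{E}\left[\max_{i \in a} Y_i\right] - \mathbb{E}\left[\max_{i \in a} Z_i\right]\right| \\
&= \left[\mathbb{E}\left[\max_{i \in a} Y_i\right] - \mathbb{E}\left[\max_{i \in a} Z_i\right]\right]_+ + \left[\mathbb{E}\left[\max_{i \in a} Z_i\right] - \mathbb{E}\left[\max_{i \in a} Y_i\right]\right]_+ \\
&= \left[\mathbb{E}\left[\max_{i \in a} (O_i - \delta_{i,t-1})\right] - \mathbb{E}\left[\max_{i \in a} Z_i\right]\right]_+ + \left[\mathbb{E}\left[\max_{i \in a} (Q_i + \delta_{i,t-1})\right] - \mathbb{E}\left[\max_{i \in a} Y_i\right]\right]_+ \\
&\leq \left[\mathbb{E}\left[\max_{i \in a} O_i\right] + \max_{i \in a} (-\delta_{i,t-1}) - \mathbb{E}\left[\max_{i \in a} Z_i\right]\right]_+ + \left[\mathbb{E}\left[\max_{i \in a} Q_i\right] + \max_{i \in a} \delta_{i,t-1} - \mathbb{E}\left[\max_{i \in a} Y_i\right]\right]_+ \\
&= \left[\max_{i \in a} (-\delta_{i,t-1})\right]_+ + \left[\max_{i \in a} \delta_{i,t-1}\right]_+ && \text{(since } \mathbb{E}[\max_{i \in a} O_i] = \mathbb{E}[\max_{i \in a} Z_i] \text{ and } \mathbb{E}[\max_{i \in a} Q_i] = \mathbb{E}[\max_{i \in a} Y_i] \text{)} \\
&\leq 2 \max_{i \in a} |\delta_{i,t-1}| \\
&= 2 \max_{i \in a} |\theta^*_i - \hat{\theta}_{i,t-1}|
\end{aligned}
$$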