1 Introduction
Today, electric vehicles play a fast-growing role in transport systems. However, the applicability of these systems is often limited by the capacity of their batteries. A common concern with electric vehicles is the so-called “range anxiety” issue. Due to the historically high cost of batteries, the range of electric vehicles has generally been much shorter than that of conventional vehicles, which has led to the fear of being stranded when the battery is depleted. Such concerns could be alleviated by improving the navigation algorithms and route planning methods for these systems. Therefore, in this paper we aim to develop principled methods for energy-efficient navigation of electric vehicles.
Several works employ variants of shortest path algorithms to find the routes that minimize energy consumption. Some of them, e.g. [2, 22], focus on computational efficiency in searching for feasible paths where the constraints induced by limited battery capacity are satisfied. Both works use energy consumption as edge weights for the shortest path problem. They also model energy recuperation as negative edge weights, since they identify that negative cycles cannot occur due to the law of conservation of energy. In [22], a consistent heuristic function for energy consumption is used with a modified version of A* search to capture battery constraints at query time. In [5], instead of using fixed scalar energy consumption edge weights, the authors use piecewise linear functions to represent the energy demand, as well as lower and upper limits on battery capacity. This task has also been studied beyond shortest path problems in the context of the well-known vehicle routing problem (VRP). In [4], VRP is applied to electrified commercial vehicles in a two-stage approach, where the first stage consists of finding the paths between customers with the lowest energy consumption, and in the second stage the VRP including optional public charging station nodes is solved.
The aforementioned methods either assume that the information necessary for computing the optimal path is available, or do not provide any satisfactory exploration to acquire it. We therefore focus on developing an online framework to learn (explore) the parameters of the energy model adaptively alongside solving the navigation (optimization) problem instances. We employ a Bayesian approach to model the energy consumption for each road segment. The goal is to learn the parameters of such an energy model to be used for efficient navigation. To this end, we develop an online learning framework to investigate and analyze several exploration strategies for learning the unknown parameters.
Thompson Sampling (TS) [23], also called posterior sampling and probability matching, is a model-based exploration method for an optimal trade-off between exploration and exploitation. Several experimental [7, 13, 9] and theoretical studies [18, 6, 16] have shown the effectiveness of Thompson Sampling in different settings. [9] develop an online framework to explore the parameters of a decision model via Thompson Sampling in the application of interactive troubleshooting. [24] use Thompson Sampling for combinatorial semi-bandits, including the shortest path problem with Bernoulli distributed edge costs, and derive distribution-dependent regret bounds.
Upper Confidence Bound (UCB) [3] is another approach widely used for the exploration-exploitation trade-off. A variant of UCB for combinatorial semi-bandits is introduced and analyzed in [8]. A Bayesian version of the Upper Confidence Bound method is introduced in [15] and later analyzed in terms of regret bounds in [17]. An alternative approach is proposed in [20].
Beyond the novel online learning framework for energy-efficient navigation, we further extend our algorithms to the multi-agent setting, where multiple vehicles adaptively navigate and learn the parameters of the energy model. We then extensively evaluate the proposed algorithms on several synthetic navigation tasks, as well as in a real-world setting on the Luxembourg SUMO Traffic dataset. Our results demonstrate the effectiveness of our online learning framework, where the proposed algorithms significantly outperform the existing heuristics deployed in real-world navigation tasks.
2 Energy Model
We model the road network by a directed graph $G(V, E, w)$, where each vertex $v \in V$ represents an intersection of road segments and $E$ denotes the set of edges. Each edge $e = (u, v) \in E$ is a pair of vertices such that $u \neq v$, and it represents the road segment between the intersections associated with $u$ and $v$. Where bidirectional travel is allowed on a road segment represented by $(u, v)$, we add an edge $(v, u)$ in the opposite direction. A directed path $p$ is a sequence of vertices $\langle v_1, v_2, \ldots, v_n \rangle$, where $v_i \in V$ for $i = 1, \ldots, n$ and $(v_i, v_{i+1}) \in E$ for $i = 1, \ldots, n-1$. Hence, a path can also be viewed as a sequence of edges. If $p$ starts and ends with the same vertex, $p$ is called a cycle.
We associate a weight function $w: E \to \mathbb{R}$ with the edges of the graph, representing the total energy consumed by a vehicle traversing an edge. The weight function is extended to a path $p$ by letting $w(p) = \sum_{i=1}^{n-1} w(v_i, v_{i+1})$. We may define other functions to specify further attributes associated with intersections and road segments, such as the average speed $v: E \to \mathbb{R}^{+}$, the length $l: E \to \mathbb{R}^{+}$, and the inclination $\alpha: E \to \mathbb{R}$.
In our setting, the energy consumption of each road segment is stochastic and a priori unknown. We adopt a Bayesian approach to model the energy consumption at each road segment $e$, i.e., the edge weights $w(e)$. Such a choice provides a principled way to incorporate prior knowledge. Furthermore, as we will see, this approach fits well with the online learning and exploration of the parameters of the energy model.
We first consider a deterministic model of vehicle energy consumption for an edge $e$, which will later be used as the prior. Similar to the recent model in [4], our model is based on longitudinal vehicle dynamics. For convenience, we assume that vehicles drive at constant speed along individual edges, so that we can disregard the longitudinal acceleration term. We then have the following equation for the approximate energy consumption (in watt-hours):
$$E_e = \frac{m g\, l_e \sin(\alpha_e) + m g\, C_r\, l_e \cos(\alpha_e) + \tfrac{1}{2} C_d A \rho\, l_e v_e^2}{3600\, \eta} \qquad (2.1)$$
In Eq. 2.1, the vehicle mass $m$, the rolling resistance coefficient $C_r$, the frontal surface area $A$ and the air drag coefficient $C_d$ are vehicle-specific parameters, whereas the road segment length $l_e$, speed $v_e$ and inclination angle $\alpha_e$ are location (edge) dependent. We treat the gravitational acceleration $g$ and the air density $\rho$ as constants. The powertrain efficiency $\eta$ is vehicle-specific and can be approximated as a quadratic function of the speed, or by a constant $\eta = 1$ for an ideal vehicle with no battery-to-wheel energy losses.
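As a minimal sketch of the deterministic model described above, the following function evaluates the standard constant-speed longitudinal-dynamics terms (grade, rolling resistance, aerodynamic drag) for one edge and converts joules to watt-hours. The function name, argument names and the two constant values are assumptions made for illustration and are not taken from the source.

```python
import math

G_ACCEL = 9.81      # gravitational acceleration in m/s^2 (assumed value)
AIR_DENSITY = 1.2   # air density in kg/m^3 (assumed value)

def edge_energy_wh(length_m, speed_ms, incline_rad, mass_kg,
                   c_roll, c_drag, front_area_m2, efficiency=1.0):
    """Approximate energy (Wh) to traverse one edge at constant speed."""
    # Force needed to climb (negative when driving downhill).
    grade = mass_kg * G_ACCEL * math.sin(incline_rad)
    # Rolling resistance acting on the normal force.
    rolling = c_roll * mass_kg * G_ACCEL * math.cos(incline_rad)
    # Aerodynamic drag at constant speed.
    drag = 0.5 * AIR_DENSITY * c_drag * front_area_m2 * speed_ms ** 2
    joules = (grade + rolling + drag) * length_m
    # 1 Wh = 3600 J; a constant efficiency of 1.0 is the ideal vehicle.
    return joules / (3600.0 * efficiency)
```

With a sufficiently steep negative inclination the grade term dominates and the result becomes negative, which corresponds to the recuperation case discussed next.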
Actual energy consumption can be either positive (traction and auxiliary loads such as air conditioning) or negative (regenerative braking). If the energy consumption is modeled accurately and used as the weight function $w$ in a graph $G$, the law of conservation of energy guarantees that there exists no cycle $c$ in $G$ with $w(c) < 0$. However, since we are estimating the expected energy consumption from observations, this guarantee does not necessarily hold in our case.
Thereby, while modeling energy recuperation is desirable from an accuracy perspective, it introduces some difficulties. In terms of computational complexity, Dijkstra’s algorithm does not allow negative edge weights, and the Bellman-Ford algorithm is slower by an order of magnitude. While there are methods to overcome this, they still assume that there are no negative edge-weight cycles in the network. Hence, we choose to consider only positive edge weights when solving the energy-efficient (shortest path) problem. This approximation should still achieve meaningful results, since even when recuperation is discarded, edges with high energy consumption will still be avoided. So while the efficiency function has a higher value when the energy consumption is negative than when it is positive, we believe using a constant is a justified simplification, as we only consider positive edge-level energy consumption in the optimization stage. (Footnote 2: We emphasize that our generic online learning framework is independent of such approximations, and can be employed with any sensible energy model.)
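Motivated by [25], as a first attempt, we assume that the observed energy consumption $y_e$ of a road segment represented by an edge $e$ follows a Gaussian distribution, given a certain small range of inclination, vehicle speed and acceleration. We also assume that $y_e$ is independent of $y_{e'}$ for all $e' \in E$ where $e' \neq e$, and that we may observe negative energy consumption. The likelihood is then
$$p(y_e \mid \mu_e, \sigma_e^2) = \mathcal{N}(y_e \mid \mu_e, \sigma_e^2).$$
Here, for clarity, we assume the noise variance $\sigma_e^2$ is given. We can then use a Gaussian conjugate prior over the mean energy consumption:
$$p(\mu_e \mid \mu_{e,0}, \sigma_{e,0}^2) = \mathcal{N}(\mu_e \mid \mu_{e,0}, \sigma_{e,0}^2),$$
where we choose $\mu_{e,0}$ to be the deterministic estimate of Eq. 2.1 and $\sigma_{e,0} = \vartheta \mu_{e,0}$ for some constant $\vartheta > 0$. Due to the conjugacy properties, we have closed-form expressions for updating the posterior distributions with new observations of $y_e$. For any path $p$ in $G$, we have $\mathbb{E}\left[\sum_{e \in p} y_e\right] = \sum_{e \in p} \mathbb{E}[y_e]$, which means we can find the path with the lowest expected energy demand if we set $w(e) = \mathbb{E}[y_e]$ and solve the shortest path problem over $G$. To deal with $\mathbb{E}[y_e] < 0$, we instead set $w(e) = \mathbb{E}[z_e]$, where $z_e$ is distributed according to the rectified normal distribution $\mathcal{N}^{R}(\mu_e, \sigma_e^2)$, which is defined so that $x_e \sim \mathcal{N}(\mu_e, \sigma_e^2)$ and $z_e = \max(0, x_e)$. The expected value is then calculated as $\mathbb{E}[z_e] = \mu_e \left(1 - \Phi(-\mu_e / \sigma_e)\right) + \sigma_e\, \phi(-\mu_e / \sigma_e)$, where $\Phi$ and $\phi$ are the standard Gaussian CDF and PDF, respectively.
Alternatively, instead of assuming a rectified Gaussian distribution for the energy consumption of each edge, we model the nonnegative edge weights by (conjugate) Log-Gaussian likelihood and prior distributions. By definition, if we have a Log-Gaussian random variable $y \sim \mathrm{Log}\mathcal{N}(\mu, \sigma^2)$, then the logarithm of $y$ is a Gaussian random variable, $\ln y \sim \mathcal{N}(\mu, \sigma^2)$. Therefore, we have the expected value $\mathbb{E}[y] = \exp(\mu + \sigma^2 / 2)$, the variance $\mathrm{Var}[y] = \left(\exp(\sigma^2) - 1\right) \exp(2\mu + \sigma^2)$ and the mode $\exp(\mu - \sigma^2)$. We can thus define the likelihood as
(2.2)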
where and
. We also choose the prior hyperparameters such that
and , where and are calculated in the same way as for the Gaussian prior, in order to make fair comparisons between the Gaussian and Log-Gaussian results. The resulting prior distribution is
(2.3)
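The closed-form posterior update and the rectified-normal expectation used for the Gaussian model can be sketched as follows. This is an illustrative sketch under the stated assumptions (known noise variance, Gaussian prior on the mean); the function names are hypothetical and the code is not the source's Alg. 2.

```python
from statistics import NormalDist

def posterior_update(prior_mean, prior_var, noise_var, observations):
    """Conjugate Gaussian update of the mean with known noise variance."""
    n = len(observations)
    # Precisions (inverse variances) add up under conjugacy.
    post_var = 1.0 / (1.0 / prior_var + n / noise_var)
    post_mean = post_var * (prior_mean / prior_var
                            + sum(observations) / noise_var)
    return post_mean, post_var

def rectified_mean(mu, sigma):
    """E[max(0, X)] for X ~ N(mu, sigma^2), with sigma > 0."""
    std = NormalDist()          # standard Gaussian for the CDF and PDF
    a = -mu / sigma
    return mu * (1.0 - std.cdf(a)) + sigma * std.pdf(a)
```

The rectified mean is always nonnegative, so it can serve as an edge weight for a shortest path solver that requires nonnegative costs, even when the posterior mean itself is negative.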
3 Online Learning and Exploration of the Energy Model
We develop an online learning framework to explore the parameters of the energy model adaptively alongside solving sequentially the navigation (optimization) problem instances. At the beginning, the exact energy consumption of the road segments and the parameters of the respective model are unknown. Thus, we start with an approximate and possibly inaccurate estimate of the parameters. We use the current estimates to solve the current navigation task. We then update the model parameters according to the observed energy consumption at different segments (edges) of the navigated path, and use the new parameters to solve the next problem instance.
Alg. 1 describes these steps, where and refer to the current parameters of the energy model for all the edges at the current session , which are used to obtain the current edge weights ’s. We solve the optimization problem using ’s to determine the optimal action (or the arm, in the nomenclature of multi-armed bandit problems) , which in this context is a path. The action is applied and a reward is observed, consisting of the actual measured energy consumption for each of the traversed edges. Since we want to minimize energy consumption, we regard it as a negative reward when we update the parameters (shown, for example, for the Gaussian model in Alg. 2). indicates the total number of sessions, sometimes called the horizon. To measure the effectiveness of our online learning algorithm, we consider its regret, which is the difference in total expected reward between always playing the optimal action and playing actions according to the algorithm. Formally, the instant regret at session is defined as where is the maximal expected reward for any action, and the cumulative regret is defined as .
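To make the regret definition concrete, the sketch below accumulates instant regrets over a sequence of sessions. Because the reward is the negative energy consumption, the instant regret equals the expected cost of the played path minus the expected cost of the optimal path. The function name and the cost-based formulation are assumptions for illustration.

```python
def cumulative_regret(optimal_cost, expected_costs):
    """Running sum of instant regrets for a sequence of played paths.

    optimal_cost:   expected energy of the best path (ground truth)
    expected_costs: expected energy of the path played in each session
    """
    total, curve = 0.0, []
    for cost in expected_costs:
        total += cost - optimal_cost   # instant regret, always >= 0
        curve.append(total)
    return curve
```

A learner that converges to the optimal path produces a curve that flattens over time, which is the sublinear behavior that the regret bounds describe.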
3.1 Shortest Path Problem as Multi-Armed Bandit
A combinatorial bandit [12] is a multi-armed bandit problem where an agent is only allowed to pull sets of arms instead of an individual arm. However, there may be restrictions on the feasible combinations of the arms. We consider the combinatorial semi-bandit case, where the rewards are observed for each individual arm pulled by an agent during a round.
A number of different combinatorial problems can be cast as multi-armed bandits in this way; among them, the shortest path problem is the focus of this work. Efficient algorithms for the deterministic problem (e.g. Dijkstra’s algorithm [11]) can be used as an oracle [24] to provide feasible sets of arms to the agent, as well as to maximize the expected reward.
We connect this to the optimization problem in Alg. 1, where we want to find an action . At time , let be a directed graph with weight function and sets of vertices and edges . Given a source vertex and a target vertex , let be the set of all paths in such that . Assuming nonnegative edge costs for each edge , the problem of finding the shortest path (action ) from to can be defined as
(3.1) 
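The oracle role of a deterministic shortest path solver can be sketched with a textbook Dijkstra implementation over nonnegative edge weights. The adjacency-dictionary representation and the function name are assumptions made for this example.

```python
import heapq

def shortest_path(adj, source, target):
    """Dijkstra over adj = {u: {v: nonnegative_weight}}.

    Returns (total_cost, list_of_vertices); (inf, []) if unreachable.
    """
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue                      # stale heap entry
        done.add(u)
        if u == target:
            break
        for v, w in adj.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if target not in dist:
        return float("inf"), []
    # Walk the predecessor links back from the target.
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return dist[target], path[::-1]
```

In the online setting, the weights passed to such an oracle change from session to session, since they are derived from the current posterior parameters of each edge.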
3.2 Thompson Sampling
Since the greedy method does not actively explore the environment, there are other methods which perform better in terms of minimizing cumulative regret. One commonly used method is $\epsilon$-greedy, where a (uniform) random action is taken with probability $\epsilon$ and the greedy strategy is used otherwise. However, this method is not well suited to the shortest path problem, since a random path from the source vertex to the target would almost certainly be very inefficient in terms of accumulated edge costs.
An alternative method for exploration is Thompson Sampling (TS). In our Bayesian setup, the greedy strategy chooses the action which maximizes the expected reward according to the current estimate of the mean rewards. In contrast, with TS the agent samples from the model, i.e., it selects an action which has a high probability of being optimal by sampling mean rewards from the posterior distribution and choosing an action which maximizes those during each session.
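The sampling step of Thompson Sampling can be sketched as drawing one mean per edge from its Gaussian posterior and handing the draws to the shortest path oracle as edge weights. Clipping negative draws at zero is an assumption made here to keep the weights nonnegative; it is one simple option and is not claimed to match the source's Alg. 3. The function name and the posterior representation are also hypothetical.

```python
import random

def sample_edge_weights(posteriors, rng=random):
    """Draw one weight per edge from its Gaussian posterior.

    posteriors: {edge: (posterior_mean, posterior_variance)}
    rng:        object with a gauss(mu, sigma) method (default: random module)
    """
    weights = {}
    for edge, (mean, var) in posteriors.items():
        draw = rng.gauss(mean, var ** 0.5)
        # Assumption: clip at zero so a Dijkstra-style oracle stays applicable.
        weights[edge] = max(0.0, draw)
    return weights
```

Edges with a wide posterior occasionally receive an optimistic (low) sampled cost and are therefore tried, while edges with a narrow posterior behave almost as under the greedy strategy.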
Thompson Sampling for the energy consumption shortest path problem is outlined in Alg. 3, where it can be used in Alg. 1 to obtain the edge weights in the network (only shown for the Gaussian model). In the following, we provide an upper bound on the cumulative regret of Thompson Sampling for the shortest path navigation problem. (Footnote 3: We defer the full proof to the appendix.)
Theorem 1.
Let be a weighted directed graph, with nodes and edges. Let be the number of paths in . The expected cumulative regret of Alg. 3 satisfies .
Proof sketch.
The online shortest path problem can be viewed as (1) a combinatorial semi-bandit problem with linear reward functions, where the feedback includes all the sub-arms (edges) in the played arm (a path from source to target); or (2) a reinforcement learning problem, where each node corresponds to a state and each edge corresponds to an action. For (1), we obtain an expected cumulative regret bound following [24] (ignoring logarithmic factors); for (2), we obtain a regret bound following [19]. Combining the two views yields the stated regret bound. ∎
3.3 Upper Confidence Bound
Another class of algorithms demonstrated to work well in the context of multi-armed bandits is the collection of methods developed around the Upper Confidence Bound (UCB). Informally, these methods are designed based on the principle of optimism in the face of uncertainty. The algorithms achieve efficient exploration by choosing the arm with the highest empirical mean reward added to an exploration term (the confidence width). Hence, the arms chosen are those with a plausible possibility of being optimal.
In [8], a combinatorial version of UCB (CUCB) is shown to achieve sublinear regret for combinatorial semi-bandits. However, a Bayesian approach is beneficial in this problem, since it allows us to encode theoretical knowledge of the energy consumption in a prior. Hence, we consider Bayes-UCB [15] and adapt it to the combinatorial semi-bandit setting. Similar to [15], we denote the quantile function of a distribution $\lambda$ by $Q(\alpha, \lambda)$, such that $P_{\lambda}\left(X \le Q(\alpha, \lambda)\right) = \alpha$. The idea of that work is to use upper quantiles of the posterior distributions of the expected arm rewards to select arms. If $\lambda$ denotes the posterior distribution of an arm and $t$ is the current session, the Bayesian Upper Confidence Bound (Bayes-UCB) is $Q(1 - 1/t, \lambda)$.
This method is outlined in Alg. 4 for the Gaussian model. Since the goal is to minimize the energy consumption, which can be regarded as the negative of the reward, we use the lower quantile $Q(1/t, \lambda)$ instead.
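For a Gaussian posterior, an optimistic edge weight based on a lower quantile can be sketched with the standard library's `statistics.NormalDist`. The quantile level `1 / (t + 1)` and the clipping at zero are assumptions of this sketch (the shift by one keeps the level strictly inside (0, 1) for the first session); the function name is hypothetical and the code is not the source's Alg. 4.

```python
from statistics import NormalDist

def lower_quantile_weight(post_mean, post_var, t):
    """Optimistic (low) edge cost: the 1/(t+1) quantile of N(mean, var).

    t is the 1-based session index. The result is clipped at zero so that
    it can be used with a nonnegative-weight shortest path solver.
    """
    if post_var <= 0.0:
        return max(0.0, post_mean)      # degenerate posterior: no uncertainty
    level = 1.0 / (t + 1)               # assumption: keeps 0 < level < 1
    q = NormalDist(post_mean, post_var ** 0.5).inv_cdf(level)
    return max(0.0, q)
```

As the session index grows, the quantile level shrinks and the weight moves further below the posterior mean, while the posterior variance of frequently visited edges shrinks and counteracts this effect.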
4 Multi-Agent Learning and Exploration
Online learning can be accelerated by having multiple agents explore simultaneously and share information on the observed rewards with each other. In our particular application, this corresponds to a fleet of vehicles of similar type sharing information about energy consumption across the fleet. Such a setting can be very important for road planning, electric vehicle industries and city principals.
The communication between the agents for the sake of sharing the observed rewards can be synchronous or asynchronous. In this paper, we consider the synchronous setting, where the vehicles drive concurrently in each time step and share their accumulated knowledge with the fleet before the next iteration starts. At each session, each vehicle independently selects a path to explore/exploit according to the online learning strategies provided in Sec. 3. Notably, our online learning framework and theoretical analysis are applicable to the asynchronous setting in a similar manner. Below, we provide a regret bound for the TS-based multi-agent learning algorithm under the synchronous setting.
Theorem 2 (Synchronous Multi-agent Learning).
Let be the number of agents, and be the number of sessions. Given a weighted directed graph , the expected cumulative regret of the synchronized multi-agent online learning algorithm (i.e., agents working in parallel in each session) invoking Alg. 3 satisfies .
The proof (provided in the longer version, see Footnote 3) considers the online shortest path problem as a combinatorial semi-bandit problem, and treats the multi-agent setting as a sequential algorithm with delayed feedback. The result can be obtained as a corollary of Theorem 1 of Sec. 3.2 and Theorem 6 of [14], which converts online algorithms for the non-delayed case to ones that can handle delays in the feedback, while retaining their theoretical guarantees.
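The synchronous sharing step can be sketched as pooling all observations made by the fleet during one session into a single set of shared posteriors before the next session starts. The sketch assumes the Gaussian model with a single known noise variance shared by all edges, which is a simplification for illustration; the function name and data layout are hypothetical.

```python
def fleet_update(posteriors, noise_var, fleet_observations):
    """Pool one session's observations from all agents.

    posteriors:         {edge: (mean, variance)} shared by the fleet
    noise_var:          known observation noise variance (assumed shared)
    fleet_observations: one {edge: observed_energy} dict per agent
    Returns a new posterior dictionary; the input is left unchanged.
    """
    updated = dict(posteriors)
    for agent_obs in fleet_observations:
        for edge, y in agent_obs.items():
            mean, var = updated[edge]
            # Sequential conjugate Gaussian update, one observation at a time.
            new_var = 1.0 / (1.0 / var + 1.0 / noise_var)
            new_mean = new_var * (mean / var + y / noise_var)
            updated[edge] = (new_mean, new_var)
    return updated
```

Because every agent selects its path from the posteriors available at the start of the session, observations made by the other agents in the same session only take effect in the next one, which is the delayed-feedback view used in the analysis.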
5 Experimental Results
In this section, we describe different experimental studies. For real-world experiments, we extend the simulation framework presented in [21] to network/graph bandits with general directed graphs, in order to enable exploration scenarios in realistic road networks. Furthermore, we add the ability to generate synthetic networks of specified size to this framework, in order to verify the derived regret bounds (as the ground truth is provided for the synthetic networks).
5.1 Real-World Experiments
Settings
We utilize the Luxembourg SUMO Traffic (LuST) Scenario data [10] to provide realistic traffic patterns and vehicle speed distributions for each hour of the day. This is used in conjunction with altitude and distances from map data, as well as vehicle parameters from an electric vehicle. The resulting graph has nodes and edges, representing a road network with km of highways, arterial roads and residential streets. The difference in altitude between the lowest point in the road network and the highest is meters.
We use the default vehicle parameters that were provided for the energy consumption model in [4], with vehicle frontal surface area meters, air drag coefficient and rolling resistance coefficient . The vehicle is a medium duty truck with vehicle mass kg, which is the curb weight added to half of the payload capacity.
We approximate the powertrain efficiency during traction by and powertrain efficiency during regeneration by . In addition, we use the constant gravitational acceleration and air density .
To simulate the ground truth of the energy consumption, we take the average speed of each edge from a full 24 hour scenario in the LuST traffic simulation environment. In particular, we observe the values during a peak hour (8 AM), with approximately 5500 vehicles active in the network. This hour is selected to increase the risk of traffic congestion, hence finding the optimal path becomes more challenging. We also get the variance of the speed of each road segment from LuST. Using this information, we sample the speed value for each visited edge and use the energy consumption model to generate the rewards for the actions.
For the Gaussian likelihood , we assume to be proportional to in Eq. 2.1, such that . For the Log-Gaussian likelihood, we choose , so that it has the same variance as the Gaussian likelihood. We set for both. For the prior of an edge , we use the speed limit of as , indicating that the average speed is unknown. Then and , where .
Results
As a baseline, we consider the greedy algorithm for both the Gaussian and Log-Gaussian models, where the exploration rule is to always choose the path with the lowest currently estimated expected energy consumption, an extension of the recent method in [4].
We run the simulations with a horizon of (i.e., sessions or problem instances). Fig. 0(b) shows the cumulative regret for the Gaussian and Log-Gaussian models, where the regret is averaged over 5 runs for each agent. In Fig. 0(a), the instant regret averaged over 5 runs is shown for the same scenario. It is clear that Thompson Sampling with the Log-Gaussian model has the best performance in terms of cumulative regret, but the other non-greedy agents also achieve good results. To illustrate that Thompson Sampling explores the road network in a reasonable way, Fig. 0(c) visualizes the road network and the paths visited by this exploration algorithm. We observe that no significant detours are taken, in the sense that most paths are close to the optimal path. This indicates the superiority of Thompson Sampling over a random exploration method such as $\epsilon$-greedy in our application.
For the multi-agent case, we use a horizon of and 10 scenarios where we vary the number of concurrent agents by . The cumulative regret averaged over the agents in each scenario is shown in Fig. 3 for each . In the figure, the final cumulative regret for each agent decreases sharply with the addition of just a few agents to the fleet. This continues until there are five agents, after which there seem to be diminishing returns from adding more agents. While there is some overhead (parallelism cost), just enabling two agents to share knowledge with each other decreases their average cumulative regret at by almost a third. This observation highlights the benefit of enabling collaboration early in the exploration process, which is also supported by the regret bound in Thm. 2. More detail can be found in the supplementary material.
5.2 Synthetic Networks
Settings
In order to evaluate the regret bound in Thm. 1, we design synthetic directed acyclic network instances according to a specified number of vertices and number of edges (with the constraint that ). We start the procedure by adding vertices to . Then for each we add an edge to . This ensures that the network contains a path with all vertices in . Finally, we add edges uniformly at random to , such that , and . An example of such a synthetic network with and is shown in Fig. 1(a).
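The generation procedure described above can be sketched as follows: build the chain that visits every vertex, then add forward "shortcut" edges chosen uniformly at random so that the graph stays acyclic. The integer vertex labels, the function name and the explicit edge-count check are assumptions of this sketch.

```python
import random

def synthetic_dag(num_vertices, num_edges, rng=random):
    """Chain 0 -> 1 -> ... -> n-1 plus random forward shortcut edges.

    Returns a sorted list of (i, j) edges with i < j, so the graph is acyclic.
    """
    max_edges = num_vertices * (num_vertices - 1) // 2
    if not num_vertices - 1 <= num_edges <= max_edges:
        raise ValueError("need |V| - 1 <= |E| <= |V| (|V| - 1) / 2")
    # The chain guarantees a path that contains every vertex.
    edges = {(i, i + 1) for i in range(num_vertices - 1)}
    # Remaining candidates skip at least one vertex of the chain.
    candidates = [(i, j) for i in range(num_vertices)
                  for j in range(i + 2, num_vertices)]
    edges.update(rng.sample(candidates, num_edges - len(edges)))
    return sorted(edges)
```

Passing a seeded `random.Random` instance as `rng` makes the generated instance reproducible across runs.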
Since these networks are synthetic, instead of modeling probabilistic energy consumption, we design instances where it is difficult for an exploration algorithm to find the path with the lowest expected cost. Given a synthetic network generated according to the aforementioned procedure, we select to be the optimal path. In other words, contains every vertex . The reward distribution for each edge in is chosen to be with and . For where , we set , where is the number of vertices skipped by the shortcut. This guarantees that no matter the size of the network and the number of edges that form shortcuts between vertices in , will always have a lower expected cost than any other path in .
For the prior , we set and . This choice implies according to our prior, every path from the source to the target will initially have the same estimated expected cost.
Results
We run the synthetic network experiment with sessions, varying the number of vertices and edges . In Fig. 1(b), each plot represents the cumulative regret at for a fixed , as a function of . In Fig. 1(c), we instead look at the regret for fixed as a function of . We observe that the regret increases with the number of edges and decreases with the number of vertices. This observation is consistent with the theoretical regret bound in Thm. 1. By increasing the number of edges while is fixed, the number of paths and the other terms increase too, which yields a larger regret bound. On the other hand, with a fixed increasing the number of nodes increases the sparsity, i.e., the number of paths decreases, which in turn yields a lower regret bound.
6 Conclusion
We developed a Bayesian online learning framework for the problem of energy-efficient navigation of electric vehicles. Our Bayesian model assumes a Gaussian or Log-Gaussian energy model. To learn the unknown parameters of the model, we adapted exploration methods such as Thompson Sampling and UCB within the online learning framework. We extended the framework to the multi-agent setting and established theoretical regret bounds in different settings. Finally, we demonstrated the performance of the framework with several real-world and synthetic experiments.
References
[1] (2012) Analysis of Thompson sampling for the multi-armed bandit problem. In Conference on learning theory, pp. 39–1. Cited by: §A.1.
[2] (2010) The shortest path problem revisited: optimal routing for electric vehicles. In Annual conference on artificial intelligence, pp. 309–316. Cited by: §1.
[3] (2002) Using confidence bounds for exploitation-exploration trade-offs. J. Mach. Learn. Res. 3, pp. 397–422. Cited by: §1.
[4] (2019) Energy consumption estimation integrated into the electric vehicle routing problem. Transportation Research Part D: Transport and Environment 69, pp. 141–167. Cited by: §1, §2, §5.1, §5.1.
[5] (2017) Consumption Profiles in Route Planning for Electric Vehicles: Theory and Applications. In 16th International Symposium on Experimental Algorithms (SEA 2017), Leibniz International Proceedings in Informatics (LIPIcs), Vol. 75, pp. 19:1–19:18. Cited by: §1.
[6] (2013) Prior-free and prior-dependent regret bounds for Thompson sampling. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013, pp. 638–646. Cited by: §1.
[7] (2011) An empirical evaluation of Thompson sampling. In NIPS, pp. 2249–2257. Cited by: §1.
[8] (2013) Combinatorial multi-armed bandit: general framework and applications. In International Conference on Machine Learning, pp. 151–159. Cited by: §1, §3.3.
[9] (2017) Efficient online learning for optimizing value of information: theory and application to interactive troubleshooting. In Proceedings of the Thirty-Third Conference on Uncertainty in Artificial Intelligence, UAI. Cited by: §1.
[10] (2017) Luxembourg SUMO Traffic (LuST) Scenario: Traffic Demand Evaluation. IEEE Intelligent Transportation Systems Magazine 9 (2), pp. 52–63. Cited by: §5.1.
[11] (1959) A note on two problems in connexion with graphs. Numerische Mathematik 1 (1), pp. 269–271. Cited by: §3.1.
[12] (2012) Combinatorial network optimization with unknown variables: multi-armed bandits with linear rewards and individual observations. IEEE/ACM Transactions on Networking (TON) 20 (5), pp. 1466–1478. Cited by: §3.1.
[13] (2010) Web-scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft’s Bing search engine. In ICML, pp. 13–20. Cited by: §1.
[14] (2013) Online learning under delayed feedback. In International Conference on Machine Learning, pp. 1453–1461. Cited by: §A.2, §4.
[15] (2012) On Bayesian upper confidence bounds for bandit problems. In Artificial intelligence and statistics, pp. 592–600. Cited by: §1, §3.3.
[16] (2012) Thompson sampling: an asymptotically optimal finite-time analysis. In Algorithmic Learning Theory, 23rd International Conference, ALT, pp. 199–213. Cited by: §1.
[17] (2018) On Bayesian index policies for sequential resource allocation. The Annals of Statistics 46 (2), pp. 842–865. Cited by: §1.
[18] (2013) (More) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013, pp. 3003–3011. Cited by: §A.1, §A.1, §1.
[19] (2017) Why is posterior sampling better than optimism for reinforcement learning?. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 2701–2710. Cited by: §A.1, §3.2.
[20] (2014) Modeling human decision making in generalized Gaussian multi-armed bandits. Proceedings of the IEEE 102 (4), pp. 544–571. Cited by: §1.
[21] (2018) A tutorial on Thompson sampling. Foundations and Trends® in Machine Learning 11 (1), pp. 1–96. Cited by: §5.
[22] (2011) Efficient energy-optimal routing for electric vehicles. In Twenty-fifth AAAI conference on artificial intelligence. Cited by: §1.
[23] (1933) On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25 (3–4), pp. 285–294. Cited by: §1.
[24] (2018) Thompson sampling for combinatorial semi-bandits. In International Conference on Machine Learning, pp. 5101–5109. Cited by: §A.1, §1, §3.1, §3.2.
[25] (2015) Electric vehicles’ energy consumption measurement and estimation. Transportation Research Part D: Transport and Environment 34, pp. 52–67. Cited by: §2.
Appendix A Proofs
A.1 Proof of Theorem 1
Proof.
Reinforcement learning view
We first view the online shortest path problem as a reinforcement learning problem, where each node corresponds to a state of the agent, and each edge corresponds to an action. Therefore, the online learning framework involving Alg. 3 can be viewed as the posterior-sampling reinforcement learning (PSRL) algorithm of [18]. In the following, we show that the expected cumulative regret under this setting is bounded by (note that “ignores” logarithmic factors).
Concretely, [18] study posterior sampling for reinforcement learning, and establish an regret bound on the expected regret, where is the number of episodes, is the episode length and and are the cardinalities of the state and action spaces for episodic reinforcement learning (RL) on a finite time horizon (unknown) Markov decision process. Note that for the online shortest path problem considered in this paper, the maximal number of actions per session is upper bounded by . Therefore, we can view it as a special case of an episodic RL problem, with episode length (the regret will be padded by 0 if the length of the computed shortest path for a session is less than ). Under the RL view, observe that , , . Following the analysis of [19], we obtain the bound for the expected regret
Combinatorial semi-bandits view
We can also view the online shortest path problem as a combinatorial semi-bandit problem, where the feedback includes all the sub-arms (i.e., edge values) in the played arm (i.e., a path from source to target). By [24], we obtain an expected regret bound of , where is the maximal number of arms reviewed per session. Since the classical regret bound for the multi-armed bandit problem [1] also applies to our problem, we further obtain an expected regret bound of , where denotes the number of “super-arms”, i.e., the total number of paths that could be visited per session.
Combining the reinforcement learning view and the combinatorial semi-bandit view, we obtain the claimed regret bound of
which completes the proof. ∎
A.2 Proof of Theorem 2
Proof.
We consider the online shortest path problem as a combinatorial semi-bandit problem, and treat the multi-agent setting as a sequential algorithm with delayed feedback.
Let denote the cumulative regret of the multi-agent Thompson Sampling algorithm with agents over sessions, and denote the cumulative regret of the sequential (single-agent) version of the algorithm over sessions. By Theorem 6 of [14], we get
(A.1) 
Here, denotes the delay in feedback for the action played by a sequential algorithm. Clearly, in the multi-agent update setting, . By Theorem 1 we know
(A.2) 
Appendix B Supplemental Experimental Results
In Fig. 4 we provide an alternative perspective on the multi-agent experiments performed in the main part of the paper (Sec. 5.1), where the average cumulative regret is shown as a function of the number of agents in a scenario, for a number of fixed horizons . This further supports the result in Sec. 5.1 and demonstrates that having only a few agents collaborate decreases their average cumulative regret significantly. This result also shows the diminishing returns from adding more agents after the fifth agent has been added. In a similar way, Fig. 5 illustrates the total cumulative regret (summed over all agents) as a function of the number of agents, where each curve corresponds to a fixed number of actions. This observation indicates that, despite the diminishing returns, the total cumulative regret does not dramatically increase with a higher number of agents, consistent with the theoretical analysis in Sec. 4.
Related to the experiments in Sec. 5.2, Fig. 6 shows the cumulative regret of a number of synthetic networks with varying and as a function of session . Consistently, by sparsifying the network via changing either the number of nodes or the number of edges, the experimental regret decreases. This result is consistent with the theoretical regret bound in Thm. 1.