The multi-armed bandit problem is an online learning problem in which a player has access to a set of choices (i.e., “arms”) each of which provides some reward (i.e., “gain”). At each time step, the player chooses an arm and gets some reward. In stochastic variants, rewards are determined by some probability distribution. In adversarial variants, an adversary specifies the rewards. Amazingly, even when rewards are adversarially chosen, the player can do fairly well! For example, the EXP3 algorithm minimizes the player’s “regret”, ensuring that the player does almost as well as if she had selected the single fixed best arm throughout. Another fascinating property of bandit algorithms is that they work well in multi-player settings [27, 16], converging to close variants of a Nash equilibrium.
Recently, it has been shown that bandit-style algorithms can efficiently solve the wireless network selection problem, yielding good performance both in theory and in practice [1, 2, 7]. In this problem, each user has access to a collection of networks (e.g., a few different WiFi networks and a 4G connection); the goal is to pick networks with higher data rates. Selecting the best network is challenging, especially in dynamic environments where the “best” network changes over time, as users move and network bandwidth fluctuates. This can be modeled as an adversarial bandit problem and solved with EXP3 and its variants.
Bandit algorithms have one major weakness in dynamic settings (such as wireless network settings): they are designed to learn the average payoff of each arm, and to converge to the arm that provides the best average performance. In the stochastic case, this is exactly what you want. In the adversarial case, it leads to minimum regret, i.e., the user does almost as well as if they knew the best network in advance. If, however, the situation is changing over time, and especially if it is changing in some predictable manner, then learning the average payoff of each arm is not productive.
Periodic, repetitive patterns are a particularly common type of dynamic behavior. Take, for example, the problem of network selection. Network behavior is often repetitive, with user density and network quality following regular patterns: for example, office WiFi networks have no users at night, their performance drops when workers arrive in the morning, and the performance improves again during lunch hour. Other networks are clogged with streaming video during lunch hour and in the evenings. Periodic patterns are ubiquitous.
Unfortunately, bandit algorithms will fail badly in the case of periodic behavior. As an example, suppose a player is playing a slot machine with two arms. The first arm gives a reward of $1$ when pulled on odd-numbered hours and $0$ otherwise, while the second arm does the reverse, with a reward of $1$ on even-numbered hours and $0$ otherwise. In this simple case, a bandit algorithm will never learn this pattern, instead converging to the best single-action policy; and the best such policy can only reap half of the maximum reward. The player will receive an average payout of only $1/2$ per selection, despite a very predictable pattern. And when this case is extended to cycle among $K$ arms, the best fixed choice of arm gives only $1/K$ of the total obtainable reward. Thus, algorithms like EXP3 that minimize the regret do not guarantee good performance on periodic problems.
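The cycling example can be checked numerically. The following is a minimal sketch, assuming 0-indexed time steps and a reward of 1 for the single "active" arm at each step; the function names are illustrative and not part of the paper.

```python
def cycling_rewards(num_arms, horizon):
    """Reward table: at step t, arm (t mod num_arms) pays 1 and all others pay 0."""
    return [[1.0 if t % num_arms == i else 0.0 for i in range(num_arms)]
            for t in range(horizon)]


def best_fixed_arm_total(rewards):
    """Total reward of the best single arm played on every step."""
    num_arms = len(rewards[0])
    return max(sum(row[i] for row in rewards) for i in range(num_arms))


def best_per_step_total(rewards):
    """Total reward of a player that follows the cycle perfectly."""
    return sum(max(row) for row in rewards)


two_arm = cycling_rewards(num_arms=2, horizon=1000)
fraction_two = best_fixed_arm_total(two_arm) / best_per_step_total(two_arm)

four_arm = cycling_rewards(num_arms=4, horizon=1000)
fraction_four = best_fixed_arm_total(four_arm) / best_per_step_total(four_arm)
```

Under these assumptions the best fixed arm collects one half of the obtainable reward with two arms and one quarter with four, which is the shortfall that a regret measure against fixed arms cannot detect.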
Our goal in this paper is to develop an efficient adversarial bandit algorithm for periodic settings, and to demonstrate the effectiveness of this algorithm in the context of the wireless network selection problem, yielding a new approach to network selection in dynamic, periodic environments. The first step is to establish the right metric by which to evaluate bandit algorithms. The performance of an adversarial bandit algorithm is heavily characterized by the definition of “regret,” which forms the baseline that it competes against. And traditionally, the regret is computed with respect to the best fixed strategy.
For the periodic bandit setting, we define a better performance measure, “periodic regret”, which compares an algorithm’s performance against the best periodic choice of arms. No choice of period may match the input data perfectly, but the goal of periodic regret is to compare against the best choice. Moreover, we provide a generalized notion of periodicity, so that this notion of periodic regret can capture different types of patterned behavior.
Next, we develop an algorithm that minimizes periodic regret, Periodic EXP4, a computationally efficient variant of EXP4 (Exponential-weight algorithm for Exploration and Exploitation using Expert advice). We show that the algorithm minimizes periodic regret in the following sense: with $K$ arms and $|F|$ possible periods, each of length at most $P$, in an execution of length $T$ the periodic regret is at most $O(\sqrt{PTK \ln K + KT \ln |F|})$. We also prove a lower bound of $\Omega(\sqrt{PTK + KT \ln |F| / \ln K})$ on periodic regret in an adversarial setting, showing that this is optimal within log-factors. An important aspect of Periodic EXP4 is that it is a polynomial time algorithm: we leverage the structure provided by the target periodic patterns to reduce the computational complexity. This is in contrast to EXP4, which requires exponential time and space in this context.
The other major contribution of this paper is a new algorithm for network selection that is especially optimized for environments with periodic, patterned behaviors. We simulate the network selection problem, comparing Periodic EXP4 to EXP3 and to a “randomized optimal” omniscient solution. (We have previously seen in  that these types of simulations are reasonably predictive of real-world behavior.)
Our first observation is that Periodic EXP4 does in fact efficiently learn periodic patterns and adapts relatively quickly to changes in network data rates (both discrete and continuous). We also see that Periodic EXP4 does indeed outperform EXP3 in periodic settings, as expected, potentially yielding significant real-world improvements.
Our second question involved the robustness of Periodic EXP4 to noisy patterns. Real-world periodic patterns are rarely perfectly periodic, suffering noise and variance. We experiment with noisy patterns, and see that Periodic EXP4 continues to work well.
Finally, our third set of experiments looked at the performance of Periodic EXP4 in the context of user mobility. We simulate several scenarios where users change location over time, leading to changes in which networks they can access (and hence changes in the load on those networks). For example, we imagine a typical office scenario where users arrive at the office in the morning, take a break for lunch, return to work, and then head home at the end of the day. We observe that Periodic EXP4 can also learn this type of periodic behavior, again, learning to adapt the users’ network selection in a near-optimal fashion. In fact, we compare two versions of the algorithm: one in which the algorithm is notified when networks become unavailable, and one in which it is not—we observe that even in the latter case where it is completely oblivious to the changes, the user strategy converges to near-optimal choices.
Overall, we conclude that periodic adversarial bandit algorithms may have significant value, that Periodic EXP4 is an efficient algorithm for the problem, and that it yields a potentially interesting and useful approach to network selection.
2 Related work
In this section, we discuss relevant work on bandit algorithms and state-of-the-art wireless network selection approaches. Multi-armed bandit techniques have been successfully applied to wireless network selection [1, 2, 7]. They have also been considered for other resource selection problems, such as channel selection [13, 27], selection of the right sensors to query in a sensor network, and selection of replica servers for content distribution networks.
Many variations of bandit problems have been studied, in both stochastic and adversarial settings. EXP3 is the most well-known algorithm for the standard adversarial bandit problem. With $K$ arms and $T$ time steps, it establishes a pseudo-regret upper bound of $O(\sqrt{TK \ln K})$, which almost matches the lower bound of $\Omega(\sqrt{TK})$. The gap in the bounds has recently been closed, bringing the upper bound down to $O(\sqrt{TK})$. However, these results bound the regret against the best single-action policy, limiting their usefulness in a periodic setting.
A related problem is that of bandits with expert advice, defined in the same paper. It defines a more general notion of regret, by competing against the best policy from a set. With $K$ arms, $T$ time steps and $N$ experts, the EXP4 algorithm gives a pseudo-regret bound of $O(\sqrt{TK \ln N})$. However, its possibly high running time and memory cost limit its use in practice. There are other algorithms for bandits with expert advice, like Context-FTPL. The latter is more computationally efficient, but has a weaker regret bound. A lower bound of $\Omega(\sqrt{TK \ln N / \ln K})$ has been shown, but the gap in bounds has not been closed.
An equivalent formulation of our generalized periodic regret (explained later in Section 4.2) has been briefly discussed in [10, Chapter 4.2.1], phrased as a contextual bandit problem where the algorithm competes against the best context set from a class of context sets. The possible use of EXP4 is mentioned, but an alternative algorithm with a weaker regret bound is instead discussed as it has a reasonable polynomial-time performance unlike EXP4.
While much of the existing literature assumes a single best arm, there are other efforts to look beyond this. One approach to the stochastic version of the problem is to allow reward distributions of the arms to occasionally change [9, 22]. Our work, on the other hand, is fully adversarial, and makes no assumptions on the rewards produced by the adversary.
Numerous wireless network selection approaches have been proposed. Some are centralized [3, 8, 18, 25], and hence not scalable and limited to managed networks. A number of distributed approaches have been proposed, with various limitations. Some rely on coordination from networks, while others require cooperation of wireless devices. Others assume global knowledge [20, 4, 19], or availability of some information [30, 11]. A continuous-time multi-armed bandit approach in a stochastic setting has been considered. A similar setting to ours, though non-periodic and in the stochastic setting, has also been considered.
3 Wireless Network Selection
Here, we describe the wireless network selection problem, discuss the periodicity of events in wireless environments, and formulate the network selection problem as a bandit problem.
3.1 Wireless network selection problem
We consider an environment with multiple wireless devices and heterogeneous wireless networks, such as the one depicted in Figure 1. The latter illustrates four mobile users with their (active) mobile devices, and five wireless networks, namely four WiFi networks and a cellular network (represented using 3 cellular base stations). The wireless networks have limited areas of coverage. Hence, each mobile device may have access to a different set of wireless networks depending on their location, e.g. different networks are available at home and at the office. The bandwidths of wireless networks may also vary with time. Each mobile device aims to quickly identify and associate with the best network, which may vary over time, to maximize their data rates.
Mobile users tend to have daily routines that follow repetitive patterns: going to the office each morning, lunch at noon, returning home in the evening; these activities are performed at fixed times each weekday. Figure 1 broadly depicts the daily routine of a mobile user, Alice. Network behavior, which is affected by user density, is also often repetitive and follows a regular pattern. For example, the available bandwidth of office WiFi networks is likely to be higher during lunch hours, when the office is nearly empty. A good network selection protocol learns and adapts to periodic patterns in network quality for better performance.
3.2 Wireless network selection as a bandit problem
A device must be aware of the bit rate it can observe from each network to perform an optimal network selection. While this information is unknown at the time of selection, the device can estimate the achievable bit rate by exploring the networks. The network selection problem can be seen as a multi-armed bandit problem in a multi-player setting. A mobile device is a player, and each network can be considered as an arm. Every so often (e.g. once per minute), a device selects a network (analogous to pulling an arm) and observes a bit rate (gain) for that network. The gain from other networks is unknown to the device. Given that mobile devices operate in a dynamic environment, they must continuously explore and adapt to changes, by deciding which networks to select in sequence. The goal of each device is to maximize its cumulative gain over time. Since the quality of a wireless network is affected by its number of clients, other mobile devices in the environment may be considered to be adversaries. We hence use the adversarial setting. A leading bandit algorithm in this setting is EXP3.
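To make the formulation concrete, the following is a minimal sketch of a single device running the standard EXP3 update over three networks. It assumes gains scaled to [0, 1], a fixed exploration rate `gamma`, and fixed hypothetical bit rates; none of these values come from the paper, and the class is an illustration rather than the authors' implementation.

```python
import math
import random


class Exp3:
    """Minimal EXP3 learner for gains scaled to [0, 1]."""

    def __init__(self, num_arms, gamma, seed=None):
        self.num_arms = num_arms
        self.gamma = gamma                      # exploration rate in (0, 1]
        self.weights = [1.0] * num_arms
        self.rng = random.Random(seed)

    def probabilities(self):
        """Weight-proportional distribution mixed with uniform exploration."""
        total = sum(self.weights)
        return [(1.0 - self.gamma) * w / total + self.gamma / self.num_arms
                for w in self.weights]

    def select(self):
        return self.rng.choices(range(self.num_arms),
                                weights=self.probabilities())[0]

    def update(self, arm, gain):
        # Importance-weighted estimate: only the played arm's gain is observed.
        estimate = gain / self.probabilities()[arm]
        self.weights[arm] *= math.exp(self.gamma * estimate / self.num_arms)
        # Rescaling all weights leaves the probabilities unchanged and avoids overflow.
        top = max(self.weights)
        self.weights = [w / top for w in self.weights]


# Hypothetical scaled bit rates of three networks (illustrative values only).
scaled_rates = [0.2, 0.9, 0.4]
learner = Exp3(num_arms=3, gamma=0.1, seed=0)
for _ in range(3000):
    network = learner.select()                  # "pull an arm"
    learner.update(network, scaled_rates[network])
final_probs = learner.probabilities()
```

With stationary rates the learner concentrates on the single best network, which is exactly the behavior that becomes a limitation once the best network changes periodically.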
4 Periodic Bandit Problem
In this section, we introduce the periodic bandit problem and discuss periodic regret.
We consider a general bandit problem. On each time step, an algorithm is allowed to pick any one out of $K$ possible arms, and each arm produces a certain amount of reward. These rewards are unknown to the algorithm, which can only observe the reward of the arm it picked. We aim to maximize the total reward obtained by the algorithm. We study the adversarial setting with a possibly adaptive adversary, which decides on the distribution of rewards at each time step, taking into consideration the outcomes of past random events.
Let $K$ be the number of arms. The set of arms is $[K] = \{1, \dots, K\}$. Let $x_i(t)$ be the reward earned by arm $i$ at time step $t$. Let $i_t$ be the arm played by the algorithm at time $t$. Let $T$ be the total number of time steps. The set of time steps is $[T] = \{1, \dots, T\}$. Thus, the total reward earned by the algorithm after $T$ iterations is $\sum_{t=1}^{T} x_{i_t}(t)$. The commonly used performance measure for bandit algorithms is regret. Regret compares the total reward obtained by the algorithm against a “best possible” reward “OPT” after some number of time steps $T$. Different types of regret compare the algorithm’s result to different notions of the optimal result.
We can define a form of regret where OPT is allowed to pick any arm in $[K]$ at each time step. For later reference we will refer to this as full regret, defined as follows:
$$R_{\mathrm{full}} = \max_{g : [T] \to [K]} \mathbb{E}\left[\sum_{t=1}^{T} x_{g(t)}(t) - \sum_{t=1}^{T} x_{i_t}(t)\right]$$
The above definition uses what is commonly known as pseudo-regret, rather than expected regret. For the rest of this paper, we will often refer to pseudo-regret as simply “regret”. Expectations are taken over the possible randomness of the algorithm and adversary.
In most studies of adversarial bandits, a weaker definition of regret is used. This is because full regret uses too powerful an adversary, and it is impossible to achieve better than linear expected full regret in the worst case (we include a proof in Appendix A.1). Therefore, it is common to define a notion of regret where OPT is required to use the same arm for all time steps. We refer to this as weak regret, defined as follows:
$$R_{\mathrm{weak}} = \max_{i \in [K]} \mathbb{E}\left[\sum_{t=1}^{T} x_{i}(t) - \sum_{t=1}^{T} x_{i_t}(t)\right]$$
Weak regret, however, severely limits what OPT can do, and being competitive with a benchmark that can only pick one arm and stick to it may not be a very strong result.
4.1 Periodic Regret
We can bridge the two with a periodic definition of regret. Taking the idea that a periodic choice of arms is likely to perform well in situations with periodic patterns, we can define a regret function which measures how competitive an algorithm is with the best periodic choice of arms. For example, we can say OPT is forced to play the same arm every $P$ steps. This defines a regret function as follows:
$$R_{P} = \max_{\substack{g : [T] \to [K] \\ g(t + P) = g(t)}} \mathbb{E}\left[\sum_{t=1}^{T} x_{g(t)}(t) - \sum_{t=1}^{T} x_{i_t}(t)\right]$$
As OPT may optionally still pick the same arm on all time steps, this is a generalization of weak regret. This makes for a regret value in between weak regret and full regret.
If we were competing against the regret for a specific, known value of $P$, this would be equivalent to playing $P$ independent instances of the adversarial bandits problem over approximately $T/P$ time steps each. By playing $P$ separate instances of an algorithm for weak regret, and by Theorem 6.1 in Section 6.1, we have an upper/lower bound of $\Theta(\sqrt{PTK})$.
However, if we were to consider that the “best possible” period may not be known (for example, if OPT were to consist of the best periodic function for any of the possible periods $1, 2, \dots, P$), these bounds do not apply as easily.
4.2 Generalized Periodic Regret
A generalization of the periodic case is the use of partition functions. Fix a maximum number of labels $P$. We define this upper bound for use in our analysis later on. A partition function $f : [T] \to [P]$ is a function that assigns every time step a label from $1$ to $P$. We consider two partition functions the same if their choices of label assignments are permutations of each other. The regret under function $f$ is the regret when OPT is forced to play the same arm for all time steps with the same label as assigned by $f$:
$$R_{f} = \max_{h : [P] \to [K]} \mathbb{E}\left[\sum_{t=1}^{T} x_{h(f(t))}(t) - \sum_{t=1}^{T} x_{i_t}(t)\right] \tag{1}$$
Consider a set of partition functions $F \subseteq \{f : [T] \to [P]\}$ for some $P$. $F$ is necessarily finite. The regret under the function set $F$ is the regret when OPT can choose to play using any of the partition functions in $F$. This gives the following regret definition:
$$R_{F} = \max_{f \in F} \; \max_{h : [P] \to [K]} \mathbb{E}\left[\sum_{t=1}^{T} x_{h(f(t))}(t) - \sum_{t=1}^{T} x_{i_t}(t)\right] \tag{2}$$
This definition (2) of periodic regret gives us more choice in how we want to define our potential periodic patterns to learn, through deciding on the labels on each time step for each function. We demonstrate this with our choice of partition functions in Section 7.
To model the example described earlier with periods $1, 2, \dots, P$, we can use the set of partitions $F = \{f_1, \dots, f_P\}$, where for each $p \in [P]$, $f_p(t) = (t \bmod p) + 1$.
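The benchmark that OPT achieves under a set of partition functions can be computed directly from a reward table. The sketch below is illustrative only: it assumes 0-indexed time steps and labels (so a period-$p$ partition labels step `t` with `t % p`), and the helper names are hypothetical.

```python
def periodic_partition(period):
    """Partition function of the given period: step t gets label t mod period."""
    return lambda t: t % period


def best_total_under_partition(rewards, label_of):
    """Best total reward when one arm must be used for all steps sharing a label."""
    num_arms = len(rewards[0])
    totals = {}                                  # label -> per-arm cumulative reward
    for t, row in enumerate(rewards):
        per_arm = totals.setdefault(label_of(t), [0.0] * num_arms)
        for arm, reward in enumerate(row):
            per_arm[arm] += reward
    # Labels are independent, so OPT picks the best arm separately for each label.
    return sum(max(per_arm) for per_arm in totals.values())


def best_total_under_set(rewards, partitions):
    """OPT may use any single partition function from the set."""
    return max(best_total_under_partition(rewards, f) for f in partitions)


# Rewards that cycle among 3 arms: arm (t mod 3) pays 1 at step t.
rewards = [[1.0 if t % 3 == arm else 0.0 for arm in range(3)] for t in range(600)]
per_period = {p: best_total_under_partition(rewards, periodic_partition(p))
              for p in range(1, 5)}
opt_over_set = best_total_under_set(rewards, [periodic_partition(p) for p in range(1, 5)])
```

In this instance only the period-3 partition captures the pattern; periods 1, 2 and 4 do no better than the best fixed arm, which illustrates why the benchmark takes the maximum over the whole set rather than committing to one period.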
5 The Periodic EXP4 Algorithm
We discuss the relationship between our generalized periodic setting and the problem of bandits with expert advice, and hence the applicability of EXP4 to the problem. We use this to introduce Periodic EXP4, an efficient algorithm for generalized periodic regret.
5.1 Applying Bandits with Expert Advice to Periodic Bandit Problems
Periodic bandit problems can be reduced to the problem of bandits with expert advice. In the problem of bandits with expert advice, we are given a set $G$ of $N$ experts. Each expert predicts an arm on each time step. We fix the number of time steps $T$. Thus an expert can be seen as a function $g : [T] \to [K]$. An algorithm to solve this problem would make use of each expert’s predictions on each time step, to obtain a reward competitive with the best expert in the set. Writing $x_i(t)$ for the reward of arm $i$ at time $t$ and $i_t$ for the arm played by the algorithm, this gives us the following regret definition:
$$R_{G} = \max_{g \in G} \mathbb{E}\left[\sum_{t=1}^{T} x_{g(t)}(t) - \sum_{t=1}^{T} x_{i_t}(t)\right]$$
This can be used to model all of the above notions of regret. For full regret, we have $G = [K]^{[T]}$, the set of all possible functions from $[T]$ to $[K]$. For weak regret, $G$ is the set of all constant functions from $[T]$ to $[K]$.
In the generalized periodic setting, let $F$ be the set of partition functions $f : [T] \to [P]$. For each function $f \in F$, let $H_f$ be the set of all possible mappings from the image set of $f$ to the set of arms (thus $|H_f| = K^{|f([T])|}$). Each composition $h \circ f$, $h \in H_f$, thus represents a possible mapping of the time steps to arms. Thus, for the generalized periodic setting, $G = \{h \circ f : f \in F, h \in H_f\}$.
We note that when $G_1 \subseteq G_2$, we will have $R_{G_1} \le R_{G_2}$. Let $G_{\mathrm{full}}$, $G_{\mathrm{weak}}$ and $G_F$ be the sets of functions corresponding to full regret, weak regret and generalized periodic regret under some function set $F$ respectively. Thus, for any nonempty set $F$ of partition functions, we have $R_{G_{\mathrm{weak}}} \le R_{G_F} \le R_{G_{\mathrm{full}}}$.
An existing algorithm for this problem is the EXP4 algorithm, which achieves a regret upper bound of $O(\sqrt{TK \ln N})$, where $N = |G|$. We can thus apply EXP4 directly to our problem. However, a commonly cited drawback of the EXP4 algorithm is that its running time and memory cost are at least linear in $N$. This is an issue as $N$ is often very large. For example, in the generalized periodic setting, the size of $G$ could easily be on the order of $K^P$, which is exponential in $P$. However, we show below that in the generalized periodic setting, we can devise an algorithm that is distributionally equivalent to EXP4 and can be made to run in time polynomial in $K$, $P$ and $|F|$.
The EXP4 algorithm works by assigning a weight $w_g$ (with initial value $1$) to each expert $g \in G$. The probability of playing an arm $i$ would then be $\sum_{g : g(t) = i} w_g / \sum_{g \in G} w_g$, the ratio of the combined weights of the experts agreeing to play arm $i$ to the total weight of the experts. Whenever an arm $i$ is played, each expert who suggested arm $i$ will have their weight adjusted by some multiplicative factor. More details on EXP4 are given in the paper that introduces it. Note that that paper discusses a more general form of expert advice where each expert suggests a probability vector on the arms. However, we only require the case where at each time step, each expert suggests one arm with probability $1$, and all other arms with probability $0$.
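The weight bookkeeping described above can be sketched for deterministic experts as follows. This is a simplified illustration, assuming the standard EXP4 form with uniform exploration at rate `gamma` and gains in [0, 1]; the three experts and the reward pattern are hypothetical.

```python
import math
import random


def exp4_probabilities(weights, advice, num_arms, gamma):
    """Mix the weight-proportional distribution over arms with uniform exploration."""
    total = sum(weights)
    probs = [gamma / num_arms] * num_arms
    for weight, arm in zip(weights, advice):
        probs[arm] += (1.0 - gamma) * weight / total
    return probs


def exp4_update(weights, advice, played, gain, probs, num_arms, gamma):
    """Multiply the weight of every expert that recommended the played arm."""
    factor = math.exp(gamma * (gain / probs[played]) / num_arms)
    return [w * factor if arm == played else w for w, arm in zip(weights, advice)]


# Three hypothetical deterministic experts over two arms:
# always arm 0, always arm 1, and alternate with the time step.
experts = [lambda t: 0, lambda t: 1, lambda t: t % 2]
weights = [1.0] * len(experts)
rng = random.Random(0)
for t in range(200):
    advice = [expert(t) for expert in experts]
    probs = exp4_probabilities(weights, advice, num_arms=2, gamma=0.1)
    played = rng.choices([0, 1], weights=probs)[0]
    gain = 1.0 if played == t % 2 else 0.0   # arm (t mod 2) pays 1, the other pays 0
    weights = exp4_update(weights, advice, played, gain, probs, 2, 0.1)
```

The alternating expert is rewarded whenever either constant expert is, so its weight ends at least as large as both. The cost of this direct approach is one weight per expert, which is what becomes infeasible when the expert set is exponentially large.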
5.2 Periodic EXP4, Memory and Running Time Costs
Periodic EXP4 (Algorithm 1) is distributionally equivalent to the EXP4 algorithm when run with the set of experts $G = \{h \circ f : f \in F, h \in H_f\}$. The key intuition behind this algorithm is that the generalized periodic setting produces many symmetries in the weight computation for each expert. Specifically, we take advantage of how for each partition function $f \in F$, the set of experts contains every possible combination of arm assignments to labels in the image set of $f$. This allows us to compute the probabilities that EXP4 would play each arm at each time step without computing the individual weights of every expert.
For brevity, let be the number of labels used by the function . Necessarily . The memory requirement is , which is at most . A naive implementation of the algorithm gives a running time of per time step, but with some pre-computation, the running time can be lowered as shown in Appendix A.2.
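One way to see the symmetry is that, for a fixed partition function, the sum over all label-to-arm assignments of a product of per-label terms factorizes into a product of per-label sums. The sketch below checks this numerically against brute-force enumeration. It is not the paper's Algorithm 1: it assumes experts are indexed by pairs (partition function, assignment), uses period-$p$ partitions with 0-indexed labels `t % p`, takes a hypothetical history of importance-weighted gain estimates, and omits the uniform exploration term.

```python
import itertools
import math


def label_gains(history, period, num_arms):
    """gains[l][a]: estimated gain credited to arm a on steps with label t mod period."""
    gains = [[0.0] * num_arms for _ in range(period)]
    for t, arm, estimate in history:
        gains[t % period][arm] += estimate
    return gains


def factorized_arm_weights(history, periods, num_arms, eta, t):
    """Total expert weight behind each arm at time t, without enumerating experts."""
    arm_weight = [0.0] * num_arms
    for p in periods:
        v = [[math.exp(eta * g) for g in row] for row in label_gains(history, p, num_arms)]
        label_sums = [sum(row) for row in v]
        total = math.prod(label_sums)        # sum over all assignments of this partition
        current = t % p
        for arm in range(num_arms):
            arm_weight[arm] += total * v[current][arm] / label_sums[current]
    return arm_weight


def brute_force_arm_weights(history, periods, num_arms, eta, t):
    """Same quantity, enumerating every (partition, assignment) expert explicitly."""
    arm_weight = [0.0] * num_arms
    for p in periods:
        gains = label_gains(history, p, num_arms)
        for h in itertools.product(range(num_arms), repeat=p):
            weight = math.exp(eta * sum(gains[label][h[label]] for label in range(p)))
            arm_weight[h[t % p]] += weight
    return arm_weight


def normalise(values):
    total = sum(values)
    return [v / total for v in values]


# Hypothetical history of (time step, played arm, importance-weighted gain estimate).
history = [(0, 1, 2.0), (1, 0, 1.5), (2, 2, 3.0), (3, 1, 0.5), (4, 0, 2.5), (5, 2, 1.0)]
fast = normalise(factorized_arm_weights(history, [1, 2, 3], 3, 0.1, t=6))
slow = normalise(brute_force_arm_weights(history, [1, 2, 3], 3, 0.1, t=6))
```

The factorized version stores one value per (partition function, label, arm) triple, whereas the brute-force version touches every expert, whose number grows exponentially with the number of labels.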
5.3 Correctness of Periodic EXP4
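To show correctness, we show that our algorithm produces the same probability distribution over arms as EXP4 in every time step. For a partition function $f \in F$ and a mapping $h$ from the labels of $f$ to arms, define $g_{f,h}$ as the expert which at time $t$ recommends arm $h(f(t))$ with probability $1$ and all other arms with probability $0$. We show this algorithm is distributionally equivalent to EXP4, where $G = \{g_{f,h}\}$ is the set of all such experts. In EXP4, each expert $g$ would have some weight $w_g(t)$ at time step $t$. At time step $t$, EXP4 plays arm $i$ with probability represented by the following expression:
$$\frac{\sum_{g \in G : g(t) = i} w_g(t)}{\sum_{g \in G} w_g(t)}$$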
Thus, to show that the two algorithms are distributionally equivalent, as in our algorithm, for each successive time step , we only need to show the following:
The details of this derivation are given in Appendix A.3. We can thus formally state a regret upper bound as follows (Theorem 5.3). This upper bound comes directly from EXP4’s regret bound of $O(\sqrt{TK \ln N})$, where the number of experts $N \le |F| K^{P}$.
Theorem 5.3. With $K$ arms, $T$ time steps, $|F|$ partition functions, with every function having at most $P$ labels, Periodic EXP4 gives a regret upper bound of $O(\sqrt{PTK \ln K + KT \ln |F|})$.
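The step from EXP4's bound to this statement can be written out as follows. The notation here is an assumption made for illustration: $K$ arms, $T$ time steps, a set $F$ of partition functions, $n_f \le P$ labels used by $f$, and $N$ experts counted once per pair of partition function and label-to-arm assignment.

```latex
N \;=\; \sum_{f \in F} K^{n_f} \;\le\; |F| \, K^{P}
\quad\Longrightarrow\quad
\ln N \;\le\; P \ln K + \ln |F| ,
\qquad\text{so}\qquad
\sqrt{TK \ln N} \;\le\; \sqrt{PTK \ln K + KT \ln |F|} .
```

The first term is the cost of learning an arm for each of up to $P$ labels, and the second is the cost of not knowing which partition function in $F$ is the right one.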
6 Lower Bounds
In this section, we provide lower bounds for the case of a single partition and for a set of partitions. We demonstrate that the upper and lower bounds differ by a factor of $O(\sqrt{\ln K})$, where $K$ is the number of arms.
The existing regret lower bound for the problem of bandits with expert advice is $\Omega(\sqrt{TK \ln N / \ln K})$ for $T$ time steps and $N$ experts. This lower bound is derived by dividing the time steps into equal parts. For the generalized periodic setting, as this lower bound uses an instance that can be modeled with a single partition function, it does not give immediate insight into whether having multiple different periods or partition functions increases the difficulty of the problem.
6.1 Lower Bound for a Single Partition
We consider the case with only a single partition function $f$, which partitions the time steps into labels $1, \dots, P$. The sizes of the partitions are $T_1, \dots, T_P$ respectively. Intuitively, by seeing this as $P$ separate instances of the weak regret setting with $K$ arms, and by the existing upper/lower bounds on weak regret [6, 5], we would have an upper/lower bound of $\Theta\left(\sum_{l=1}^{P} \sqrt{K T_l}\right)$. For equally sized partitions of size approximately $T/P$ each, this bound would be $\Theta(\sqrt{PTK})$.
However, while the upper bound is clearly met by running $P$ independent instances of an algorithm for weak regret, the lower bound is less clear. Even when considering it as $P$ separate instances, there is a possibility of an algorithm “reacting” to losses in other instances to play differently in the current instance, obtaining a higher total reward as a result. For completeness, we include a proof for the lower bound (Theorem 6.1) in Appendix A.4.
Fix a partition function which assigns a label to each time step. Assume that for each , there are at least time steps with label . Then the minimax pseudo-regret (1), over all algorithms and adversaries , has a lower bound as follows, for some positive constant :
If we consider the simple case where OPT may play only periodic functions from any period $1, 2, \dots, P$, it can do no worse than if it were only allowed to play at period $P$. With $K$ arms and $T$ time steps, we thus obtain a lower regret bound of $\Omega(\sqrt{PTK})$.
6.2 Lower Bound for the Generalized Periodic Setting
Let be the set of partitions, so is the number of partitions. Let be the maximum number of labels of any partition in . For sufficiently large and , we obtain a pseudo-regret(2) lower bound of . It is proved in Appendix A.5
If instead, a simple lower bound can be obtained by using only out of the arms, so we obtain a problem with arms and maximum partition size . This gives us a lower bound of . We can then merge these two lower bounds into a single expression .
6.3 Analysis of Bounds
A conclusion we can make from Section 6.2 is that having multiple periods indeed increases the difficulty of the problem: with $K$ arms, $T$ time steps and maximum period $P$, we have obtained a lower bound higher than the known upper bound of $O(\sqrt{PTK})$ had only one partition function of the maximum period been used.
With arms, time steps, partition functions, with every function having at most labels, Periodic EXP4 gives an upper bound of . On the other hand, we have a lower bound of in the case where . This gives a gap of between the two bounds. Interestingly, this log-factor is the same as the current gap between the upper and lower bounds in the problem of bandits with expert advice. This is possibly because we use a similar lower bound proof to the problem of bandits with expert advice , as well as a similar algorithm for the upper bound.
7 Experimental Evaluation
In this section, we discuss the implementation details of Periodic EXP4 and parameter values chosen, evaluate the algorithm via simulation, and compare its performance to EXP3. We show how Periodic EXP4 (a) learns periodic patterns over time under both discrete and continuous changes in network data rates, (b) outperforms EXP3, (c) is robust to noisy patterns, and (d) adapts to changes due to mobility of users.
We benchmark against “Optimal Random”, a player with prior knowledge of the actual bandwidths of each network. In each time slot, it picks a network from a probability distribution equal to the ratios of the bandwidths. For example, with network bandwidths and , the probability of picking the networks will be , and , respectively.
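A minimal sketch of the Optimal Random benchmark follows, assuming the selection probabilities are the bandwidths normalized to sum to one. The bandwidth values are hypothetical and are not the ones used in the paper's example.

```python
import random


def optimal_random_distribution(bandwidths):
    """Probability of picking each network, proportional to its bandwidth."""
    total = sum(bandwidths)
    return [b / total for b in bandwidths]


def optimal_random_pick(bandwidths, rng):
    """Draw one network index according to the bandwidth-proportional distribution."""
    probs = optimal_random_distribution(bandwidths)
    return rng.choices(range(len(bandwidths)), weights=probs)[0]


# Hypothetical bandwidths in Mbps (illustrative values only).
bandwidths = [4.0, 7.0, 22.0]
distribution = optimal_random_distribution(bandwidths)
choice = optimal_random_pick(bandwidths, random.Random(0))
```

When every device draws from this distribution, the expected number of devices on a network is proportional to its bandwidth, so the expected load per unit of bandwidth is the same on every network.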
All the algorithms are implemented in Python, using SimPy, while the core algorithm is written in C++. We use a time-varying learning rate for both Periodic EXP4 and EXP3; the learning rate tends to zero to ensure convergence, but does so slowly so that the algorithm does not take too long to learn (it learns slowly when the learning rate is very small). Although they are not prerequisites of Periodic EXP4, for simplicity, we assume that (a) a network’s bandwidth is equally shared among its clients, and (b) devices are time-synchronized. To reduce numerical error in our simulations, we substitute computations of $\ln\left(\sum_i e^{a_i}\right)$ with $\max_i a_i$. In nearly all cases, sums of exponentials in our algorithm are heavily dominated by a single term, making the values of the two expressions approximately equal. Experimentally, we find that this has negligible effects on the values computed within the algorithm.
We run simulations on synthetic data. We consider setups with 20 mobile devices and 3 wireless networks, unless otherwise specified. While the number of devices remains constant throughout the simulation run, the data rates and availability of networks may change. We assume that a network selection is performed once every minute; hence, 1440 time slots is one simulated day. All results presented are from 20 simulation runs, of 86,400 time slots each (i.e., 2 simulated months). The pattern of network behavior and/or user mobility over the first 1440 time slots is repeated 60 times; we refer to each repetition as an “iteration”.
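The remark that sums of exponentials are dominated by a single term can be illustrated by comparing a numerically stable log-sum-exp with the maximum exponent. This is one reading of the substitution, stated as an assumption; the function names and input values are illustrative.

```python
import math


def log_sum_exp(values):
    """Stable evaluation of ln(sum_i exp(values[i]))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def max_approximation(values):
    """Approximation that keeps only the dominant exponent."""
    return max(values)


# One term dominates: the two quantities agree to machine precision.
dominated = [120.0, 80.0, 75.0]
gap_dominated = log_sum_exp(dominated) - max_approximation(dominated)

# Worst case: all terms equal, where the gap is ln(n).
balanced = [0.0, 0.0, 0.0, 0.0]
gap_balanced = log_sum_exp(balanced) - max_approximation(balanced)
```

The gap always lies between 0 and the logarithm of the number of terms, so the approximation is harmless when one exponent is far ahead and least accurate when the exponents are nearly tied.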
We apply Periodic EXP4 in the generalized periodic setting. We define a partition function of period $p$ as one which divides each iteration of 1440 time slots into $p$ equal contiguous segments, labeled $1$ to $p$ in chronological order. The same labels are used for each successive repetition. Unless otherwise specified, we use the period set $\{1, 2, \dots, 24\}$. This refers to using 24 partition functions, of periods $1$ to $24$ respectively.
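The segment-based partition functions can be sketched as below. The sketch assumes 0-indexed time slots, labels starting at 1, and periods 1 through 24; when a period does not divide 1440, it uses integer rounding so that segment lengths differ by at most one slot, which is an assumption about how "equal" segments are handled.

```python
SLOTS_PER_ITERATION = 1440   # one simulated day at one selection per minute


def segment_label(t, period, slots=SLOTS_PER_ITERATION):
    """Label in 1..period of the contiguous segment containing time slot t."""
    position = t % slots                     # same labels in every repetition
    return (position * period) // slots + 1


# Segment sizes for each assumed period, counted over one iteration.
segment_sizes = {}
for period in range(1, 25):
    counts = [0] * period
    for t in range(SLOTS_PER_ITERATION):
        counts[segment_label(t, period) - 1] += 1
    segment_sizes[period] = counts
```

With period 24, each label corresponds to one simulated hour of the day, so a pattern that changes on the hour is captured exactly by that partition function.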
7.1 Evaluation Criteria
Good assignments of devices to networks divide the available bandwidth evenly among the devices. We thus evaluate the performance of the algorithms based on the lowest data rate observed by any of the devices. We compare this to the optimal allocation of devices, which maximizes the lowest data rate observed by any device. If the device with the lowest data rate observes 3 Mbps, but the lowest data rate under the optimal allocation is 5 Mbps, we say it loses 40% of its achievable gain. We refer to this percentage loss as the “distance to optimal minimum” in our results.
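The metric can be sketched as follows, assuming each network's bandwidth is shared equally among its clients and finding the optimal allocation by exhaustive search over device counts (feasible for the small setups considered here). The function names and the two-network example are illustrative.

```python
from itertools import product


def min_rate(bandwidths, counts):
    """Lowest per-device rate, ignoring networks that have no devices."""
    return min(b / c for b, c in zip(bandwidths, counts) if c > 0)


def optimal_min_rate(bandwidths, num_devices):
    """Largest achievable lowest rate over all ways to spread the devices."""
    best = 0.0
    for counts in product(range(num_devices + 1), repeat=len(bandwidths)):
        if sum(counts) == num_devices:
            best = max(best, min_rate(bandwidths, counts))
    return best


def distance_to_optimal_minimum(observed_min, optimal_min):
    """Percentage of the achievable lowest rate that is lost."""
    return 100.0 * (optimal_min - observed_min) / optimal_min


# The example from the text: 3 Mbps observed against an optimal 5 Mbps.
loss = distance_to_optimal_minimum(3.0, 5.0)

# Hypothetical setup: two networks of 6 and 3 Mbps shared by three devices.
best_lowest = optimal_min_rate([6.0, 3.0], num_devices=3)
```

In the hypothetical setup the optimum places two devices on the 6 Mbps network and one on the 3 Mbps network, so every device sees 3 Mbps.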
We do not use average cumulative gain as a performance measure because in our problem setting, average gain is maximized as long as there is at least one user in each network.
7.2 Performance Comparison of Algorithms
We consider two setups, both at an office with two WiFi networks and a cellular network. The data rates of these networks vary over time. The first setup involves discrete changes in network bandwidths at fixed time intervals (Figure 1(a)). In the second setup, the data rates vary continuously with time (Figure 1(b)). Figures 2(a) and 2(b) show that in both setups, the distance to optimal minimum of Periodic EXP4 drops over time while EXP3 shows no noticeable improvement with time.
Figure 4 for the continuous setup explains this improvement. The figure for the discrete setup is in Appendix B.1. At each time step, each user has a probability of picking each of the networks. If we consider the combined probability of picking each network, we can see that in Periodic EXP4, these probabilities converge towards the ratios of the bandwidths of the networks (Figure 3(c)). This is despite the continuous setup having no obvious best period. On the other hand, EXP3’s probabilities slowly flatten out (Figure 3(a)). This is consistent with what we would expect, as EXP3 seeks to be competitive with the best fixed-action policy, meaning that it only seeks out the best fixed arm to play.
Figure 5 shows that while EXP3 initially learns more quickly, Periodic EXP4 eventually outperforms EXP3 (which converges to the network with the best average performance), with a performance similar to Optimal Random. In our experiments, all algorithms have similar total cumulative gains, but Periodic EXP4 is fairer than EXP3, with significantly lower variance. We present these results in Appendix B.4.
7.3 Other Experiments
In Appendix B, we discuss a few more experiments, the results of which are briefly summarized as follows:
Performance in Noisy Settings: On each time step, we apply 10% Gaussian noise to each of the networks’ data rates. We find that our algorithms are largely unaffected by noise in the data, giving similar results with and without noise.
Comparison of Period Sets: We do a comparison between different possible period sets . We find that the algorithm learns more slowly with larger period sets (e.g. , as compared to ), but can converge to better results on more complex instances (instances where the bandwidth may fluctuate more wildly).
Mobility of Users: We consider a setup where users move around and have access to different sets of networks at different times. We compare Vanilla Periodic EXP4, which is oblivious to networks possibly becoming unavailable, against an optimized version, which selects only from the set of currently available networks. While the optimized version initially yields a better performance, they eventually perform equally well when the Vanilla Periodic EXP4 algorithm learns the pattern.
In this paper, we develop an efficient variant of EXP4 for the periodic bandit problem, give nearly matching upper and lower bounds for it, and demonstrate its advantages in learning periodic behavior in the context of the network selection problem.
An interesting issue raised by contrasting this paper with [9, 22] is whether non-stationary bandit problems are better modeled stochastically or adversarially. While those papers address non-stationary rewards primarily in a stochastic setting with some adversarial aspects, we tackle the periodic bandit problem in a fully adversarial setting. The adversarial setting has the benefit of not placing any constraints on the adversary; we adapt to the periodic setting only through our definition of regret. A proper comparison of stochastic and adversarial methods for network selection is a possible line of future work.
-  Anuja Meetoo Appavoo, Seth Gilbert, and Kian-Lee Tan. Shrewd selection speeds surfing: Use smart exp3! In 2018 IEEE 38th International Conference on Distributed Computing Systems (ICDCS), pages 188–199. IEEE, 2018.
-  Anuja Meetoo Appavoo, Seth Gilbert, and Kian-Lee Tan. Cooperation speeds surfing: Use co-bandit! arXiv preprint arXiv:1901.07768, 2019.
-  E. Aryafar, A. Keshavarz-Haddad, C. Joe-Wong, and M. Chiang. Max-min fair resource allocation in hetnets: Distributed algorithms and hybrid architecture. In ICDCS, 2017, pages 857–869. IEEE, 2017.
-  E. Aryafar, A. Keshavarz-Haddad, M. Wang, and M. Chiang. Rat selection games in hetnets. In INFOCOM, pages 998–1006. IEEE, 2013.
-  Jean-Yves Audibert and Sébastien Bubeck. Minimax policies for adversarial and stochastic bandits. In COLT, pages 217–226, 2009.
-  P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.
-  O. Avner and S. Mannor. Multi-user lax communications: A multi-armed bandit approach. In IEEE INFOCOM 2016 - The 35th Annual IEEE International Conference on Computer Communications, pages 1–9, April 2016. doi:10.1109/INFOCOM.2016.7524557.
-  Y. Bejerano, S-J. Han, and L. E. Li. Fairness and load balancing in wireless lans using association control. In MobiCom, pages 315–329. ACM, 2004.
-  Omar Besbes, Yonatan Gur, and Assaf Zeevi. Stochastic multi-armed-bandit problem with non-stationary rewards. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 199–207. Curran Associates, Inc., 2014. URL: http://papers.nips.cc/paper/5378-stochastic-multi-armed-bandit-problem-with-non-stationary-rewards.pdf.
-  Sébastien Bubeck, Nicolo Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012.
-  M. H. Cheung, F. Hou, J. Huang, and R. Southwell. Congestion-aware distributed network selection for integrated cellular and wi-fi networks. arXiv preprint arXiv:1703.00216, 2017.
-  S. Deng, A. Sivaraman, and H. Balakrishnan. All your network are belong to us: A transport framework for mobile network selection. In HotMobile. ACM, 2014.
-  Yi Gai, Bhaskar Krishnamachari, and Rahul Jain. Learning multiuser channel allocations in cognitive radio networks: A combinatorial multi-armed bandit formulation. In New Frontiers in Dynamic Spectrum, 2010 IEEE Symposium on, pages 1–9. IEEE, 2010.
-  D. Golovin, M. Faulkner, and A. Krause. Online distributed sensor selection. In Proceedings of the 9th ACM/IEEE International Conference on Information Processing in Sensor Networks, pages 220–231. ACM, 2010.
-  B. Kauffmann, F. Baccelli, A. Chaintreau, V. Mhatre, K. Papagiannaki, and C. Diot. Measurement-based self organization of interfering 802.11 wireless access networks. In INFOCOM 2007, pages 1451–1459. IEEE, 2007.
-  R. Kleinberg, G. Piliouras, and E. Tardos. Multiplicative updates outperform generic no-regret learning in congestion games. In ACM STOC, pages 533–542. ACM, 2009.
-  S. Maghsudi and S. Stanczak. Relay selection with no side information: An adversarial bandit approach. In WCNC, pages 715–720. IEEE, 2013.
-  A. Mishra, V. Brik, S. Banerjee, A. Srinivasan, and W. A. Arbaugh. A client-driven approach for channel management in wireless lans. In Infocom, 2006.
-  E Monsef, A. Keshavarz-Haddad, E. Aryafar, J. Saniie, and M. Chiang. Convergence properties of general network selection games. In INFOCOM, pages 1445–1453. IEEE, 2015.
-  D. Niyato and E. Hossain. Dynamics of network selection in heterogeneous wireless networks: An evolutionary game approach. TVT, 58(4):2008–2017, 2009.
-  Basil Cameron Rennie and Annette Jane Dobson. On stirling numbers of the second kind. Journal of Combinatorial Theory, 7(2):116–121, 1969.
-  Robin Allesiardo, Raphaël Feraud, and Odalric-Ambrym Maillard. The non-stationary stochastic multi-armed bandit problem. International Journal of Data Science and Analytics, 03 2017. doi:10.1007/s41060-017-0050-5.
-  Yevgeny Seldin and Gábor Lugosi. A lower bound for multi-armed bandits with expert advice. 13th European Workshop on Reinforcement Learning (EWRL), 2016.
-  SimPy. SimPy - Event discrete simulation for Python, 2016. https://simpy.readthedocs.io/, accessed 2018-19-12.
-  K. Sui, M. Zhou, D. Liu, M. Ma, D. Pei, Y. Zhao, Z. Li, and T. Moscibroda. Characterizing and improving wifi latency in large-scale operational networks. In MobiSys, pages 347–360. ACM, 2016.
-  Vasilis Syrgkanis, Akshay Krishnamurthy, and Robert Schapire. Efficient algorithms for adversarial contextual learning. In International Conference on Machine Learning, pages 2159–2168, 2016.
-  C. Tekin and M. Liu. Performance and convergence of multi-user online learning. In GAMENETS, pages 321–336. Springer, 2011.
-  H. A. Tran, S. Hoceini, A. Mellouk, J. Perez, and S. Zeadally. Qoe-based server selection for content distribution networks. IEEE Transactions on Computers, 63(11):2803–2815, 2014.
-  Q. Wu, Z. Du, P. Yang, Y.-D. Yao, and J. Wang. Traffic-aware online network selection in heterogeneous wireless networks. TVT, 65(1):381–397, 2016.
-  K. Zhu, D. Niyato, and P. Wang. Network selection in heterogeneous wireless networks: Evolution with incomplete information. In WCNC, pages 1–6. IEEE, 2010.
Appendix A Theoretical Results and Proofs
A.1 Lower Bound on Worst Case Full Regret
We construct a proof using a deterministic oblivious adversary for full generality. Proofs using a randomized oblivious adversary or adaptive adversary are simpler. A deterministic adversary must select the full sequence of rewards prior to the first round. This is in contrast to an adaptive adversary, a more powerful adversary which is allowed to select rewards each round with full knowledge of the outcomes of random events occurring prior to the round.
Let $T$ be the number of time steps and $K$ be the number of arms. Fix an algorithm $\mathcal{A}$. We show that there exists a problem instance (a predetermined sequence of rewards for each arm) such that algorithm $\mathcal{A}$ obtains an expected full pseudo-regret of at least $T(1 - 1/K)$.
We construct a problem instance which, for each time step $t$ from $1$ to $T$, has one arm $i_t$ which gives a reward of $1$, while all other arms give a reward of $0$. We construct each $i_t$ based on the algorithm $\mathcal{A}$ (but not on the algorithm’s choices) inductively as follows:
At the start of the algorithm, the algorithm plays arms $1, \dots, K$ with probabilities $p_1, \dots, p_K$ respectively. Define $i_1$ to be the arm with the lowest probability of being played.
We maintain the following invariant with parameter $t$: when running the algorithm on the constructed problem instance from time steps $1$ to $t$, the expected total reward of the algorithm is at most $t/K$. With the definition of $i_1$ above, the lowest of the $K$ probabilities is at most $1/K$, so this invariant holds for $t = 1$.
Now fix any later time step $t > 1$. The algorithm $\mathcal{A}$’s decision on time step $t$ can only be based on past rewards and the sequence of arms played by the algorithm on time steps up to $t-1$. As our choices of $i_1$ to $i_{t-1}$ are fixed, the only randomness comes from the algorithm’s choices of arms up to this point. Let $s$ represent a possible sequence of arms played by the algorithm for the first $t-1$ time steps. This event occurs with probability $P(s)$, and the algorithm would accumulate a total reward of $G(s)$. Assuming by induction that the invariant holds on time steps up to $t-1$, we can state that $\sum_s P(s)\,G(s) \le (t-1)/K$.
Now on time step $t$, based on its past choices of arms $s$, the algorithm constructs a probability vector $(p_1(s), \dots, p_K(s))$ representing the probabilities of playing each arm on time step $t$. Now we construct $i_t$ independently of $s$. For any fixed $i_t$, the expected reward of the algorithm after time step $t$ is given by:
$$\sum_s P(s)\,\bigl(G(s) + p_{i_t}(s)\bigr) = \sum_s P(s)\,G(s) + \sum_s P(s)\,p_{i_t}(s).$$
Since $\sum_{j=1}^{K} p_j(s) = 1$ for every $s$, we have the following:
$$\sum_{j=1}^{K} \sum_s P(s)\,p_j(s) = \sum_s P(s) = 1,$$
so there must exist a $j$ such that $\sum_s P(s)\,p_j(s) \le 1/K$. By selecting such a $j$ as $i_t$, we can conclude that:
$$\sum_s P(s)\,\bigl(G(s) + p_{i_t}(s)\bigr) \le \frac{t-1}{K} + \frac{1}{K} = \frac{t}{K},$$
completing the inductive proof. Thus, since playing arm $i_t$ at every time step $t$ yields a total reward of $T$, we can lower bound the full pseudo-regret as follows:
$$T - \sum_s P(s)\,G(s) \;\ge\; T - \frac{T}{K} = T\left(1 - \frac{1}{K}\right),$$
where the sum here ranges over sequences $s$ of all $T$ arms played.
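The inductive construction can be checked numerically on small instances. The sketch below enumerates every sequence of arms the algorithm could have played, computes the overall probability of each arm being played on the current step, and makes the least likely arm the only rewarding one. The interface `policy(arms, rewards)`, the function names, and the example greedy policy are assumptions made for this illustration; the enumeration is exponential in the number of time steps, so it is meant only for tiny examples.

```python
def build_instance(policy, n_arms, horizon):
    """Return (winners, expected_reward) for the instance built against `policy`.

    policy(arms, rewards) -> list of n_arms probabilities, where `arms` and
    `rewards` are the tuples of arms played and rewards received so far.
    """
    histories = {(): 1.0}  # sequence of arms played -> probability of that sequence
    winners = []           # winners[t] is the single arm that pays 1 on step t
    expected_reward = 0.0
    for _ in range(horizon):
        mass = [0.0] * n_arms  # overall probability of playing each arm this step
        next_histories = {}
        for arms, prob in histories.items():
            # Bandit feedback: the policy only sees rewards of the arms it played.
            rewards = tuple(1.0 if a == w else 0.0 for a, w in zip(arms, winners))
            p = policy(arms, rewards)
            for j in range(n_arms):
                mass[j] += prob * p[j]
                if prob * p[j] > 0.0:
                    next_histories[arms + (j,)] = prob * p[j]
        # The masses sum to 1, so the smallest one is at most 1 / n_arms.
        winner = min(range(n_arms), key=mass.__getitem__)
        winners.append(winner)
        expected_reward += mass[winner]
        histories = next_histories
    return winners, expected_reward


def make_greedy(n_arms, explore):
    """Hypothetical example policy: favour the arm with the most observed reward."""
    def policy(arms, rewards):
        totals = [0.0] * n_arms
        for a, r in zip(arms, rewards):
            totals[a] += r
        best = max(range(n_arms), key=totals.__getitem__)
        p = [explore / n_arms] * n_arms
        p[best] += 1.0 - explore
        return p
    return policy


winners, expected_reward = build_instance(make_greedy(3, 0.3), n_arms=3, horizon=6)
```

For any policy passed in, `expected_reward` is at most `horizon / n_arms`, while switching to the rewarding arm on every step earns `horizon`.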
A.2 Optimized Periodic EXP4
In this section, we give an optimized implementation of Periodic EXP4 to show that the running time bounds can be improved upon with some pre-computation. With this optimization, we have a running time of for pre-computation, and per time step later on. However, we note that such an implementation can potentially increase the amount of numerical error. The reduction in running time largely comes from optimizing the following computation via the introduction of variables and :
A.3 Correctness of Periodic EXP4
In this section, we complete the proof of correctness of Periodic EXP4 as mentioned in Section 5.3. As described before, we show that our algorithm produces the same probability distribution over arms as EXP4 in every time step. In EXP4, is the expert which at time recommends arm with probability and all other arms with probability . We show that Periodic EXP4 is distributionally equivalent to EXP4, where . In EXP4, each expert would have some weight at time step . At time step , EXP4 plays arm with probability represented by the following expression:
To show that the two algorithms are distributionally equivalent, as in Periodic EXP4, for each successive time step , we only need to show the following:
|is defined in Algorithm 1.|
We first note that for each , , , from the way is defined in Periodic EXP4 (Algorithm 1), we have the following expression:
We then note that in EXP4,
|Divide up by label|
where comes from Periodic EXP4 (Algorithm 1). We then note that:
and that, (the last step is as contains every function )
|Extract current label|
Thus we have,
This shows that the expression for we have defined in Periodic EXP4 corresponds to the sum of weights of all the “experts” from EXP4 which agree to play arm on time step . Thus this concludes the proof of distributional equivalence between the algorithms.
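The key step, that the total weight of all experts agreeing on an arm collapses to a per-label quantity, can be checked numerically. The sketch below covers a single partition function only and assumes that each expert's weight factors as a product of per-label, per-arm weights (as happens with exponential weights over additive gain estimates and uniform initial weights). The table `w[l][i]` and both function names are hypothetical and need not match the variables of Algorithm 1.

```python
from itertools import product
from math import isclose, prod


def brute_force_arm_weights(w, current_label):
    """Sum the weights of all experts, grouped by the arm they play on current_label.

    An expert is a function from labels to arms, stored as a tuple where
    expert[l] is the arm played on label l; its weight is prod_l w[l][expert[l]].
    """
    n_labels, n_arms = len(w), len(w[0])
    totals = [0.0] * n_arms
    for expert in product(range(n_arms), repeat=n_labels):
        weight = prod(w[l][expert[l]] for l in range(n_labels))
        totals[expert[current_label]] += weight
    return totals


def factored_arm_weights(w, current_label):
    """Same totals, computed by extracting the current label from the product."""
    other_labels = prod(sum(row) for l, row in enumerate(w) if l != current_label)
    return [w[current_label][i] * other_labels for i in range(len(w[0]))]


# Hypothetical weights for 4 labels and 3 arms.
w = [[1.0, 2.0, 0.5], [0.3, 1.5, 1.0], [2.0, 0.1, 0.7], [1.0, 1.0, 3.0]]
agree = all(
    isclose(a, b)
    for label in range(len(w))
    for a, b in zip(brute_force_arm_weights(w, label), factored_arm_weights(w, label))
)
```

The brute-force sum ranges over every expert (exponentially many in the number of labels), whereas the factored form needs only the row sums. After normalization the common factor cancels, so under these assumptions the probability of each arm depends only on the weights of the current label.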
A.4 Worst Case Regret Lower Bound on a Single Partition
This is based on the pseudo-regret in the setting with a single partition function (1).
We make use of a modified formulation by [23] of a theorem originally presented in [6].
Theorem A.4 ([23, 6]). Assume that the number of time steps . Then there exists a randomized oblivious adversary , such that for algorithms ,
We note that this randomized oblivious adversary picks arms independently of the choices made by algorithm . Details on the construction of this adversary are given in . is a constant independent of any parameter.
We now proceed with the proof of Theorem 6.1. For each label , we consider a bandit problem of length . If we assume each , then by Theorem A.4, there exists an adversary such that for any algorithm running on a bandit problem of length ,
We now construct an adversary for a bandit problem of length on partition function . For each time step , let . The adversary takes ’s advice to generate a randomized reward vector for time step .
Now consider any algorithm for a bandit problem of length on partition function , and run it against the adversary . Suppose there exists a label such that:
We can then consider a “restriction” of algorithm that plays on a bandit problem of length . This algorithm would play exactly what algorithm would play on the time steps of label , while simulating ’s plays against adversary internally on all other time steps. Therefore, against the adversary , the algorithm would achieve a pseudo-regret under , which contradicts our choice of adversary .
We can thus conclude that:
A.5 Lower Bound on Worst Case Generalized Periodic Regret
We show a lower bound on the worst-case pseudo-regret in the generalized periodic setting (2) based on , , and . In order to make use of Theorem 6.1 later on in the proof, we first make the base assumption that is sufficiently large. Specifically, we require that .
Fix any integer . Suppose (more labels than arms). We split the time steps into equally sized sections. Now we let be the set of all partitions of into parts. As , we can define this way. Each partition in assigns one label in to each of the sections.
This set of partitions covers all possible ways to assign a different arm to each of the parts.
Consider any assignment , representing each possible assignment of arms to the parts. We have a partition function that partitions into the pre-image sets of by assigning each a separate label. The labels can then be assigned to arms accordingly to represent . ∎
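The correspondence between assignments and partitions used in this argument can be illustrated on a small example. The sketch below maps an assignment of arms to sections to the partition formed by its pre-image sets, enumerates the partitions obtained from all assignments that use every part, and compares their number with the Stirling number of the second kind, which counts the ways to split a set into a given number of non-empty parts. The names `n_sections` and `n_parts` and the small sizes are hypothetical choices for the illustration.

```python
from itertools import product


def stirling2(n, k):
    """Number of ways to partition n items into k non-empty unlabeled parts."""
    if n == k:
        return 1
    if k == 0 or k > n:
        return 0
    # Either the last item joins one of k existing parts, or it forms a new part.
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)


def partition_of(assignment):
    """Partition of section indices into the pre-image sets of an assignment."""
    blocks = {}
    for section, arm in enumerate(assignment):
        blocks.setdefault(arm, []).append(section)
    return frozenset(frozenset(b) for b in blocks.values())


def partitions_into_parts(n_sections, n_parts):
    """All partitions of the sections into exactly n_parts non-empty parts."""
    onto = (
        a
        for a in product(range(n_parts), repeat=n_sections)
        if len(set(a)) == n_parts
    )
    return {partition_of(a) for a in onto}


parts = partitions_into_parts(n_sections=5, n_parts=3)
```

Every assignment that uses all the parts lands on one of these partitions, and relabelling the parts recovers the assignment, which is the covering property stated above.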
As OPT can choose an arm for each of the sections independently, we obtain a regret lower bound of by Theorem 6.1. Now we express in terms of . As is the number of partitions of into parts, we have , where refers to the Stirling numbers of the second kind. Using the upper and lower bounds for from [21], we can bound , and thus , as follows:
This expression validates our use of Theorem 6.1, as this with our assumption implies that , and thus for any , there are at least time steps associated with each label. From the bounds on , we have that