Unimodal Thompson Sampling for Graph-Structured Arms

11/17/2016 · Stefano Paladino, et al. · Politecnico di Milano

We present, to the best of our knowledge, the first Bayesian algorithm for unimodal Multi-Armed Bandit (MAB) problems with graph structure. In this setting, each arm corresponds to a node of a graph and each edge provides a relationship, unknown to the learner, between two nodes in terms of expected reward. Furthermore, for any node of the graph there is a path leading to the unique node providing the maximum expected reward, along which the expected reward is monotonically increasing. Previous results on this setting describe the behavior of frequentist MAB algorithms. In this paper, we design a Thompson Sampling-based algorithm whose asymptotic pseudo-regret matches the lower bound for the considered setting. We show that, as happens in a wide number of scenarios, Bayesian MAB algorithms dramatically outperform frequentist ones. In particular, we provide a thorough experimental evaluation of the performance of our algorithm and of state-of-the-art algorithms as the properties of the graph vary.


Introduction

Multi-Armed Bandit (MAB) algorithms [Auer, Cesa-Bianchi, and Fischer2002] have been proven to provide effective solutions for a wide range of applications fitting the sequential decision-making scenario. In this framework, at each round over a finite time horizon, the learner selects an action (usually called arm) from a finite set and observes only the reward corresponding to the choice she made. The goal of a MAB algorithm is to converge to the optimal arm, i.e., the one with the highest expected reward, while minimizing the loss incurred in the learning process. Its performance is therefore measured through its expected regret, defined as the difference between the expected reward achieved by an oracle algorithm always selecting the optimal arm and the one achieved by the considered algorithm. We focus on the so-called Unimodal MAB (UMAB), introduced in [Combes and Proutiere2014a], in which each arm corresponds to a node of a graph and each edge is associated with a relationship specifying which of its two nodes gives the largest expected reward (thus providing a partial ordering over the arm space). Furthermore, from any node there is a path leading to the unique node with the maximum expected reward, along which the expected reward is monotonically increasing. While the graph structure may (but need not) be known a priori by the UMAB algorithm, the relationship defined over the edges is discovered during the learning process. In the present paper, we propose a novel algorithm relying on the Bayesian learning approach for a generic UMAB setting.

Models presenting a graph structure have attracted increasing interest in recent years due to the spread of social networks. Indeed, the relationships among the entities of a social network have a natural graph structure. A practical problem in this scenario is the targeted advertisement problem, whose goal is to discover the part of the network that is interested in a given product. This task is heavily influenced by the graph structure: in social networks people tend to have characteristics similar to those of their friends (i.e., neighboring nodes in the graph), therefore the interests of people in a social network change smoothly and neighboring nodes in the graph look similar to each other [McPherson, Smith-Lovin, and Cook2001, Crandall et al.2008]. More specifically, an advertiser aims at finding those users that maximize the ad expected revenue (i.e., the product between click probability and value per click), while at the same time reducing the number of times the advertisement is presented to people not interested in its content.

Under the assumption of unimodal expected reward, the learner can move from low expected rewards to high ones just by climbing them in the graph, avoiding the need for a uniform exploration over all the graph nodes. This assumption reduces the complexity of the search for the optimal arm, since the learning algorithm can avoid pulling the arms corresponding to some subset of non-optimal nodes, thus reducing the regret. Other applications might benefit from this structure, e.g., recommender systems, which aim at matching items with the users who are likely to enjoy them. Similarly, the use of the unimodal graph structure might provide more meaningful recommendations without testing all the users in the social network. Finally, notice that unimodal problems with a single variable, e.g., sequential pricing [Jia and Mannor2011], bidding in online sponsored search auctions [Edelman and Ostrovsky2007], and single-peaked preferences in economics and voting settings [Mas-Collel, Whinston, and Green1995], are graph-structured problems in which the graph is a line.

Frequentist approaches for UMAB with graph structure are proposed in [Jia and Mannor2011] and [Combes and Proutiere2014a]. Jia and Mannor [Jia and Mannor2011] introduce the GLSE algorithm; however, GLSE performs better than classical bandit algorithms only when the number of arms is sufficiently large. Combes and Proutiere [Combes and Proutiere2014a] present the OSUB algorithm, based on KLUCB, achieving asymptotically optimal regret and outperforming GLSE in settings with a few arms. To the best of our knowledge, no Bayesian approach has been proposed for unimodal bandit settings, including the UMAB setting we study. However, it is well known that Bayesian MAB algorithms (the most popular being Thompson Sampling, TS) usually suffer a regret of the same order as the best frequentist ones (e.g., in unstructured settings [Kaufmann, Korda, and Munos2012]), but they outperform the frequentist methods in a wide range of problems (e.g., in bandit problems without structure [Chapelle and Li2011] and in bandit problems with budget [Xia et al.2015]). Furthermore, in problems with structure, the classical Thompson Sampling (not exploiting the problem structure) may even outperform frequentist algorithms exploiting the problem structure. For this reason, in this paper we explore Bayesian approaches for the UMAB setting. More precisely, we provide the following original contributions:

  • we design a novel Bayesian MAB algorithm, called UTS and based on the TS algorithm;

  • we derive a tight upper bound over the pseudo–regret for UTS, which asymptotically matches the lower bound for the UMAB setting;

  • we describe a wide experimental campaign showing that, in applicative scenarios, UTS performs better than state-of-the-art algorithms, and evaluating how the performance of the algorithms (ours and those of the state of the art) varies as the graph structure properties vary.

Related work

Here, we mention the main works related to ours. Some works deal with unimodal reward functions in the continuous-armed bandit setting [Jia and Mannor2011, Combes and Proutiere2014b, Kleinberg, Slivkins, and Upfal2008]. In [Jia and Mannor2011], a successive elimination algorithm, called LSE, is proposed; its regret guarantee requires assumptions on the minimum local decrease and increase of the expected reward. Combes and Proutiere [Combes and Proutiere2014b] consider stochastic bandit problems with a continuous set of arms where the expected reward is a continuous and unimodal function of the arm. They propose the SP algorithm, based on a stochastic pentachotomy procedure that progressively narrows the search space. Unimodal MABs on metric spaces are studied in [Kleinberg, Slivkins, and Upfal2008].

An application-dependent solution for recommendation systems, which exploits the graph similarity of social networks for targeted advertisement, has been proposed in [Valko et al.2014]. Similar information has been considered in [Caron and Bhagat2013], where the problem of cold-start users (i.e., new users) is studied. Another type of structure, considered in sequential pricing, is the monotonicity of the conversion rate in the price [Trovò et al.2015]. Interestingly, the assumptions of monotonicity and unimodality are orthogonal, neither being a special case of the other; therefore, the results for the monotonic setting cannot be used in unimodal bandits. In [Alon et al.2013, Mannor and Shamir2011], a graph structure over the arm feedback in an adversarial setting is studied. More precisely, those works assume that the graph encodes side observations over the realized rewards, and not a relationship over the expected values of the arms.

Problem Formulation

A learner receives as input a finite undirected graph G = (V, E), whose vertices V = {1, ..., K} correspond to the arms; an edge (i, j) ∈ E exists only if there is a direct partial-order relationship between the expected rewards of arms i and j. The learner knows a priori the nodes and the edges (i.e., she knows the graph), but, for each edge, she does not know a priori which node of the edge has the largest expected reward (i.e., she does not know the ordering relationship). At each round t over a time horizon T, the learner selects an arm i_t and gains the corresponding reward x_{i_t,t}. This reward is drawn from an i.i.d. random variable (i.e., we consider a stochastic MAB setting) characterized by an unknown distribution with finite known support (as customary in MAB settings, from now on we consider rewards in [0, 1]) and by an unknown expected value μ_i. We assume that there is a single optimal arm, i.e., there exists a unique arm i* such that μ_{i*} > μ_i for every i ≠ i*, and, for the sake of notation, we denote μ* := μ_{i*}.

Here, we analyze a graph bandit setting with the unimodality property, defined as follows:

Definition 1.

A graph unimodal MAB (UMAB) setting is a graph bandit setting G such that, for each sub-optimal arm i, there exists a finite path p = (i_1 = i, i_2, ..., i_m = i*) such that (i_k, i_{k+1}) ∈ E and μ_{i_k} < μ_{i_{k+1}} for each k ∈ {1, ..., m − 1}.

This definition ensures that, by following a path of increasing expected rewards in the graph, one is able to reach the optimal arm without getting stuck in local optima. Note that the unimodality property implies that the graph is connected; therefore, we consider only connected graphs from here on.
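
On a finite connected graph with a unique optimal arm, the property of Definition 1 amounts to requiring that no sub-optimal arm is a local maximum, i.e., every sub-optimal arm has at least one neighbor with strictly larger expected reward. The following minimal Python sketch checks this property; the dictionary-based graph encoding and the function name are our own illustration, not part of the paper:

def is_unimodal(graph, means):
    # graph: dict mapping each arm index to the list of its neighbors
    # means: dict mapping each arm index to its expected reward mu_i
    best = max(graph, key=lambda i: means[i])  # the unique optimal arm i*
    for i in graph:
        if i == best:
            continue
        # a sub-optimal arm with no strictly better neighbor is a local maximum,
        # hence no increasing path towards i* can start from it
        if not any(means[j] > means[i] for j in graph[i]):
            return False
    return True

# Example: a 3-node line graph 0 - 1 - 2 with the optimum in the middle
print(is_unimodal({0: [1], 1: [0, 2], 2: [1]}, {0: 0.2, 1: 0.9, 2: 0.4}))  # True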

A policy U over a UMAB setting is a procedure that selects, at each round t, an arm i_t on the basis of the history h_t, i.e., the sequence of arms selected and rewards gained in the past. The pseudo-regret of a generic policy U over a UMAB setting is defined as:

R_T(U) := T μ* − E[ Σ_{t=1}^{T} μ_{i_t} ],   (1)

where the expected value is taken w.r.t. the stochasticity of the gained rewards and of the policy U.

Let us define the neighborhood of arm i as N_i := {j ∈ V : (i, j) ∈ E}, i.e., the set of the indexes of the arms connected to arm i by an edge. It has been shown in [Combes and Proutiere2014a] that the problem of learning in a UMAB setting presents the following lower bound over the regret:

Theorem 1.

Let U be a uniformly good policy, i.e., a policy such that the expected number of pulls of every sub-optimal arm is o(T^a) for each a > 0. Given a UMAB setting we have:

lim inf_{T→∞} R_T(U) / log(T) ≥ Σ_{i ∈ N_{i*}} (μ* − μ_i) / KL(μ_i, μ*),   (2)

where

KL(μ_i, μ*) := μ_i log(μ_i / μ*) + (1 − μ_i) log((1 − μ_i) / (1 − μ*)),

i.e., the Kullback–Leibler divergence of two Bernoulli distributions with means μ_i and μ*, respectively.

This result is similar to the one provided in [Lai and Robbins1985], with the only difference that the summation is restricted to the arms lying in the neighborhood N_{i*} of the optimal arm; it reduces to the classical bound when the optimal arm is connected to all the others or the graph is completely connected. We would like to point out that, by relying on the assumption of having a single maximum of the expected rewards, we also ensure that the optimal arm neighborhood is uniquely defined and, thus, the lower bound inequality in Equation 2 is well defined.
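
To make the bound concrete, the following minimal Python sketch computes the coefficient of log(T) appearing on the right-hand side of Equation 2 for Bernoulli rewards; the function names and the dictionary-based encoding are our own illustration, not part of the paper:

import math

def bernoulli_kl(p, q, eps=1e-12):
    # Kullback-Leibler divergence between Bernoulli(p) and Bernoulli(q)
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def lower_bound_coefficient(graph, means):
    # Coefficient of log(T) in Equation 2: the sum runs only over the
    # neighbors of the optimal arm, not over all sub-optimal arms
    best = max(graph, key=lambda i: means[i])
    return sum((means[best] - means[i]) / bernoulli_kl(means[i], means[best])
               for i in graph[best])

For a fully connected graph this coefficient coincides with the classical Lai-Robbins constant, while for sparse graphs it involves only the few arms adjacent to the optimal one.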

The UTS algorithm

We describe the UTS algorithm and we show that its regret is asymptotically optimal, i.e., it asymptotically matches the lower bound of Theorem 1. The algorithm is an extension of Thompson Sampling [Thompson1933] that exploits the graph structure and the unimodality property of the UMAB setting. Basically, the rationale of the algorithm is to apply a simple variation of the TS algorithm only to the arms in the neighborhood of the arm with the highest empirical mean reward, called the leader.

The UTS pseudo–code

1:  Input: UMAB setting G, horizon T, priors π_i for each arm i ∈ V
2:  for t = 1, ..., T do
3:     Compute the empirical mean μ̂_i(t) for each i ∈ V
4:     Find the leader i*_t = arg max_{i ∈ V} μ̂_i(t)
5:     if l_{i*_t}(t) is a multiple of |N_{i*_t}| + 1 then
6:        Collect reward x_{i*_t, t}
7:     else
8:        Draw θ_i(t) from π_i(t) for each i ∈ N_{i*_t} ∪ {i*_t}
9:        Collect reward x_{î_t, t}, where î_t = arg max_{i ∈ N_{i*_t} ∪ {i*_t}} θ_i(t)
Algorithm 1 UTS

The pseudo-code of the UTS algorithm is presented in Algorithm 1. The algorithm receives as input the graph structure G, the time horizon T, and a Bayesian prior π_i for each expected reward μ_i. At each round t, the algorithm computes the empirical expected reward μ̂_i(t) of each arm i (Line 3):

μ̂_i(t) = S_i(t) / T_i(t),

where S_i(t) := Σ_{s=1}^{t−1} x_{i_s,s} 1{i_s = i} is the cumulative reward of arm i up to round t and T_i(t) := Σ_{s=1}^{t−1} 1{i_s = i} is the number of times arm i has been pulled up to round t (1{·} denotes the indicator function). After that, UTS selects the arm denoted as the leader for round t, i.e., the one having the maximum empirical expected reward:

i*_t = arg max_{i ∈ V} μ̂_i(t).   (3)

Once the leader has been chosen, we restrict the selection procedure to it and its neighborhood, considering only the arms with indexes in N_{i*_t} ∪ {i*_t}. Denote with l_i(t) the number of times arm i has been selected as leader before round t. If l_{i*_t}(t) is a multiple of |N_{i*_t}| + 1, where |·| denotes the cardinality operator, then the leader itself is pulled and its reward is gained (Line 6). Otherwise, the TS algorithm is performed over the arms in N_{i*_t} ∪ {i*_t} (Lines 8-9).

Basically, under the assumption of having a prior π_i over μ_i, we can compute the posterior distribution π_i(t) of μ_i after t rounds, using the information gathered from the rounds in which arm i has been pulled. We denote with θ_i(t) a sample drawn from π_i(t), called Thompson sample. For instance, for Bernoulli rewards and assuming uniform priors, we have that π_i(t) = Beta(S_i(t) + 1, T_i(t) − S_i(t) + 1), where Beta(α, β) is the beta distribution with parameters α and β. Finally, the UTS algorithm pulls the arm with the largest Thompson sample and collects the corresponding reward. See [Kaufmann, Korda, and Munos2012] for further details.
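
For concreteness, the following Python sketch simulates the main loop of Algorithm 1 on Bernoulli arms with uniform Beta priors. It is a minimal illustration under our own assumptions: the dictionary-based graph encoding, the tie-breaking, and the treatment of unpulled arms are ours, not from the paper.

import numpy as np

def uts(graph, means, horizon, seed=0):
    # graph: dict arm -> list of neighboring arms
    # means: true Bernoulli means, used here only to simulate the environment
    rng = np.random.default_rng(seed)
    arms = list(graph)
    pulls = {i: 0 for i in arms}         # T_i(t)
    successes = {i: 0 for i in arms}     # S_i(t)
    leader_count = {i: 0 for i in arms}  # l_i(t)
    total_reward = 0
    for t in range(horizon):
        # empirical means; unpulled arms get priority so that they become leader at least once
        emp = {i: successes[i] / pulls[i] if pulls[i] > 0 else float("inf") for i in arms}
        leader = max(arms, key=lambda i: emp[i])
        leader_count[leader] += 1
        candidates = [leader] + list(graph[leader])
        if leader_count[leader] % (len(graph[leader]) + 1) == 0:
            chosen = leader  # pull the leader in a fixed fraction of its leadership rounds
        else:
            # Thompson samples from the Beta posteriors, restricted to the leader's neighborhood
            samples = {i: rng.beta(successes[i] + 1, pulls[i] - successes[i] + 1)
                       for i in candidates}
            chosen = max(candidates, key=lambda i: samples[i])
        reward = rng.binomial(1, means[chosen])
        pulls[chosen] += 1
        successes[chosen] += reward
        total_reward += reward
    return total_reward

Note that the modulo-based rule on Line 5 of Algorithm 1, reproduced in the test above, makes the leader be pulled in a fixed fraction of the rounds in which it is the leader, a fact used in the finite-time analysis below.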

Remark 1.

Assuming that the UTS algorithm receives as input the whole graph is unnecessary. The algorithm just requires an oracle that, at each round t, is able to return the neighborhood N_{i*_t} of the arm i*_t which is currently the leader. This is crucial in all the applications in which the graph is discovered by means of a series of queries and the queries have a non-negligible cost (e.g., in social networks a query might be computationally costly). Finally, we remark that the frequentist counterpart of our algorithm (i.e., the OSUB algorithm) requires the computation of the maximum node degree of the graph, thus requiring at least an initial analysis of the entire graph.

Finite–time analysis of UTS

Theorem 2.

Given a UMAB setting, the expected pseudo-regret of the UTS algorithm satisfies, for every ε > 0:

R_T(UTS) ≤ (1 + ε) Σ_{i ∈ N_{i*}} (μ* − μ_i) / KL(μ_i, μ*) · log(T) + C,

where C is a constant depending on ε, the number of arms K, and the expected rewards μ_1, ..., μ_K.

Sketch of proof.

(The complete version of the proof is reported in the appendices.) At first, we remark that a straightforward application of the proof provided for OSUB is not possible in the case of UTS. Indeed, the use of frequentist upper bounds over the expected reward in OSUB implies that, in finite time and with high probability, the bounds are ordered as the expected values. Since we are using a Bayesian algorithm, we would require the same guarantee over the Thompson samples θ_i(t), but we do not have a direct bound over the samples of the arm which is optimal within the neighborhood of the current leader. This fact requires following a completely different strategy when we analyze the case in which the leader is not the optimal arm.

The regret of the UTS algorithm can be divided into two parts: the regret accumulated during the rounds in which the optimal arm i* is the leader, which we call R_T^{i*}, and the summation over the sub-optimal arms i ≠ i* of the regrets R_T^{i} accumulated during the rounds in which the leader is arm i. R_T^{i*} is obtained when i* is the leader; in these rounds the UTS algorithm behaves like Thompson Sampling restricted to the optimal arm and its neighborhood N_{i*}, and the regret upper bound is the one derived in [Kaufmann, Korda, and Munos2012] for the TS algorithm.

R_T^{i} is upper bounded by the expected number of rounds E[l_i(T)] in which arm i has been selected as leader over the horizon T. Let us consider l'_i(T), defined as the number of rounds spent with i as leader when restricting the problem to its neighborhood N_i. E[l'_i(T)] is an upper bound over E[l_i(T)], since there is a nonzero probability that the UTS algorithm moves to another neighborhood. Since i ≠ i* and the setting is unimodal, there exists an optimal arm i' among those in the neighborhood N_i such that (i, i') ∈ E and μ_{i'} > μ_i. Thus, E[l'_i(T)] can be bounded in terms of the expected loss incurred in choosing i instead of its best adjacent arm i'.

The resulting quantity can be upper bounded by a constant by relying on the definition of conditional probability and on the Hoeffding inequality [Hoeffding1963]. Specifically, we rely on the fact that the leader is pulled in at least a fixed fraction of the rounds in which it is the leader. Upper bounding the remaining term by a constant requires the use of Proposition 1 in [Kaufmann, Korda, and Munos2012], which limits the expected number of rounds in which TS has pulled the optimal arm fewer than a polynomial number of times, together with a technique analogous to the one used for the previous term. Summing up the regret over the sub-optimal arms and considering the three obtained bounds concludes the proof. ∎
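
Schematically, and using the notation introduced above, the decomposition underlying the sketch can be written as follows (this is our restatement of the argument, not an equation taken verbatim from the paper):

R_T(UTS) = R_T^{i*} + Σ_{i ≠ i*} R_T^{i},   with   R_T^{i} ≤ E[l_i(T)] ≤ E[l'_i(T)]   for every i ≠ i*,

where R_T^{i*} is bounded by the TS analysis of [Kaufmann, Korda, and Munos2012] restricted to N_{i*} and each E[l'_i(T)] is bounded by a constant, so that only the neighborhood of the optimal arm contributes to the logarithmic term of Theorem 2.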

Table 1: Results for the regret ratio w.r.t. OSUB of KLUCB, TS, and UTS in the line-graph settings.

Experimental Evaluation

In this section, we compare the empirical performance of the proposed UTS algorithm with that of a number of other algorithms. We study the performance of the state-of-the-art algorithm OSUB [Combes and Proutiere2014a] to evaluate the improvement of Bayesian approaches over frequentist ones. Furthermore, we study the performance of TS [Thompson1933] to evaluate the improvement, within Bayesian approaches, due to the exploitation of the problem structure. For completeness, we also study the performance of KLUCB [Garivier and Cappé2011], a frequentist algorithm that is optimal for Bernoulli distributions.

Figures of merit

Given a policy U, we evaluate the average and the 95%-confidence intervals of the following figures of merit:

  • the pseudo-regret R_T(U) as defined in Equation 1; the lower the regret, the better the performance;

  • the regret ratio between the total regret of policy U after T rounds and the one obtained with another policy U'; the lower the ratio, the larger the relative improvement of U w.r.t. U'.

Line graphs

We initially consider the same experimental settings, composed of line graphs, studied in [Combes and Proutiere2014a]. In these settings, the arms are ordered on a line from the arm with the smallest index to the arm with the largest index, and the Bernoulli rewards have averages with a triangular shape: the minimum average reward is associated with the two arms at the ends of the line, the maximum average reward is associated with the arm in the middle of the line, and the averages decrease linearly from the maximum to the minimum. Two settings, differing in the number of arms, are considered.
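
The following Python sketch builds such a line-graph instance. The number of arms and the minimum and maximum average rewards are placeholder values of our own; the exact values used in [Combes and Proutiere2014a] are not reproduced here.

import numpy as np

def triangular_line_instance(n_arms=17, mu_min=0.1, mu_max=0.9):
    # Line graph: arm i is connected to arms i-1 and i+1 (when they exist)
    graph = {i: [j for j in (i - 1, i + 1) if 0 <= j < n_arms] for i in range(n_arms)}
    # Triangular expected rewards: linear increase up to the middle arm, then linear decrease
    mid = n_arms // 2
    means = np.empty(n_arms)
    means[:mid + 1] = np.linspace(mu_min, mu_max, mid + 1)
    means[mid:] = np.linspace(mu_max, mu_min, n_arms - mid)
    return graph, {i: float(means[i]) for i in range(n_arms)}

Such an instance can be passed directly to the uts sketch given earlier.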

For both experiments, we average the regret over independent trials. We report the pseudo-regret of each policy as the number of rounds varies in Fig. 1(a) and in Fig. 1(b), for the two considered numbers of arms. The UTS algorithm outperforms all the other algorithms along the whole time horizon, providing a significant improvement in terms of regret w.r.t. the state-of-the-art algorithms. In order to have a more precise evaluation of the reduction of the regret w.r.t. the OSUB algorithm, we report the regret ratios in Tab. 1. As also confirmed below by a more exhaustive series of experiments, in line graphs the relative improvement of UTS w.r.t. OSUB reduces as the number of arms increases, while the relative improvement of UTS w.r.t. TS increases as the number of arms increases.

Figure 1: Results for the pseudo-regret in the line-graph settings defined in [Combes and Proutiere2014a], for the two considered numbers of arms (panels (a) and (b)).

Erdős-Rényi graphs

To provide a thorough experimental evaluation of the considered algorithms in settings in which the space of arms has a graph structure, we generate graphs using the model proposed by Erdős and Rényi [Erdős and Rényi1959], which allows us to simulate graph structures more complex than a simple line. An Erdős-Rényi graph is generated by connecting nodes randomly: each edge is included in the graph with probability p, independently of the other edges. We consider connected graphs with a varying number of arms K and a varying edge probability p: the value p = 1 corresponds to a fully connected graph, in which the graph structure is useless; an intermediate value corresponds to a number of edges that increases linearly in the number of nodes; a smaller value corresponds to a few edges w.r.t. the nodes; and we also consider line graphs (these line graphs differ from those used in the experimental evaluation discussed above for the reward function, as discussed in what follows). We use different values of p in order to see how the performance of UTS changes w.r.t. the number of edges in the graph; we remark that such an analysis is unexplored in the literature so far. The optimal arm is chosen randomly among the existing arms and its reward is given by a Bernoulli distribution with the maximum expected value. The rewards of the sub-optimal arms are given by Bernoulli distributions with expected values depending on their distance from the optimal one. More precisely, let d_i be the length of the shortest path from the i-th arm to the optimal arm and let D = max_i d_i be the maximum shortest-path length of the graph. The expected reward of the i-th arm decreases linearly in d_i, i.e., the arm with d_i = D has the minimum expected value and the expected rewards of the arms along the path from it to the optimal arm are evenly spaced between the minimum and the maximum. We generate several different graphs for each combination of K and p and we run independent trials for each graph. We average the regret over the results of the generated graphs.
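
A minimal Python sketch of this generation procedure is reported below. It relies on networkx for the graph and shortest-path computations, and the specific maximum and minimum expected rewards are placeholder assumptions of ours, not the values used in the paper.

import networkx as nx
import numpy as np

def erdos_renyi_umab(n_arms, p, mu_max=0.9, mu_min=0.1, seed=None):
    rng = np.random.default_rng(seed)
    # Resample until the Erdos-Renyi graph is connected (assumes p is large enough
    # for connected graphs to appear with reasonable probability)
    while True:
        g = nx.erdos_renyi_graph(n_arms, p, seed=int(rng.integers(1 << 30)))
        if nx.is_connected(g):
            break
    optimal = int(rng.integers(n_arms))                 # optimal arm chosen at random
    dist = nx.shortest_path_length(g, source=optimal)   # hop distance from the optimal arm
    diameter = max(dist.values())                       # maximum shortest-path length D
    # Expected rewards evenly spaced between mu_min (at distance D) and mu_max (optimal arm)
    means = {i: mu_max - (dist[i] / diameter) * (mu_max - mu_min) for i in g.nodes}
    graph = {i: list(g.neighbors(i)) for i in g.nodes}
    return graph, means, optimal

By construction, the expected rewards strictly increase along any shortest path toward the optimal arm, so the generated instance satisfies the unimodality property of Definition 1.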

In Tab. 2, we report the regret for each combination of policy, number of arms K, and edge probability p. It can be observed that the UTS algorithm outperforms all the other algorithms, providing the smallest regret in every case except for a single combination of K and p. Below, we discuss how the relative performance of the algorithms varies as the values of the parameters K and p vary.

Table 2: Results for the regret of KLUCB, TS, OSUB, and UTS in the setting with Erdős-Rényi graphs, for each considered combination of K and p.

Consider the case with p = 1. The performances of UTS and TS are approximately equal, and the same holds for those of OSUB and KLUCB. This is due to the fact that the neighborhood of each node is composed of all the arms, the graphs being fully connected, and therefore UTS and OSUB cannot take any advantage from the structure of the problem. We notice, however, that UTS and TS do not behave exactly in the same way and that UTS always performs slightly better than TS. It can be observed in Fig. 2 that the relative improvement is mainly at the beginning of the time horizon and that it goes to zero as the number of rounds increases (the same holds for OSUB w.r.t. KLUCB). The reason behind this behavior is that UTS reduces the exploration performed by TS in the first rounds, forcing the algorithm to pull the leader, chosen as the arm maximizing the empirical mean, for a larger number of rounds.

Figure 2: Results for the pseudo-regret in the fully connected setting (p = 1).

Consider the case with an intermediate value of p. In the considered experimental setting, the relative performance of the algorithms does not depend on K. The ordering, from the best to the worst, of the performance of the algorithms is: UTS, TS, OSUB, and finally KLUCB. Surprisingly, even the dependence on K of the regret ratios between pairs of algorithms is negligible. This shows that the relative improvement due to UTS w.r.t. TS and OSUB is constant as K varies. These results raise the question of whether the relative performance of OSUB and TS would be the same, except for the numerical values, for every choice of p kept fixed w.r.t. K. To answer this question, we consider a case in which the number of edges is still linear in the number of nodes but smaller than in the previous case. The results, reported in Table 3, show that OSUB outperforms TS for some values of K, suggesting that, when the edge density is kept fixed as K varies, OSUB may or may not outperform TS depending on the specific pair (K, p).

Table 3: Results for the regret of TS and OSUB in the setting with Erdős-Rényi graphs and a smaller (but still linear in K) number of edges.

Consider the case with a small value of p (few edges w.r.t. the number of nodes). The ordering over the performance of the algorithms changes as K varies. More precisely, while UTS remains the best algorithm for every K and KLUCB the worst algorithm for every K, the ordering between TS and OSUB changes: for a small number of arms, TS performs better than OSUB, whereas for a large number of arms, OSUB outperforms TS, see Fig. 3. This is due to the fact that, with a small number of arms, exploiting the graph structure is not sufficient for a frequentist algorithm to outperform TS, while with many arms exploiting the graph structure, even with a frequentist algorithm, is much better than employing a general-purpose Bayesian algorithm. The ratio between the regret of UTS and that of TS monotonically decreases as K increases, suggesting that exploiting the graph structure provides larger advantages as K increases. Instead, the ratio between the regret of UTS and that of OSUB monotonically increases as K increases, suggesting that the improvement provided by employing Bayesian approaches reduces as K increases, as observed above in line graphs.

Consider the case of line graphs. As in the case discussed above, OSUB is outperformed by TS for a small number of arms, while it outperforms TS for many arms; the reason is the same as above. Similarly, the ratio between the regret of UTS and that of TS monotonically decreases as K increases, and the ratio between the regret of UTS and that of OSUB monotonically increases as K increases. This confirms that the performance of UTS and that of OSUB asymptotically match as K increases in line graphs (as well as in the sparse graphs discussed above). In order to investigate the reasons behind such a behavior, we produce an additional experiment with the line graphs of Combes and Proutiere [Combes and Proutiere2014a], except that the maximum expected reward is modified so that, for any edge, the expected rewards of its two terminals are very close. What we observe (details of these experiments and those described below are in the appendices) is that, on average, OSUB outperforms UTS for a small number of arms, suggesting that, when it is necessary to repeatedly distinguish between three arms that have very similar expected rewards, frequentist methods may outperform the Bayesian ones. This is no longer true when the number of arms is much larger, where UTS outperforms OSUB (interestingly, differently from what happens in the other topologies, in line graphs with a very small number of arms the average regrets of UTS and OSUB cross a number of times during the time horizon). Furthermore, we evaluate how the relative performance of OSUB w.r.t. UTS varies for intermediate numbers of arms, observing that it improves as the number of arms decreases. Finally, we evaluate whether this behavior also emerges in Erdős-Rényi graphs in which the expected degree is kept constant, and we observe that UTS outperforms OSUB, suggesting that line graphs with a very small number of arms are pathological instances for UTS.

Figure 3: Results for the pseudo-regret in the Erdős-Rényi setting with few edges, for two different numbers of arms (panels (a) and (b)).

Conclusions and Future Work

In this paper, we focus on the Unimodal Multi–Armed Bandit problem with graph structure in which each arm corresponds to a node of a graph and each edge is associated with a relationship in terms of expected reward between its arms. We propose, to the best of our knowledge, the first Bayesian algorithm for the UMAB setting, called UTS, which is based on the well–known Thompson Sampling algorithm. We derive a tight upper bound for UTS that asymptotically matches the lower bound for the UMAB setting, providing a non-trivial derivation of the bound. Furthermore, we present a thorough experimental analysis showing that our algorithm outperforms the state–of–the–art methods.

In future work, we will evaluate the performance of the algorithms considered in this paper on other classes of graphs, e.g., Barabási–Albert graphs and lattices. Future developments of this work may consider an analysis of the proposed algorithm in time-varying environments, i.e., environments in which the expected reward of each arm varies over time, assuming that the unimodal structure is preserved. Another interesting study may consider the case of a continuous decision space.

References

  • [Alon et al.2013] Alon, N.; Cesa-Bianchi, N.; Gentile, C.; and Mansour, Y. 2013. From bandits to experts: A tale of domination and independence. In Advances in Neural Information Processing Systems, 1610–1618.
  • [Auer, Cesa-Bianchi, and Fischer2002] Auer, P.; Cesa-Bianchi, N.; and Fischer, P. 2002. Finite-time analysis of the multiarmed bandit problem. Machine learning 47(2-3):235–256.
  • [Caron and Bhagat2013] Caron, S., and Bhagat, S. 2013. Mixing bandits: A recipe for improved cold-start recommendations in a social network. In Proceedings of the 7th Workshop on Social Network Mining and Analysis,  11. ACM.
  • [Chapelle and Li2011] Chapelle, O., and Li, L. 2011. An empirical evaluation of thompson sampling. In Advances in neural information processing systems, 2249–2257.
  • [Combes and Proutiere2014a] Combes, R., and Proutiere, A. 2014a. Unimodal bandits: Regret lower bounds and optimal algorithms. In ICML, 521–529.
  • [Combes and Proutiere2014b] Combes, R., and Proutiere, A. 2014b. Unimodal bandits without smoothness. arXiv preprint arXiv:1406.7447.
  • [Crandall et al.2008] Crandall, D.; Cosley, D.; Huttenlocher, D.; Kleinberg, J.; and Suri, S. 2008. Feedback effects between similarity and social influence in online communities. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, 160–168. ACM.
  • [Edelman and Ostrovsky2007] Edelman, B., and Ostrovsky, M. 2007. Strategic bidder behavior in sponsored search auctions. Decision support systems 43(1):192–198.
  • [Erdős and Rényi1959] Erdős, P., and Rényi, A. 1959. On random graphs i. Publ. Math. Debrecen 6:290–297.
  • [Garivier and Cappé2011] Garivier, A., and Cappé, O. 2011. The kl-ucb algorithm for bounded stochastic bandits and beyond. In COLT, 359–376.
  • [Hoeffding1963] Hoeffding, W. 1963. Probability inequalities for sums of bounded random variables. Journal of the American statistical association 58(301):13–30.
  • [Jia and Mannor2011] Jia, Y. Y., and Mannor, S. 2011. Unimodal bandits. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), 41–48.
  • [Kaufmann, Korda, and Munos2012] Kaufmann, E.; Korda, N.; and Munos, R. 2012. Thompson sampling: An asymptotically optimal finite-time analysis. In ALT, volume 7568 of Lecture Notes in Computer Science, 199–213. Springer.
  • [Kleinberg, Slivkins, and Upfal2008] Kleinberg, R.; Slivkins, A.; and Upfal, E. 2008. Multi-armed bandits in metric spaces. In Proceedings of the fortieth annual ACM symposium on Theory of computing, 681–690. ACM.
  • [Lai and Robbins1985] Lai, T. L., and Robbins, H. 1985. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics 6(1):4–22.
  • [Mannor and Shamir2011] Mannor, S., and Shamir, O. 2011. From bandits to experts: On the value of side-observations. In NIPS. 684–692.
  • [Mas-Collel, Whinston, and Green1995] Mas-Collel, A.; Whinston, M. D.; and Green, J. R. 1995. Microeconomic theory.
  • [McPherson, Smith-Lovin, and Cook2001] McPherson, M.; Smith-Lovin, L.; and Cook, J. M. 2001. Birds of a feather: Homophily in social networks. Annual review of sociology 415–444.
  • [Thompson1933] Thompson, W. R. 1933. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3/4):285–294.
  • [Trovò et al.2015] Trovò, F.; Paladino, S.; Restelli, M.; and Gatti, N. 2015. Multi-armed bandit for pricing. In 12th European Workshop on Reinforcement Learning (EWRL). https://ewrl.wordpress.com/past-ewrl/ewrl12-2015/.
  • [Valko et al.2014] Valko, M.; Munos, R.; Kveton, B.; and Kocak, T. 2014. Spectral bandits for smooth graph functions. In Proceedings of The 31st International Conference on Machine Learning, ICML, 46–54.
  • [Xia et al.2015] Xia, Y.; Li, H.; Qin, T.; Yu, N.; and Liu, T.-Y. 2015. Thompson sampling for budgeted multi-armed bandits. In Twenty-Fourth International Joint Conference on Artificial Intelligence.