Several important real-world problems, in fields ranging from economics to engineering and computer science, involve multiple interactions of self-interested agents with coupled objectives. They can be modeled as repeated games and have received recent attention due to their connection with learning (e.g., cesa-bianchi_prediction_2006).
An important line of research has focused, on the one hand, on characterizing game-theoretic equilibria and their efficiency and, on the other hand, on deriving fast learning algorithms that converge to equilibria and efficient outcomes. Most of these results, however, are based on the assumption that the players always face the exact same game, repeated over time. While this leads to strong theoretical guarantees, it is often unrealistic in practical scenarios: in routing games roughgarden2007, for instance, the agents’ travel times and hence the ‘rules’ of the game are governed by many time-varying factors such as network capacities, weather conditions, etc. Often, players can observe such factors, and hence could make better decisions depending on the circumstances.
Motivated by these considerations, we introduce the new class of contextual games. Contextual games define a more general class of repeated games described by different contextual, or side, information at each round, denoted also as contexts in analogy with the bandit optimization literature (e.g., langford2008). Importantly, in contextual games players can observe the current context before playing an action, which allows them to achieve better performance, and converge to stronger notions of equilibria and efficiency than in standard repeated games.
Related work. Learning in repeated static games has been extensively studied in the literature. The seminal works hannan1957; hart2000 show that simple no-regret strategies for the players converge to the set of Coarse Correlated Equilibria (CCEs), while the efficiency of such equilibria and learning dynamics has been studied in blum2008; roughgarden2015. Exploiting the static game structure, moreover, syrgkanis2015; foster2016 propose faster learning algorithms, and a long array of works (e.g., Singh2000; Bowling2005; Balduzzi2018) study convergence to Nash equilibria. Learning in time-varying games, instead, has been recently considered in duvocelle2018, where the authors show that dynamic regret minimization allows players to track the sequence of Nash equilibria, provided that the stage games are monotone and slowly-varying. Adversarially changing zero-sum games have also been studied in cardoso19a, with convergence guarantees to the Nash equilibrium of the time-averaged game. Our contextual games model is fundamentally different from duvocelle2018; cardoso19a in that we assume players observe the current context (and hence have prior information on the game) before playing. This leads to new equilibria and a different performance benchmark, denoted as contextual regret, described by the best policies mapping contexts to actions. Perhaps closer to ours is the setup of stochastic (or Markov) games Shapley1095, at the core of multi-agent reinforcement learning (see Busoniu2010 for an overview). There, players observe the state of the game before playing but, differently from our setup, the evolution of the state depends on the actions chosen at each round. This leads to a nested game structure, which requires significant computational power and players’ coordination to compute equilibrium strategies via backward induction Greenwald2003; Dermed2009. Instead, we consider arbitrary contexts’ sequences (potentially chosen by an adversarial Nature) and show that efficient algorithms converge to our equilibria in a decentralized fashion.
From a single player’s perspective, a contextual game can be reduced to a special adversarial contextual bandit problem (bubeck2012, Chapter 4), for which several no-regret algorithms exist. All such algorithms, however, rely on high-variance estimates for the rewards of non-played actions and thus their performance degrades with the number of actions available. A fact not exploited by these algorithms is that in a contextual game similar contexts and game outcomes likely produce similar rewards (e.g., in a routing game, similar network capacities and occupancy profiles lead to similar travel times). We encode this fact using kernel-based regularity assumptions and (similarly to sessa2019noregret in non-contextual games) show that exploiting these assumptions, and additionally observing the past opponents’ actions, players can achieve substantially improved performance compared to using standard bandit algorithms. For instance, for actions and adversarially chosen contexts from a finite set , the bandit S-Exp3 bubeck2012 incurs contextual regret, while our approach leads to a guarantee, where is a sample-complexity parameter describing the degrees of freedom in the player’s reward function. For commonly used kernels, this results in a sublinear regret bound that grows only logarithmically in . Moreover, when contexts are stochastic and private to a player, we obtain a pseudo-regret bound. This bound should be compared to the pseudo-regret of balseiro2019 which, unlike our approach, assumes observing the rewards for non-revealed contexts, and the pseudo-regret of neu2020, which assumes a known contexts distribution and a linear dependence of rewards on contexts in .
Contributions. We formulate the novel class of contextual games, a type of repeated games characterized by (potentially) different contextual information available at each round.
We identify the contextual regret as a natural benchmark for players’ individual performance, and propose novel online algorithms to play contextual games with no-regret. Unlike existing contextual bandit algorithms, our algorithms exploit the correlation between different game outcomes, modeled via kernel-based regularity assumptions, yielding improved performance.
We characterize equilibria and efficiency of contextual games, defining the new notions of contextual Coarse Correlated Equilibria (c-CCE) and optimal contextual welfare. We show that c-CCEs and contextual welfare can be approached in a decentralized fashion whenever players minimize their contextual regrets, thus recovering important game-theoretic results for our larger class of games.
We demonstrate our results in a repeated traffic routing application. Our algorithms effectively use the available contextual information (network capacities) to minimize agents’ travel times and converge to more efficient outcomes compared to other baselines that do not exploit the observed contexts and/or the correlations present in the game.
2 Problem Setup
We consider repeated interactions among agents, or players. At every round, each player selects an action and receives a payoff that depends on the actions chosen by all the players as well as the context of the game at that round. More formally, we let represent the (potentially infinite) set of possible contexts, and be the set of actions available to player . Then, we define to be the reward function of each player , where is the joint action space. Importantly, we assume is unknown to player . With the introduced notation, a repeated contextual game proceeds as follows. At every round :
Nature reveals context
Players observe and, based on it, each player selects action , for .
Players obtain rewards , .
Moreover, as specified later, player receives feedback information at the end of each round that it can use to improve its strategy. Let be the set of all policies , mapping contexts to actions. After game rounds, the performance of player is measured by the contextual regret:
The contextual regret compares the cumulative reward obtained throughout the game with the one achievable by the best fixed policy in hindsight, i.e., had player known the sequence of contexts and opponents’ actions ahead of time, as well as the reward function . Crucially, sets a stronger benchmark than competing only with the best fixed action and captures the fact that players should use the revealed context information to improve their performance. A strategy is no-regret for player if as .
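As a minimal sketch of how the contextual regret could be evaluated in hindsight, the snippet below assumes a finite action set, hashable contexts, and access to the true reward function (the names `contextual_regret`, `reward`, and the argument layout are hypothetical and introduced only for illustration). Because the policy class contains all maps from contexts to actions, the best policy in hindsight decomposes into an independent best action for each distinct observed context.

```python
from collections import defaultdict


def contextual_regret(reward, contexts, own_actions, opp_actions, actions):
    """Regret against the best context-to-action policy in hindsight.

    reward(a, b, z) is a hypothetical callable returning the player's reward
    for own action a, opponents' actions b, and (hashable) context z.
    """
    # Cumulative reward actually collected by the player.
    realized = sum(
        reward(a, b, z) for a, b, z in zip(own_actions, opp_actions, contexts)
    )
    # For every distinct context, accumulate the reward each action would
    # have obtained on the rounds where that context was revealed.
    totals = defaultdict(lambda: dict.fromkeys(actions, 0.0))
    for b, z in zip(opp_actions, contexts):
        for a in actions:
            totals[z][a] += reward(a, b, z)
    # The best policy picks, per context, the action with the largest total.
    best_policy_value = sum(max(t.values()) for t in totals.values())
    return best_policy_value - realized
```

In a toy game where the reward is 1 only when the action matches the context, always playing the same action has linear contextual regret even though it is as good as the best fixed action, which illustrates why this benchmark is stronger.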
Contextual games generalize the class of standard (non-contextual) repeated games, allowing the game to change from round to round due to a potentially different context (we recover the standard repeated games setup and regret definition by assuming for all ). In Section 4 we define new notions of equilibria and efficiency for such games and show that the contextual regret defined in (1), besides measuring individual players’ performance, has a close connection with game equilibria and efficiency. First, however, motivated by these considerations, we focus on the individual perspective of a generic player and seek to derive suitable no-regret strategies. In this regard, the algorithms and guarantees presented in the next section do not rely on the other players complying with any pre-specified rule, but consider the worst case over the opponents’ actions (also, potentially chosen as a function of the observed game data). To simplify notation, we denote player ’s reward function with , unless otherwise specified.
Even with a single known context, achieving no-regret is impossible unless we make further assumptions on the game cesa-bianchi_prediction_2006. We assume the action set is finite with .
We consider a generic set and make no assumptions on how contexts are generated (they could be adversarially chosen by Nature, possibly as a function of past game rounds). In Section 3.3, however, we consider a special case where contexts are sampled i.i.d. from a static distribution . Our next regularity assumptions concern the reward function .
Let . We assume the unknown function has a bounded norm in a Reproducing Kernel Hilbert Space (RKHS) associated with a positive-semidefinite kernel function . Kernel measures the similarity between two different context-action pairs, and the norm measures the smoothness of with respect to . This is a standard non-parametric assumption used, e.g., in Bayesian optimization srinivas2009gaussian and recently exploited to model correlations in repeated games sessa2019noregret. It encodes the fact that similar action profiles (e.g., similar network’s occupancy profiles in a routing game) lead to similar rewards (travel times), and allows player to generalize experience to non-played actions and, in our case, unseen contexts. Popularly used kernels include polynomial, Squared Exponential (SE), and Matérn kernels rasmussen2006gaussian, while composite kernels krause2011contextual can also be used to encode different dependences of on , , and context .
We assume player receives a noisy bandit observation of the reward at each round, where is -sub-Gaussian (i.e., , ) and independent over time. Moreover, we assume, similar to sessa2019noregret, that at the end of each round, player also observes the actions chosen by the other players. The latter assumption will allow player to achieve improved performance compared to the standard bandit feedback. In some applications (e.g., aggregative games such as traffic routing), it is sufficient to observe only an aggregate function of .
3 Algorithms and Guarantees
From the perspective of player , playing a contextual game corresponds to an adversarial contextual bandit problem (see, e.g., (bubeck2012, Chapter 4)) where, at each round, context is revealed, an adversary picks a reward function and player obtains reward . Therefore, player could in principle use existing adversarial contextual bandit algorithms to achieve no-regret. Such algorithms come with different regret guarantees depending on the assumptions made but, importantly, incur a regret which scales poorly with the size of the action space . This is because they use high-variance estimators to estimate the rewards of non-played actions, i.e., the so-called full-information feedback. Also, some of these algorithms assume a parametric (e.g., linear neu2020) dependence of the rewards on the context and hence cannot deal with our general game structure.
Instead, we exploit the fact that in a contextual game the rewards obtained at different times are correlated through the reward function (i.e., contextual games correspond to the specific contextual bandit problem where for all ). This fact, together with our feedback model, allows player to use past game data to obtain (with increasing confidence) an estimate of the reward function and use it to emulate the full-information feedback.
Using past game data , standard kernel ridge regression rasmussen2006gaussian can be used to compute posterior mean and corresponding variance estimates of the reward function . For any , and regularization parameter , they can be obtained as:
where , , and is the kernel matrix. Moreover, such estimates can be used to build the upper confidence bound function:
where is a confidence parameter, and the function is truncated at since for all . A standard result from srinivas2009gaussian shows that can be chosen such that
with high probability for any (see Lemma 7 in the Appendix). The function hence represents an optimistic estimate of and can be used by player to emulate the full-information feedback. We outline our (meta) algorithm c.GP-MW in Algorithm 1.
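The kernel ridge regression estimates and the resulting upper confidence bound can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact implementation behind the results: inputs are stacked as real vectors (one row per joint action and context), the kernel is a squared exponential with a hypothetical `lengthscale`, and the truncation level `upper=1.0` is an assumed reward upper bound. The function names are introduced here for illustration only.

```python
import numpy as np


def se_kernel(X, Y, lengthscale=1.0):
    """Squared exponential kernel between the rows of X and the rows of Y."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * d2 / lengthscale**2)


def krr_posterior(X_obs, y_obs, X_query, lam=1.0, kernel=se_kernel):
    """Standard kernel ridge regression mean and variance at X_query."""
    A = kernel(X_obs, X_obs) + lam * np.eye(len(X_obs))  # K_t + lambda * I
    k_q = kernel(X_obs, X_query)                         # shape (t, m)
    mean = k_q.T @ np.linalg.solve(A, y_obs)
    prior_var = np.diag(kernel(X_query, X_query))
    var = prior_var - np.einsum("ij,ij->j", k_q, np.linalg.solve(A, k_q))
    return mean, np.maximum(var, 0.0)  # clip tiny negative round-off


def ucb(mean, var, beta, upper=1.0):
    """Optimistic reward estimate, truncated at the assumed bound `upper`."""
    return np.minimum(upper, mean + beta * np.sqrt(var))
```

Evaluating `ucb` at every own action, with the observed opponents' actions and context held fixed, yields the vector of optimistic rewards that plays the role of full-information feedback.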
c.GP-MW extends and generalizes the recently proposed GP-MW algorithm sessa2019noregret for playing repeated games to the case where contextual information is available to the players and the goal is to compete with the optimal policy in hindsight. At each time step, after observing context the algorithm computes a distribution , where is the -dimensional simplex, and samples an action from it. At the same time, the algorithm uses the observed game data to construct upper confidence bound functions of the player’s rewards using (3). Such functions, together with the observed context are used to compute the distribution at each round. Note that c.GP-MW is a well-defined algorithm after we specify the rule used to compute (line 3 of Algorithm 1). We leave this rule unspecified, as we will specialize it to different settings throughout this section.
The regret bounds obtained in this section depend on the so-called maximum information gain srinivas2009gaussian about the unknown function from noisy observations, defined as:
This quantity represents a sample-complexity parameter which, importantly, for popularly used kernels does not grow with the number of actions but only with the dimension of the domain . It can be bounded analytically as, e.g., and for squared exponential and linear kernels, respectively srinivas2009gaussian. Moreover, we remark that although in the worst case grows linearly with the number of players , in many applications (such as the traffic routing game considered in Section 5) the reward function depends only on some aggregate function of the opponents’ actions and therefore is independent from the number of players in the game.
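For intuition, the information gain of a given set of inputs can be evaluated numerically. The sketch below assumes the standard log-determinant form from srinivas2009gaussian, with the regularization parameter taking the place of the noise variance; `information_gain` and `lam` are names introduced here for illustration. The maximum information gain is the maximum of this quantity over all input sets of a given size, so evaluating it on one particular set only gives a lower bound.

```python
import numpy as np


def information_gain(K, lam=1.0):
    """0.5 * log det(I + K / lam) for the kernel matrix K of a fixed input set.

    This is an assumed (standard) form; it lower-bounds the maximum
    information gain, which maximizes over all input sets of the same size.
    """
    n = len(K)
    sign, logdet = np.linalg.slogdet(np.eye(n) + K / lam)
    return 0.5 * logdet
```

Two perfectly correlated inputs (an all-ones kernel matrix) yield less information than two uncorrelated ones (an identity kernel matrix), which is the mechanism by which correlated rewards reduce the sample complexity.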
3.1 Finite (small) number of contexts
When the context set is finite, a basic version of c.GP-MW achieves a high-probability regret bound of : We simply maintain a distribution for each context and update it only when is observed, using the Multiplicative Weights (MW) method littlestone1994. We formally introduce and study such strategy in Appendix A.1. In the same setting, and with standard bandit feedback, the S-Exp3 bubeck2012 algorithm achieves regret , which has a worse dependence on the number of actions . These regret bounds, however, are appealing only when the set has low cardinality, and become vacuous otherwise. Intuitively, this is because no information is shared across contexts and each context is treated independently of the others.
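The per-context bookkeeping described above can be sketched as follows. The class `PerContextMW` is hypothetical, contexts are assumed hashable, and a constant learning rate `eta` is used for simplicity, whereas the analysis in Appendix A.1 uses learning rates that depend on how often each context has been revealed.

```python
import numpy as np


class PerContextMW:
    """One Multiplicative Weights distribution per distinct context."""

    def __init__(self, n_actions, eta):
        self.n_actions = n_actions
        self.eta = eta      # constant learning rate (simplifying assumption)
        self.cum = {}       # context -> cumulative optimistic rewards

    def distribution(self, z):
        """Action distribution for context z (uniform if z is new)."""
        s = self.cum.get(z, np.zeros(self.n_actions))
        w = np.exp(self.eta * (s - s.max()))  # shift for numerical stability
        return w / w.sum()

    def update(self, z, ucb_values):
        """Add the optimistic rewards of all actions, only for context z."""
        s = self.cum.get(z, np.zeros(self.n_actions))
        self.cum[z] = s + np.asarray(ucb_values, dtype=float)
```

Since an update for one context leaves every other context's distribution untouched, nothing learned under one context transfers to another, which is the limitation noted above.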
3.2 Exploit similarity across contexts
For large or even infinite , we want to exploit the fact that similar contexts should lead to similar performance and take this into account when computing the action distribution . We capture this fact by assuming the optimal policy in hindsight is -Lipschitz:
Moreover, we assume to obtain a scale-free regret bound, and that the reward function is -Lipschitz with respect to the decision set , i.e., , which is readily satisfied for most kernels (de2012regret, Lemma 1).
These assumptions allow player to share information across different, but similar, contexts to improve the performance. This can be done by using the online Strategy 2 to compute at each round (Line 3 in Algorithm 1).
be the uniform distribution.
Such strategy consists of building, in a greedy fashion as new contexts are revealed, an -net krauthgamer04 of the context space , similarly to the algorithm by hazan2007 for online convex optimization: At each time , either a new L1-ball centered at is created or is assigned to the closest ball. In the latter case, is computed via a MW rule using the sequence of functions for those time steps in which the revealed context belongs to such ball. Note that Strategy 2 can also be implemented recursively, by maintaining a probability distribution for each new ball and updating only the one corresponding to the ball that the current context belongs to. The radius is a tunable parameter, which can be set as follows.
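A recursive implementation of this greedy net construction might look as follows. This is a sketch under stated assumptions: contexts are real vectors compared in L1 distance, the learning rate `eta` is constant, and the class name `GreedyNetMW` and its methods are hypothetical.

```python
import numpy as np


class GreedyNetMW:
    """Greedy net over contexts with one MW distribution per ball."""

    def __init__(self, n_actions, eta, radius):
        self.n_actions, self.eta, self.radius = n_actions, eta, radius
        self.centers = []   # ball centers, in order of creation
        self.cum = []       # cumulative optimistic rewards per ball

    def _assign(self, z):
        """Index of the closest ball, or of a new ball centered at z."""
        z = np.asarray(z, dtype=float)
        if self.centers:
            dists = [np.abs(z - c).sum() for c in self.centers]  # L1 distance
            j = int(np.argmin(dists))
            if dists[j] <= self.radius:
                return j
        self.centers.append(z)
        self.cum.append(np.zeros(self.n_actions))
        return len(self.centers) - 1

    def distribution(self, z):
        """Return the ball index of z and the MW distribution of that ball."""
        j = self._assign(z)
        s = self.cum[j]
        w = np.exp(self.eta * (s - s.max()))
        return j, w / w.sum()

    def update(self, j, ucb_values):
        """Update only the ball that the current context was assigned to."""
        self.cum[j] += np.asarray(ucb_values, dtype=float)
```

A smaller radius creates more balls (less sharing, finer policies), while a larger radius shares more aggressively at the cost of treating dissimilar contexts alike.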
Fix and assume , is -Lipschitz, and is -Lipschitz in . If player plays according to c.GP-MW using Strategy 2 with , , , and , then with probability at least ,
Compared to Section 3.1, the obtained regret bound is now independent of the size of , although its sublinear dependence on degrades with the contexts’ dimension . The additive term represents the cost of learning the reward function online. Note that even if was known, and hence full-information feedback was available, the and rates obtained so far are shown optimal in their respective settings, i.e., when is finite bubeck2012 or when the discussed Lipschitz assumptions are satisfied hazan2007. An interesting future direction is to understand whether more refined bounds can be derived as a function of the contexts’ sequence using adaptive partitions as proposed by slivkins2011. In the next section we show that significantly improved guarantees are achievable when contexts are i.i.d. samples from a static distribution.
Finally, we remark that all the discussed computations are efficient, as they do not iterate over the set of policies (which has exponential size). Improved regret bounds can be obtained if this requirement is relaxed, e.g., assuming a finite pool of policies auer2003, or a value optimization oracle syrgkanis2016contextual. We believe such results are complementary to our work and can be coupled with our RKHS game assumptions.
3.3 Stochastic contexts and non-reactive opponents
In this section, we consider a special case where contexts are i.i.d. samples from a static distribution , i.e., for . Importantly, we consider the realistic case in which player neither knows nor can sample from such distribution. Moreover, we focus on the setting where the opponents’ decisions are not based on the current realization of , but can only depend on the history of the game. Examples of such a setting are games where the context represents ‘private’ information for player (e.g., in Bayesian games, can represent player ’s type hartline2015), or where is only relevant to player and hence the opponents have no reason to decide based on it.
In this case, we analyze the following strategy to compute at each round (Line 3 in Algorithm 1):
Crucially, the distribution is now computed using the whole sequence of past functions, evaluated at context , regardless of whether was observed in the past. Hence, while according to rule (4) – and most of the MW algorithms – can be updated in a recursive manner, using rule (5) such distribution is re-computed at each round after observing (this requires storing the previous functions, or re-computing them using (2) at each round; we note that strategy (5) can be implemented recursively in case the contexts’ set is known and finite). Such strategy exploits the stochastic assumption on the contexts and reduces the contextual game to a set of auxiliary games, one for each context. This idea was recently used also by balseiro2019; neu2020 in the finite and linear contextual bandit setting, respectively, while we specialize it to repeated games coupled with our RKHS assumptions.
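The re-computation performed by rule (5) can be sketched as below. We assume, for illustration only, that each stored upper confidence bound is a callable `u(a, b, z)` and that it is paired with the opponents' actions `b` observed in the same round; `rule5_distribution` and the constant learning rate `eta` are hypothetical.

```python
import numpy as np


def rule5_distribution(past_ucbs, past_opp_actions, z_t, actions, eta):
    """MW distribution using all past UCB functions, evaluated at the current
    context z_t (whether or not z_t was observed before)."""
    scores = np.array(
        [
            sum(u(a, b, z_t) for u, b in zip(past_ucbs, past_opp_actions))
            for a in actions
        ],
        dtype=float,
    )
    w = np.exp(eta * (scores - scores.max()))  # shift for numerical stability
    return w / w.sum()
```

Each call costs one evaluation per stored function and action, which is the price of not being able to update the distribution recursively.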
The next theorem provides a pseudo-regret bound for c.GP-MW when using strategy (5), i.e., we bound the quantity (expectation with respect to the contexts’ sequence and the randomization of c.GP-MW), where is the regret with respect to a generic policy . Note that the pseudo-regret is smaller than the expected contextual regret which, however, is proven to grow linearly with when is sufficiently large balseiro2019. Nevertheless, (balseiro2019, Theorem 22) shows that can be bounded assuming each context occurs sufficiently often.
Fix and assume and for all . Moreover, assume the opponents cannot observe the current context . If player plays according to c.GP-MW using strategy (5) with , and , then with probability at least ,
where expectation is with respect to both the contexts’ sequence and the randomization of c.GP-MW.
The above guarantee significantly improves upon the ones obtained in the previous sections, as it does not depend on the context space , and matches the regret of GP-MW in non-contextual games. It should be compared with the guarantee of balseiro2019 which assumes rewards for the non-revealed contexts are also observed, and the bandit pseudo-regret of neu2020 assuming a linear dependence between contexts and rewards and a known contexts distribution. Exploiting our game assumptions, c.GP-MW’s performance decreases only logarithmically with , relies on a more realistic feedback model than balseiro2019 and can deal with more complex reward structures than neu2020.
4 Game Equilibria and Efficiency
In this section, we introduce new notions of equilibria and efficiency for contextual games. We recover game-theoretic learning results hart2000; roughgarden2015 showing that equilibria and efficiency (as defined below) can be approached when players minimize their contextual regret.
4.1 Contextual Coarse Correlated Equilibria
A typical solution concept of multi-player static games is the notion of Coarse Correlated Equilibria (CCEs) (see, e.g., (roughgarden2015, Section 3.1)). CCEs include Nash equilibria and have received increased attention because of their amenability to learning: a fundamental result from hart2000 shows that CCEs can be approached by decentralized no-regret dynamics, i.e., when each player uses a no-regret algorithm. These results, however, are not applicable to contextual games, where suitable notions of equilibria should capture the fact that players can observe the current context before playing. To cope with this, we define a notion of CCEs for contextual games, denoted as contextual CCE (c-CCE).
Definition 3 (Contextual CCE).
Consider a contextual game described by contexts . Let be the set of all policies for player , and be the joint space of actions . A contextual coarse-correlated equilibrium (c-CCE) is a policy such that:
As opposed to CCEs (which are elements of ), a c-CCE is a policy from which no player has incentive to deviate looking at the time-averaged expected reward. In other words, suppose there is a trusted device that, for any context , samples a joint action from , where is a c-CCE. Then, in expectation, each player is better off complying with such device, instead of using any other . We say that is a -c-CCE if inequality (6) is satisfied up to a accuracy. Finally, we remark that c-CCEs reduce to CCEs in case for all .
Example (c-CCEs in traffic routing)
In traffic routing applications, the trusted device can be a routing system (e.g., a maps server) which, given current weather, traffic conditions, or other contextual information, decides on a route for each user. If such routes are sampled according to a c-CCE, then each user is better off complying with such device to ensure a minimum expected travel time.
The next proposition shows that, similarly to CCEs in static games, c-CCEs can be approached whenever players minimize their contextual regrets. Hence, it provides a fully decentralized and efficient scheme for computing -c-CCEs. To do so, we define the notion of empirical policy at round as follows. After game rounds, let be the set of all the distinct observed contexts. Then, the empirical policy at round is the policy such that, for each , is the empirical distribution of played actions when context was revealed, while for the unseen contexts , is an arbitrary (e.g., uniform) distribution.
Proposition 4 (Finite-time approximation of c-CCEs).
After game rounds, let ’s denote the players’ contextual regrets and be the empirical policy at round . Then, is a -c-CCE of the played contextual game with .
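The empirical policy appearing in Proposition 4 can be computed from the game history as in the following sketch, which assumes hashable contexts and represents each joint action as a tuple (the function name `empirical_policy` is hypothetical). Unseen contexts are simply absent from the returned dictionary; a caller can map them to any distribution, e.g., the uniform one.

```python
from collections import Counter, defaultdict


def empirical_policy(contexts, joint_actions):
    """Map each observed context to the empirical distribution of the joint
    actions played on the rounds where that context was revealed."""
    counts = defaultdict(Counter)
    for z, a in zip(contexts, joint_actions):
        counts[z][tuple(a)] += 1
    return {
        z: {a: n / sum(c.values()) for a, n in c.items()}
        for z, c in counts.items()
    }
```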
Proposition 4 implies that, as , if players use vanishing contextual regret algorithms (such as the ones discussed in Section 3), then the empirical policy converges to a c-CCE of the contextual game. When contexts are stochastic, i.e., for all , an alternative notion of c-CCE can be defined by considering the expected context realization. We treat this case in Appendix B.2 and prove similar finite-time and asymptotic convergence results using standard concentration arguments.
4.2 Approximate Efficiency
The efficiency of an outcome in a non-contextual game, i.e., for a fixed context , can be quantified as the distance between the social welfare (where is sometimes replaced by if the reward of the game authority is also considered), and the optimal welfare . Optimal welfare is typically not achieved as players are self-interested agents aiming at maximizing their individual rewards, instead of . Nevertheless, main results of blum2008; roughgarden2015 show that such efficiency loss can be bounded whenever players minimize their regrets and the game is -smooth, i.e., if for any pair of outcomes and , it satisfies:
Examples of smooth games are routing games with polynomial delay functions, several classes of auctions, submodular welfare games, and many more (see, e.g., syrgkanis2013; roughgarden2015; sessa2019bounding and references therein).
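For reference, the smoothness condition for reward-maximization games in roughgarden2015 can be written as below. The notation is an assumption made here for illustration: $N$ players, rewards $r^i$, social welfare $\Gamma$, and a pair of outcomes $\mathbf{a}$ and $\mathbf{a}_\star$; the symbols used in condition (7) may differ.

```latex
% (lambda, mu)-smoothness, in assumed notation: for any outcomes a and a_star,
\sum_{i=1}^{N} r^{i}\big(a_\star^{i}, a^{-i}\big)
  \;\geq\; \lambda \, \Gamma(\mathbf{a}_\star) \;-\; \mu \, \Gamma(\mathbf{a}),
\qquad \lambda, \mu \geq 0 .
```

Intuitively, the condition states that unilateral deviations toward a good outcome recover a constant fraction of its welfare, regardless of what the other players currently do.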
In contextual games, on the other hand, a different context describes the game at each round, and hence the social welfare of an outcome depends on the specific context realization. The efficiency of a contextual game can therefore be quantified by the optimal contextual welfare:
Definition 5 (Optimal contextual welfare).
Equation (8) generalizes the optimal welfare for non-contextual games, and sets the stronger benchmark of finding the policies (instead of static actions) mapping contexts to actions which maximize the time-averaged social welfare. In routing games, for instance, it corresponds to finding the best routes for the agents as a function of current traffic conditions. As shown in our experiments (Section 5), such policies can significantly reduce the network’s congestion compared to finding the best static routes. The next proposition generalizes the well-known results of roughgarden2015, showing that in a smooth contextual game such optimal welfare can be approached when players minimize their contextual regret. First, we note that a contextual game can satisfy the smoothness condition (7) for different constants , depending on the context , and hence we use the notation to highlight their dependence.
Proposition 6 (Convergence to approximate efficiency).
Let ’s be the players’ contextual regrets, and assume the game is -smooth at each time . Then,
where and .
The approximation factor (also known as Price of Total Anarchy blum2008) depends on the constants and , which in our case represent the ‘worst-case’ smoothness of the game. We remark however that game smoothness is not necessarily context-dependent, e.g., routing games (such as the one considered in the next section) are smooth regardless of the network’s size and capacities roughgarden2015.
5 Experiments - Contextual Traffic Routing Game
We consider a contextual routing game on the traffic network of Sioux-Falls, a directed graph with nodes and edges, with the same game setup as sessa2019noregret (network data and congestion model are taken from leblanc1975, while Appendix C gives a complete description of our experimental setup). There are agents in the network. Each agent wants to send units from a given origin to a given destination node in minimum time, and can choose among routes at each round. The travel time of an agent depends on the routes chosen by the other agents (if all choose the same route the network becomes highly congested) as well as the network’s capacity at round . We let represent the route chosen by agent at round , where if edge belongs to such route, and otherwise. Moreover, we let context represent the capacity of the network’s edges at round (capacities are i.i.d. samples from a fixed distribution, see Appendix C). Then, agents’ rewards can be written as , where and ’s are the edges’ travel time functions, which are unknown to the agents. Note that, strictly speaking, agent ’s reward depends only on the entries for , where is the subset of edges that agent could potentially traverse. According to our model, we assume agent observes context and, at the end of each round, the edges’ occupancies .
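To make the reward structure concrete, the following sketch computes an agent's reward as the negative total travel time along its route. It is an illustration under assumptions rather than the congestion model of Appendix C: the travel time function is the widely used BPR form with default coefficients, and `bpr_travel_time`, `agent_reward`, `free_flow`, and `demands` are hypothetical names.

```python
import numpy as np


def bpr_travel_time(occupancy, capacity, free_flow, alpha=0.15, power=4):
    """Assumed BPR-style edge travel time; grows with occupancy / capacity."""
    return free_flow * (1.0 + alpha * (occupancy / capacity) ** power)


def agent_reward(route, all_routes, demands, capacity, free_flow):
    """Negative travel time of one agent.

    route:      0/1 vector over edges for the agent of interest
    all_routes: (n_agents, n_edges) 0/1 matrix of every agent's route
    demands:    units sent by each agent
    capacity:   per-edge capacities (the context)
    """
    occupancy = (all_routes * demands[:, None]).sum(axis=0)
    edge_times = bpr_travel_time(occupancy, capacity, free_flow)
    return -float(route @ edge_times)
```

Note that the reward depends on the other agents only through the edge occupancies, so observing this aggregate is enough to evaluate the reward of any alternative route.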
We let each agent select routes according to c.GP-MW (using rule (4) or (5)) and compare its performance with the following baselines: 1) No-Learning, i.e., agents select the shortest free-flow routes at each round, 2) GP-MW sessa2019noregret which neglects the observed contexts, and 3) RobustLinExp3 neu2020 for contextual linear bandits, which is robust to model misspecification but does not exploit the correlation in the game (also, it requires knowing the contexts’ distribution). To run c.GP-MW we use the composite kernel , while for GP-MW the kernel , where are linear and polynomial kernels respectively. We set according to Theorems 1 and 2, and (theoretical values for are found to be overly conservative srinivas2009gaussian; sessa2019noregret). For rule (4) we set . Figure 1 shows the time-averaged losses (i.e., travel times scaled in and averaged over all the agents), which are inversely proportional to the game welfare, and the resulting network’s congestion (computed as in Appendix C). We observe, in line with the discussion in Section 4, that by minimizing their individual regrets the agents increase the game welfare (this is expected as routing games are smooth roughgarden2007). Moreover, when using c.GP-MW agents exploit the observed contexts and correlations, and achieve significantly more efficient outcomes and lower congestion levels compared to the other baselines. We also observe that strategy (5) outperforms strategy (4) in our experiments. This can be explained by the contexts being stochastic and by the fact that each agent is only influenced by the coordinates of the relevant edges .
We have introduced the class of contextual games, a type of repeated games described by contextual information at each round. Using kernel-based regularity assumptions, we modeled the correlation between different contexts and game outcomes, and proposed novel online algorithms that exploit such correlations to minimize the players’ contextual regret. We defined the new notions of contextual Coarse Correlated Equilibria and optimal contextual welfare and showed that these can be approached when players have vanishing contextual regret. The obtained results were validated in a traffic routing experiment, where our algorithms led to reduced travel times and more efficient outcomes compared to other baselines that do not exploit the observed contexts or the correlation present in the game.
As systems using machine learning get deployed more and more widely, these systems increasingly interact with each other. Examples range from road traffic over auctions and financial markets, to robotic systems. Understanding these interactions and their effects for individual participants and the reliability of the overall system becomes ever more important. We believe our work contributes positively to this challenge by studying principled algorithms that are efficient, while converging to suitable, and often efficient, equilibria.
This work was gratefully supported by the Swiss National Science Foundation, under the grant SNSF _, by the European Union’s ERC grant , and ETH Zürich Postdoctoral Fellowship 19-2 FEL-47.
Appendix A Supplementary Material for Section 3
The theoretical guarantees obtained in Section 3 rely on the following two main lemmas. The first lemma is from [abbasi2013online] and shows that given the previously observed rewards, contexts, and players’ actions, the reward function of player belongs (with high probability) to the interval , for a carefully chosen confidence parameter .
Let such that and consider the kernel-ridge regression mean and standard deviation estimates and , with regularization constant . Then for any , with probability at least , the following holds simultaneously over all and :
The second main lemma concerns the properties of the Multiplicative Weights (MW) update method [littlestone1994], which is used as a subroutine in our algorithms to compute the action distribution (Line 3 of Algorithm 1) at each round. Its proof follows from standard online learning arguments equivalently to, e.g., [Mourtada2019, Proposition 1].
Consider a sequence of functions and let ’s be the distributions computed using the MW rule:
where is initialized as the uniform distribution. Then, provided that is a decreasing sequence, for any action :
A.1 The case of a finite (and small) number of contexts
In this section we consider the simple case of a finite (and small cardinality) set of contexts . In such a case, a high-probability regret bound of can be achieved when c.GP-MW is run with the following strategy:
That is, is computed using the sequence of previously computed upper confidence bound functions for the game rounds in which the specific context was revealed.
Fix and assume . If player plays according to c.GP-MW using strategy (10) with , , and , then with probability at least ,
Let . Our goal is to bound with high probability.
By conditioning on the event of the confidence lemma (Lemma 7) holding true, we can state that, with probability at least ,
The first inequality follows by the definition of (see (3)), the specific choice of the confidence level , and Lemma 7. The last inequality follows by [srinivas2009gaussian, Lemma 5.4]. The rest of the proof proceeds to show that, with probability at least ,
The theorem statement then follows by a standard union bound argument.
First, by straightforward application of the Hoeffding–Azuma inequality (e.g., [cesa-bianchi_prediction_2006, Lemma A.7]), it follows that with probability at least ,
since the variables ’s form a martingale difference sequence, being the expected value of conditioned on the history and on context . Then, using (13), can be bounded, with probability , as
At this point, we can use the properties of the MW rule used to compute the distribution . Note that, for each context , the distribution computed by c.GP-MW precisely follows the MW rule (9) with the sequence of functions and the sequence of learning rates , where is the number of times context was revealed. Hence, we can apply Lemma 8 for each context and obtain: