Partial monitoring is a generalisation of the multi-armed bandit framework with an interestingly richer structure. In this paper we are concerned with the finite-action version. Let
be the number of actions. A finite-action partial monitoring game is described by two functions, the loss function and a signal function , where and and are topological spaces. At the start of the game the adversary secretly chooses a sequence of outcomes with , where is the horizon. The learner knows , and and sequentially chooses actions from . In round , after the learner chooses they suffer a loss of and observe only as a way of indirectly learning about the loss. A policy
is a function mapping action/signal sequences to probability distributions over actions (the learner is allowed to randomise) and the regret of policy in environment is
where the expectation is taken with respect to the randomness in the learner’s choices which follow . The minimax regret of a partial monitoring game is
where is the space of all policies. Our objective is to understand how the minimax regret depends on the horizon and the structure of and . Some classical examples of partial monitoring games are given in Table 1 and Fig. 5 in the appendix.
[Table 1: classical examples of partial monitoring games, among them cops and robbers and finite partial monitoring, for which the table lists three components as arbitrary.]
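The interaction protocol and the regret can be made concrete with a small sketch. It assumes finite outcome and signal sets encoded as integers, with `loss[a][x]` and `signal[a][x]` as hypothetical table representations of the loss and signal functions, and a policy given as a function from the observed history to a probability vector over actions. The expectation over the learner's randomisation is computed exactly by enumerating histories, which is only feasible for short horizons.

```python
# Illustrative sketch only: the tables and names below are assumptions,
# not objects defined in the paper.

def expected_loss(loss, signal, policy, outcomes, history=()):
    """Expected cumulative loss of `policy` from round len(history) onwards."""
    t = len(history)
    if t == len(outcomes):
        return 0.0
    x = outcomes[t]                      # outcome chosen secretly by the adversary
    total = 0.0
    for a, prob in enumerate(policy(history)):
        if prob > 0:
            # The learner suffers loss[a][x] but only observes signal[a][x].
            next_history = history + ((a, signal[a][x]),)
            future = expected_loss(loss, signal, policy, outcomes, next_history)
            total += prob * (loss[a][x] + future)
    return total


def regret(loss, signal, policy, outcomes):
    """Expected loss of the policy minus the loss of the best fixed action."""
    best_fixed = min(sum(row[x] for x in outcomes) for row in loss)
    return expected_loss(loss, signal, policy, outcomes) - best_fixed


# A two-action, two-outcome example game (hypothetical).
LOSS = [[0, 1], [1, 0]]
SIGNAL = [[0, 0], [0, 1]]   # action 0 reveals nothing, action 1 reveals the outcome


def uniform_policy(history):
    """Ignores the history and plays each action with probability 1/2."""
    return [0.5, 0.5]
```

Calling `regret(LOSS, SIGNAL, uniform_policy, [0, 0, 1])` evaluates the regret of the uniform policy against one fixed outcome sequence; the minimax regret would additionally take a supremum over outcome sequences and an infimum over policies.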
Although our primary objective is to shed light on the minimax adversarial regret, we establish our results by first proving uniform bounds on the Bayesian regret that hold for any prior. Then a new minimax theorem demonstrates the existence of an algorithm with the same minimax regret. While these methods are not constructive, we demonstrate that they lead to elegant analysis of various partial monitoring problems, and better control of the constants in the bounds.
Let be a space of probability measures on with the Borel σ-algebra. The Bayesian regret of a policy with respect to prior is
The minimax Bayesian optimal regret is
where is a space of policies so that is measurable, which we define formally in Section 3. When is clear from the context, we write in place of .
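For reference, the four quantities introduced so far can be written out as follows. This is a sketch in assumed notation: the symbols $k$, $n$, $\mathcal{L}$, $\mathcal{X}$, $\pi$, $\nu$, $\Pi$, $\Pi_M$, $\mathfrak{R}_n$ and $\mathfrak{BR}_n$ are labels chosen here for illustration and may differ from those used elsewhere in the paper.

```latex
% Assumed notation (illustrative): k actions, horizon n, loss \mathcal{L},
% outcome sequence x = (x_1, ..., x_n), policy \pi, prior \nu,
% \Pi all policies, \Pi_M the measurable policies.
\begin{align*}
\mathfrak{R}_n(\pi, x)
  &= \max_{a \in [k]} \mathbb{E}\left[\sum_{t=1}^n
     \bigl(\mathcal{L}(A_t, x_t) - \mathcal{L}(a, x_t)\bigr)\right], \\
\mathfrak{R}_n^*
  &= \inf_{\pi \in \Pi} \sup_{x \in \mathcal{X}^n} \mathfrak{R}_n(\pi, x), \\
\mathfrak{BR}_n(\pi, \nu)
  &= \int_{\mathcal{X}^n} \mathfrak{R}_n(\pi, x) \, d\nu(x), \\
\mathfrak{BR}_n^*(\nu)
  &= \inf_{\pi \in \Pi_M} \mathfrak{BR}_n(\pi, \nu).
\end{align*}
```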
Our first contribution is to generalise the machinery developed by Russo and Van Roy (2016); Russo and Van Roy (2017) and Bubeck et al. (2015). In particular, we prove a minimax theorem for finite-action partial monitoring games with no restriction on either the loss or the feedback function. The theorem establishes that the Bayesian optimal regret and minimax regret are equal: . Next, the information-theoretic machinery of Russo and Van Roy (2017) is generalised by replacing the mutual information with an expected Bregman divergence. The power of the generalisation is demonstrated by showing that for -armed adversarial bandits, which improves on the best known bounds by a factor of . The rest of the paper is focussed on applying these ideas to finite partial monitoring games. The results greatly simplify the existing analysis by sidestepping the complex localisation arguments. At the same time, unlike all prior analyses, our bounds for the class of ‘easy non-degenerate’ games do not depend on arbitrarily large game-dependent constants. Finally, for a special class of bandits with graph feedback called cops and robbers, we show that , improving on prior work by a factor of .
2 Related work
Since partial monitoring is so generic, the related literature is vast, with most work focussing on the full information setting (see Cesa-Bianchi and Lugosi (2006)) or the bandit setting (Bubeck and Cesa-Bianchi (2012); Lattimore and Szepesvári (2019)). The information-theoretic machinery that we build on was introduced by Russo and Van Roy (2016); Russo and Van Roy (2017) in the context of minimizing the Bayesian regret for stationary stochastic bandits (with varying structural assumptions). Bubeck et al. (2015) noticed the results also applied to the ‘adversarial’ Bayesian setting and applied minimax theory to prove worst-case bounds for convex bandits. Minimax theory has also been used to transfer Bayesian regret bounds to adversarial bounds. For example, Abernethy et al. (2009) explore this in the context of online convex optimisation in the full-information setting and Gravin et al. (2016) for prediction with expert advice.
The finite version of partial monitoring was introduced by Rustichini (1999), who developed Hannan consistent algorithms. The main challenge since then has been characterizing the dependence of the regret on the horizon in terms of the structure of the loss and signal functions. It is now known that all games can be classified into one of exactly four types. At the extremes are trivial and hopeless games, for which the minimax regret is zero and linear in the horizon respectively. Between these extremes there are ‘easy’ games, where the minimax regret grows as the square root of the horizon, and ‘hard’ games, for which it grows as the horizon to the power 2/3. The classification result is proven by piecing together upper and lower bounds from various papers (Cesa-Bianchi et al., 2006; Foster and Rakhlin, 2012; Antos et al., 2013; Bartók et al., 2014; Lattimore and Szepesvári, 2019). A caveat of the classification theorem is that the focus is entirely on the dependence of the minimax regret on the horizon. The leading constant is game-dependent and poorly understood. Existing bounds for easy games depend on a constant that can be arbitrarily large, even for fixed and . One of the contributions of this paper is to resolve this issue. Another disadvantage of the current partial monitoring literature, especially in the adversarial setting, is that the algorithms and analysis tend to be rather complicated. Although our results only prove the existence of an algorithm witnessing a claimed minimax bound, the Bayesian algorithm and analysis are intuitive and natural.
There is also a literature on stochastic partial monitoring, with early analysis by Bartók et al. (2011). A quite practical algorithm was proposed by Vanchinathan et al. (2014). The asymptotics have also been worked out (Komiyama et al., 2015). Although a frequentist regret bound in a stochastic setting normally implies a Bayesian regret bound, in our Bayesian setup the environments are not stationary, while all the algorithms for the stochastic case rely heavily on the assumption that the distribution of the adversary is stationary. Generalising these algorithms to the non-stationary case does not seem straightforward.
Finally, we should mention there is an alternative definition of the regret that is less harsh on the learner. For trivial, easy and hard games it is the same, but for hopeless games the regret captures the hopelessness of the task and measures the performance of the learner relative to an achievable objective. We do not consider this definition here. Readers interested in this variation can consult the papers by Rustichini (1999); Mannor and Shimkin (2003); Perchet (2011) and Mannor et al. (2014).
3 Notation and conventions
The maximum/supremum of the empty set is negative infinity. The standard basis vectors in are . The column vector of all ones is . The standard inner product is . The th coordinate of vector is . The -dimensional probability simplex is . The interior of a topological space is and its boundary is . The relative entropy between probability measures and over the same measurable space is if and otherwise, where is the natural logarithm. When
is a random variable with almost surely, then Pinsker’s inequality combined with straightforward inequalities shows that
where is the total variation distance. When , the squared Hellinger distance can be written as . Given a measure
and jointly distributed random elements and we let denote the law of and (unconventionally) we let be the conditional law of given , which satisfies . One can think of as a random probability measure over the range of that depends on . In none of our analysis do we rely on exotic spaces where such regular versions do not exist. When is discrete we let denote for . With this notation the mutual information between and is . The domain of a convex function is . The Bregman divergence with respect to convex/differentiable is . For and , . When we define where the limit is taken to the boundary in the interior. The relative entropy between categorical distributions is the Bregman divergence between and where is the unnormalised negentropy: with domain . The diameter of a convex set with respect to is .
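The identity between relative entropy and the Bregman divergence of the unnormalised negentropy is easy to check numerically. The sketch below assumes the standard forms F(p) = sum_i (p_i log p_i - p_i) and D_F(p, q) = F(p) - F(q) - <grad F(q), p - q>, and restricts attention to distributions with strictly positive entries; the function names are illustrative. It also evaluates the total variation distance so that Pinsker's inequality can be compared on the same pair.

```python
import math

# Illustrative sketch: standard textbook forms are assumed for all quantities.

def negentropy(p):
    """Unnormalised negentropy F(p) = sum_i (p_i log p_i - p_i), for p_i > 0."""
    return sum(x * math.log(x) - x for x in p)


def negentropy_grad(p):
    """Gradient of the unnormalised negentropy: dF/dp_i = log p_i."""
    return [math.log(x) for x in p]


def bregman(F, grad_F, p, q):
    """Bregman divergence D_F(p, q) = F(p) - F(q) - <grad F(q), p - q>."""
    inner = sum(g * (a - b) for g, a, b in zip(grad_F(q), p, q))
    return F(p) - F(q) - inner


def relative_entropy(p, q):
    """Relative entropy between categorical distributions p and q."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def total_variation(p, q):
    """Total variation distance between categorical distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))
```

On the simplex the linear terms cancel because both arguments sum to one, which is why the two divergences agree there.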
Probability spaces, policies and environments
The Borel σ-algebra on a topological space is . Recall that and are assumed to carry a topology, which we will use for ensuring measurability of the regret. We say more about the choice of these topologies later. We assume the signal function is -measurable and the loss function is -measurable. A policy is a function and the space of all policies is . A policy is measurable if is -measurable for all , which coincides with the usual definition of a probability kernel. The space of all measurable policies is . In general is a strict subset of . For most of the paper we work in the Bayesian framework where there is a prior probability measure on . Given a prior and a measurable policy , random elements and are defined on a common probability space . We let and . Expectations are with respect to . For we let , and . Note that is the trivial algebra. The σ-algebra and the measure are such that
The law of the adversary’s choices satisfies .
For any , the law of the actions almost surely satisfies
The existence of a probability space satisfying these properties is guaranteed by the Ionescu-Tulcea theorem (Kallenberg, 2002, Theorem 6.17). The last condition captures the important assumption that, conditioned on the observed history, is sampled independently from . In particular, it implies that and are independent under . The optimal action is . It is not hard to see that the Bayesian regret is well defined and satisfies
where . To minimise clutter, when the policy and prior are clear from the context, we abbreviate to .
4 Minimax theorem
Our first main result is a theorem that connects the minimax regret to the worst-case Bayesian regret over all finitely supported priors. The regret is well defined for any and any policy , but the Bayesian regret depends on measurability of . If is supported on a finite set , however, we can write
which does not rely on measurability. By considering finitely supported priors we free ourselves from any concern that might not be measurable. This also means that if (or ) came with some topologies, we simply replace them with the discrete topology (which makes all maps continuous and measurable, implying ).
Let be the space of all finitely supported probability measures on . Then
An equivalent statement of this theorem is that if and carry the discrete topology then , which is the form we prove in Appendix D. The strength of this result is that it depends on no assumptions except that the action set is finite.
Our proof borrows techniques from a related result by Bubeck et al. (2015). The main idea is to replace the policy space with a simpler space of ‘mixtures’ over deterministic policies, which is related to Kuhn’s celebrated result on the equivalence of behavioral and mixed strategies (Kuhn, 1953). We then establish that this space is compact and use Sion’s theorem to exchange the minimum and maximum. While we borrowed the ideas from Bubeck et al. (2015), our proof relies heavily on the finiteness of the action space, which allowed us to avoid any assumptions on and , which also necessitated our choice of . Neither of the two results implies the other.
The result of Section 4 is a minimax theorem for a special kind of multistage zero-sum stochastic partial information game. Minimax theorems for this case are nontrivial because of challenges related to measurability and the use of Sion’s theorem. Although there is a rich and sophisticated literature on this topic, we are not aware of any result implying our theorem. Tools include the approach we took using the weak topology (Bernhard, 1992), the so-called weak-strong topology (Leao et al., 2000), and reduction to completely observable games followed by dynamic programming (Ghosh et al., 2004). An interesting challenge is to extend our result to compact action spaces. One may hope to generalise the proof by Bubeck et al. (2015), but some important details are missing (for example, the measurable space on which the priors live is undefined, the measurability of the regret is unclear as is the compactness of distributions induced by measurable policies). We believe that the approach of Ghosh et al. (2004) can complete this result.
5 The regret information tradeoff
Unless otherwise mentioned, all expectations are with respect to the probability measure over interactions between a fixed policy and an environment sampled from a prior on . Before our generalisation we present a restatement of the core theorem in the analysis by Russo and Van Roy (2016). Let be the mutual information between and under . Although the proof is identical, the setup here is different because the prior is arbitrary.
[Russo and Van Roy (2016)] Suppose there exists a constant such that almost surely for all . Then .
This elegant result provides a bound on the regret in terms of the information gain about the optimal arm. Our generalisation replaces the information gain with an expected Bregman divergence. Let
, which is the posterior probability that based on the information available at the start of round .
Let be a differentiable convex function such that and . Suppose there exist constants such that
almost surely for all . Then .
Let with .
where the inequality follows from Fatou’s lemma. The second equality follows from the tower rule, . The last equality follows from the convexity of the potential , implying . Hence
where the first inequality follows from the assumption in the theorem, the second from Cauchy-Schwarz and the third from Eq. 3, telescoping and the definition of the diameter.
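The three steps can be laid out as a single chain. The sketch below relies on assumed notation: $\Delta_t$ for the regret incurred in round $t$, $\mathbb{E}_{t-1}$ for the expectation conditioned on the history before round $t$, $P_t$ for the posterior over the optimal action at the start of round $t$, $D_F$ for the Bregman divergence, and $\alpha, \beta \ge 0$ for the two constants of the theorem, which are taken to enter the hypothesis as $\mathbb{E}_{t-1}[\Delta_t] \le \alpha + \sqrt{\beta\, \mathbb{E}_{t-1}[D_F(P_{t+1}, P_t)]}$. If the hypothesis has a different form, the chain changes accordingly.

```latex
% Assumed notation and assumed form of the hypothesis (illustrative only).
\begin{align*}
\mathfrak{BR}_n
  &= \mathbb{E}\left[\sum_{t=1}^n \mathbb{E}_{t-1}[\Delta_t]\right] \\
  &\le n\alpha + \mathbb{E}\left[\sum_{t=1}^n
       \sqrt{\beta\, \mathbb{E}_{t-1}\bigl[D_F(P_{t+1}, P_t)\bigr]}\right]
     && \text{(hypothesis of the theorem)} \\
  &\le n\alpha + \sqrt{\beta n\, \mathbb{E}\left[\sum_{t=1}^n
       \mathbb{E}_{t-1}\bigl[D_F(P_{t+1}, P_t)\bigr]\right]}
     && \text{(Cauchy-Schwarz)} \\
  &\le n\alpha + \sqrt{\beta n\, \operatorname{diam}_F(\Delta^{k-1})}
     && \text{(telescoping and the diameter).}
\end{align*}
```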
6 Finite-armed bandits
In the bandit setting the learner observes the loss of the action they play, which is modelled by choosing , and . The best known bound in this setting is , which holds for online mirror descent with an appropriate regularizer (Audibert and Bubeck, 2010; Lattimore and Szepesvári, 2019). Combining the results of Sections 4 and 5 with the calculation in Proposition 3 by Russo and Van Roy (2016) leads to the bound . Here we demonstrate the power of the generalised theorem of Section 5 by proving that . We use the same potential as Audibert and Bubeck (2010), which is with domain , which is convex and has .
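As a small numerical illustration of the diameter of such a potential, the sketch below assumes that the potential has the form F(p) = -2 sum_a sqrt(p_a) and that the diameter is the largest difference F(p) - F(q) over the simplex. Both are assumptions made for this example, and the function names are hypothetical. Under these assumptions, convexity and symmetry place the maximum at a vertex and the minimum at the uniform distribution, so the diameter equals 2(sqrt(k) - 1).

```python
import math
import random

# Illustrative sketch: the form of the potential is an assumption.

def potential(p):
    """Assumed potential F(p) = -2 * sum_a sqrt(p_a) on the nonnegative orthant."""
    return -2.0 * sum(math.sqrt(x) for x in p)


def diameter_on_simplex(k):
    """F(vertex) - F(uniform), the largest difference of F over the simplex."""
    vertex = [1.0] + [0.0] * (k - 1)
    uniform = [1.0 / k] * k
    return potential(vertex) - potential(uniform)


def random_simplex_point(k, rng):
    """Draw a point of the probability simplex by normalising exponentials."""
    weights = [rng.expovariate(1.0) for _ in range(k)]
    total = sum(weights)
    return [w / total for w in weights]
```

Sampling pairs with `random_simplex_point` and comparing `potential(p) - potential(q)` against `diameter_on_simplex(k)` gives a quick sanity check that no pair exceeds the closed form.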
The minimax regret for -armed adversarial bandits satisfies .
The proof follows by combining the theorems of Sections 4 and 5 with the lemma below for Thompson sampling, which is the policy that samples from .
For Thompson sampling: a.s.
Eq. 4 is true because the total variation distance is upper bounded by the Hellinger distance. Eq. 5 uses Cauchy-Schwarz and the fact that , which also follows from Cauchy-Schwarz. Eq. 6 follows from Bayes’ law and because by the definition of the algorithm. There is no difficulty with Bayes’ law here because both and live in Polish spaces (Ghosal and van der Vaart, 2017). Eq. 7 follows by introducing the sum over . The result follows because and a direct computation using the independence of and under (Lemma A in Appendix A).
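To make the policy concrete, here is a minimal sketch of Thompson sampling for a finite-armed bandit under a finitely supported prior. It makes several assumptions that are not taken from the text: the prior is a list of `(weight, environment)` pairs, an environment is a sequence of per-round loss vectors, feedback is the loss of the played arm, ties for the optimal arm are broken by lowest index, and the true environment lies in the support of the prior. All names are hypothetical.

```python
import random

# Illustrative sketch under the assumptions stated in the lead-in.

def optimal_action(env):
    """Arm with the smallest cumulative loss in hindsight (lowest index on ties)."""
    k = len(env[0])
    totals = [sum(losses[a] for losses in env) for a in range(k)]
    return min(range(k), key=totals.__getitem__)


def posterior_over_optimal_action(prior, history):
    """Posterior probability that each arm is optimal, given (arm, loss) pairs."""
    k = len(prior[0][1][0])
    weights = [0.0] * k
    for weight, env in prior:
        # Keep only environments consistent with every observation so far.
        if all(env[t][a] == obs for t, (a, obs) in enumerate(history)):
            weights[optimal_action(env)] += weight
    total = sum(weights)   # positive when the true environment is in the support
    return [w / total for w in weights]


def thompson_sampling(prior, true_env, rng):
    """Play each round by sampling an arm from the posterior over the optimal arm."""
    history = []
    for t in range(len(true_env)):
        posterior = posterior_over_optimal_action(prior, history)
        arm = rng.choices(range(len(posterior)), weights=posterior)[0]
        history.append((arm, true_env[t][arm]))   # bandit feedback
    return history


# Two hypothetical environments over two rounds and two arms.
ENV0 = ((0, 1), (0, 1))   # arm 0 is optimal
ENV1 = ((1, 0), (1, 0))   # arm 1 is optimal
PRIOR = [(0.25, ENV0), (0.75, ENV1)]
```

In this example a single observation identifies the environment, so from the second round onwards the sampled arm coincides with the optimal arm.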
Potentials other than the negentropy have been used in many applications in bandits and online convex optimisation. The log barrier, for example, leads to first order bounds for -armed bandits (Wei and Luo, 2018). Alternative potentials also appear in the context of adversarial linear bandits (Bubeck et al., 2012, 2018) and follow the perturbed leader (Abernethy et al., 2014). Investigating the extent to which these applications transfer to the Bayesian setting is an interesting direction for the future.
7 Finite partial monitoring games
Recall from Table 1 that a finite partial monitoring game is characterised by functions and where is a natural number and is arbitrary. Finite partial monitoring enjoys a rich linear structure, which we now summarise. A picture can help in absorbing these concepts, and one is provided with an example at the beginning of Appendix H. For , let be the vector with . Actions and are duplicates if . The cell associated with action is , which is the subset of distributions where action minimises . Note that is a closed convex polytope and its dimension is defined as the dimension of the affine space it generates. An action is called Pareto optimal if has dimension and degenerate otherwise. Of course , but cells may have nonempty intersection. When and are not duplicates, the intersection is a (possibly empty) polytope of dimension at most . A pair of Pareto optimal actions and are called neighbours if has dimension . A game is called non-degenerate if there are no degenerate actions and no duplicate actions. So far none of the concepts have depended on the signal function. Local observability is a property of the signal and loss functions that allows the learner to estimate loss differences between actions and by playing only those actions. Precisely, a non-degenerate game is locally observable if for all pairs of neighbours and there exist functions such that
In the standard analysis of partial monitoring the functions and are used to derive importance-weighted estimators of the loss differences, which are used by localised versions of Exp3. In the following and are used more directly. A corollary of the classification theorem is that amongst the class of non-degenerate games with at least two actions, the minimax regret is if and only if the game is locally observable. The neighbourhood of is . The neighbourhood graph over has edges . For games without degenerate actions, the neighbourhood graph is connected. The following simple lemmas will be useful. The proofs are given in Appendix G.
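For finite games, the existence of such a pair of functions is a linear feasibility question. The sketch below assumes the condition has the form L(a, x) - L(b, x) = f(Phi(a, x)) + g(Phi(b, x)) for every outcome x, with outcomes indexed by integers and `loss[a][x]`, `signal[a][x]` as hypothetical tables. It does not check that a and b are neighbours, which is left to the caller, and it uses exact rational arithmetic to avoid rounding issues.

```python
from fractions import Fraction

# Illustrative sketch: the form of the observability condition is an assumption.

def pair_is_observable(loss, signal, a, b):
    """True if f, g exist with loss[a][x] - loss[b][x] = f(signal[a][x]) + g(signal[b][x])."""
    f_symbols = list(dict.fromkeys(signal[a]))      # distinct signals seen under a
    g_symbols = list(dict.fromkeys(signal[b]))      # distinct signals seen under b
    f_col = {s: i for i, s in enumerate(f_symbols)}
    g_col = {s: len(f_symbols) + i for i, s in enumerate(g_symbols)}
    n_vars = len(f_symbols) + len(g_symbols)

    # One linear equation per outcome; the last column is the right-hand side.
    rows = []
    for x in range(len(loss[a])):
        row = [Fraction(0)] * (n_vars + 1)
        row[f_col[signal[a][x]]] += 1
        row[g_col[signal[b][x]]] += 1
        row[n_vars] = Fraction(loss[a][x]) - Fraction(loss[b][x])
        rows.append(row)

    # Gauss-Jordan elimination over the rationals.
    next_pivot = 0
    for col in range(n_vars):
        found = next((r for r in range(next_pivot, len(rows))
                      if rows[r][col] != 0), None)
        if found is None:
            continue
        rows[next_pivot], rows[found] = rows[found], rows[next_pivot]
        pivot = rows[next_pivot]
        for r in range(len(rows)):
            if r != next_pivot and rows[r][col] != 0:
                factor = rows[r][col] / pivot[col]
                rows[r] = [u - factor * v for u, v in zip(rows[r], pivot)]
        next_pivot += 1

    # The system is inconsistent exactly when some row reads 0 = nonzero.
    return all(any(v != 0 for v in row[:-1]) or row[-1] == 0 for row in rows)
```

With bandit feedback, where the signal equals the loss of the played action, the check succeeds by taking f to be the identity and g its negation; with a constant signal and a loss difference that varies with the outcome, it fails.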
Let be distinct actions in a non-degenerate game and . Then there exists an action such that . Furthermore, if , then .
Consider a non-degenerate game and let and and . Then the graph is connected.
For the remainder of this section we assume the game is non-degenerate and locally observable. A brief treatment of degenerate games is deferred to the discussion. A game is globally observable if the learner can estimate the loss differences between any pair of Pareto optimal actions, possibly by playing other actions. Globally observable games are much more straightforward and are defined and analysed in Appendix C. The main theorem is the following, which improves on previous bounds that all depend on arbitrarily large game-dependent constants, even when and are fixed.
For any locally observable non-degenerate game: .
Before the proof we describe the algorithm, which appears to be new for partial monitoring. Note that Thompson sampling can suffer linear regret in partial monitoring (Appendix E). Let be the greedy action that minimises the -step Bayesian expected loss. The idea is to define a directed tree with vertex set and root and where all paths lead to . A little notation is needed. Define an undirected graph with vertices and edges by and , which is connected by Lemma 7. Note that when is unique, but this is not always the case. For let be the length of the shortest path from to in with by definition. Let be the ‘parent’ function:
The directed graph over vertex set with an edge from to if and is a directed tree with root .
Let be the set of ancestors of action in the tree defined in Lemma 7. We adopt the convention that . By the previous lemma, for all . Let be the set of descendants of , which does not include (Fig. 7). The depth of an action in round is the distance between and the root . We call an action anomalous in distribution in round if . Section 7 defines the ‘water transfer’ operator that corrects this deficiency by transferring mass towards the root of the tree defined in Lemma 7 while ensuring that (a) the loss suffered when playing according to the transformed distribution does not increase and (b) the distribution is not changed too much. The process is illustrated in Fig. 2 in Appendix F, where you will also find the proof of the next lemma.
Let and . Then: (1) . (2) for all . (3) for all .
input: and tree determined by
find action at the greatest depth such that .
if no such action is found, let and return.
for let .
let be the largest such that
let if and otherwise.
Our new algorithm samples from . Because of the plumbing and randomisation, the new algorithm is called Mario sampling. The proof of Theorem 7 follows immediately from Theorems 4 and 5 and the following lemma.
For Mario sampling: a.s.
We assume an appropriate zero measure set is discarded so that we can omit the qualification ‘almost surely’ for the rest of the proof. By the first part of Lemma 7,
For let be a pair of functions such that and for all . The existence of such functions is guaranteed by Lemma 7 and the fact that and because we assumed the game is non-degenerate and locally observable. The expected loss of can be decomposed in terms of the sum of differences to the root,
In the same way,
Then, because and are -measurable,
In many games there exists a constant such that almost surely for all and . In this case Part 3 of Lemma 7 improves to