# Generalised Entropy MDPs and Minimax Regret

Bayesian methods suffer from the problem of how to specify prior beliefs. One interesting idea is to consider worst-case priors. This requires solving a stochastic zero-sum game. In this paper, we extend well-known results from bandit theory in order to discover minimax-Bayes policies and discuss when they are practical.


## 1 Introduction

In this work, we consider the problem of a Bayesian agent interacting with a Markov decision process (MDP). However, the agent is unsure of how to select its prior distribution, and so prefers a choice that is safe against a potentially adversarial Nature. The problem is how to select such a policy in a computationally efficient manner. We first recall the definition of an MDP.

###### Definition 1.

A Markov decision process (MDP) on a state-action space $S \times A$ is a tuple $\mu = (S, A, P, r, T)$, where $S$ is a set of states, $A$ is a set of actions, and $P$ is a transition kernel, such that

$$s_{t+1} \mid s_t = s,\ a_t = a \sim P_{s,a},$$

where $r : S \times A \to \mathbb{R}$ is the reward function and $T$ is a (potentially random) horizon.

The agent’s utility is an additive function of the individual rewards:

$$U \triangleq \sum_{t=1}^{T} r_t. \tag{1}$$

For simplicity, we can assume that the reward function is known to the agent; for a finite state space, $r$ can then be taken to be a fixed vector. For any MDP $\mu$ and policy $\pi$, the expected utility is $\mathbb{E}^\pi_\mu U$, while the conditional expected utility is called the value function:

$$V^\pi_\mu(s) \triangleq \mathbb{E}^\pi_\mu\left(\sum_{t=1}^{T} r_t \,\middle|\, s_t = s\right). \tag{2}$$

For finite MDPs and a $\gamma$-geometrically distributed horizon $T$, the value function can be written as a vector $V^\pi_\mu = (I - \gamma P^\pi_\mu)^{-1} r$, where $P^\pi_\mu$ is the Markov chain induced on the MDP $\mu$ by the policy $\pi$. Since we are uncertain about $\mu$, we can instead define a prior distribution $\xi$ on a set of MDPs $M$. Then

$$V^\pi_\xi \triangleq \int_M V^\pi_\mu \, d\xi(\mu) \tag{3}$$

will denote the value function under the particular distribution $\xi$ on the MDPs. Finally, for a given probability measure $\beta$ on $S$, which can be taken to represent a starting-state distribution, we define the utility of a particular policy to be:

$$U(\xi, \pi) \triangleq \mathbb{E}^\pi_\xi(\beta^\top V^\pi_\mu) = \int_M \beta^\top V^\pi_\mu \, d\xi(\mu). \tag{4}$$
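For a finite set of MDPs, the quantities in Eqs. (2)–(4) can be computed directly. The following is a minimal Python sketch, where the two-state chains, the prior weights, and the function names are hypothetical illustrations rather than anything prescribed by the text:

```python
import numpy as np

def policy_value(P, r, gamma):
    """Value vector V = (I - gamma * P)^{-1} r for the Markov chain P
    induced by a fixed policy, with a gamma-geometric horizon."""
    n = P.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P, r)

def utility(beta, values, weights):
    """U(xi, pi) = sum_mu xi(mu) * beta^T V^pi_mu  (Eq. 4) for a finite
    set of MDPs with prior weights `weights` and value vectors `values`."""
    return sum(w * beta @ v for w, v in zip(weights, values))

# Hypothetical two-state chains induced by one fixed policy in two MDPs.
P1 = np.array([[0.9, 0.1], [0.2, 0.8]])
P2 = np.array([[0.5, 0.5], [0.5, 0.5]])
r = np.array([1.0, 0.0])
gamma = 0.9
beta = np.array([0.5, 0.5])          # uniform starting-state distribution
xi = [0.3, 0.7]                      # prior over the two MDPs
V = [policy_value(P, r, gamma) for P in (P1, P2)]
U = utility(beta, V, xi)
```

Each value vector satisfies the Bellman fixed point $V = r + \gamma P V$, which serves as a quick sanity check on the solve.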

There are two possible ways to interpret the measure $\xi$, depending on how it is chosen. If $\xi$ is selected by the agent, then it corresponds to the subjective belief of the decision maker about which is the most likely MDP a priori; $U(\xi, \pi)$ is then the expected utility of a particular policy under this belief. Let

$$U^*(\xi) \triangleq \sup_{\pi \in \Pi} U(\xi, \pi) \tag{5}$$

denote the Bayes-optimal utility for a belief $\xi$. We recall that this is a convex function of $\xi$ (cf. DeGroot, 1970). By definition, and due to convexity, the following bounds hold:

$$U(\xi, \pi) \le U^*(\xi) \le \int_M U^*(\mu)\, d\xi(\mu), \qquad \forall \pi \in \Pi. \tag{6}$$

In the above, the left-hand side is the utility of an arbitrary policy, while the right-hand side can be seen as the expected utility we would obtain if the true MDP were revealed to us.

The second view of $\xi$ is to assume that the MDP is actually drawn randomly from the distribution $\xi$. If $\xi$ is known, then the subjective value of a policy is equal to its true expected value. However, it is more interesting to consider the case where Nature selects some $\xi$ in an arbitrary way from a set of possible priors $\Xi$. Then we wish to find a policy achieving:

$$\max_{\pi \in \Pi} \min_{\xi \in \Xi} U(\xi, \pi). \tag{7}$$

One basic open question is whether the maximum exists. This is answered in the affirmative when the game between Nature and the agent has a value, i.e. when

$$U^* = \sup_{\pi \in \Pi} \inf_{\xi \in \Xi} U(\xi, \pi) = \inf_{\xi \in \Xi} \sup_{\pi \in \Pi} U(\xi, \pi) = U_*. \tag{8}$$
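When both the policy set and the prior set are finite, the game in Eq. (7) reduces to a matrix game, and a maximin mixture over policies can be found by linear programming. Below is a sketch using `scipy.optimize.linprog`; the $2 \times 2$ payoff matrix is a hypothetical toy example, not anything from the paper:

```python
import numpy as np
from scipy.optimize import linprog

def maximin_mixed_policy(U):
    """Solve max_{p in simplex} min_j (p^T U)_j by linear programming.
    U[i, j] = utility of (deterministic) policy i against prior j, so
    Eq. (7) over finite policy and prior sets is a matrix game.
    Variables: mixture p over the N policies, plus the game value v."""
    N, M = U.shape
    # minimise -v  subject to  v - (U^T p)_j <= 0,  sum(p) = 1,  p >= 0
    c = np.concatenate([np.zeros(N), [-1.0]])
    A_ub = np.hstack([-U.T, np.ones((M, 1))])
    b_ub = np.zeros(M)
    A_eq = np.concatenate([np.ones(N), [0.0]])[None, :]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * N + [(None, None)])
    p, v = res.x[:N], res.x[N]
    return p, v

# Matching-pennies-style toy game: maximin value 1/2 at p = (1/2, 1/2).
U = np.array([[1.0, 0.0], [0.0, 1.0]])
p, v = maximin_mixed_policy(U)
```

The dual of the same program yields Nature's minimax prior, illustrating why the value of the finite game exists.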

Let $\pi^*$ and $\xi^*$ be the maximin policy and minimax prior respectively. If the game has a value, then there exists an equalising policy $\pi^*$ which is optimal for some belief $\xi^*$, and vice versa. A sufficient condition for this to occur is for $U^*$ to be convex and differentiable everywhere (cf. Grünwald and Dawid, 2004). In order to study when this can occur, we first go over a couple of well-known facts.

## 2 Existence of maximin policies

###### Definition 2 (Policy).

Let $H$ be the set of all histories $h_t = (s_1, a_1, \ldots, a_{t-1}, s_t)$. A (stochastic) policy $\pi$ is a set of probability measures $\{\pi(\cdot \mid h) : h \in H\}$ on the set of actions $A$. We denote the set of all (history-dependent, stochastic) policies by $\Pi$.

###### Definition 3 (Deterministic policy).

A policy $\pi$ is deterministic if, for each history $h$, there exists an action $a_h$ such that $\pi(a_h \mid h) = 1$. We denote the set of deterministic policies by $\Pi_D$.

###### Definition 4 (Memoryless policy).

A policy $\pi$ is memoryless (or reactive) if, for all histories $h_t$ ending in the same state $s_t$, we have $\pi(\cdot \mid h_t) = \pi(\cdot \mid s_t)$. We denote the set of memoryless (stochastic) policies by $\Pi_M$.

The set of memoryless deterministic policies is denoted by $\Pi_{MD}$. Obviously, $\Pi_{MD} \subset \Pi_D \subset \Pi$ and $\Pi_{MD} \subset \Pi_M \subset \Pi$.

###### Definition 5 (Mixed policy).

A mixed policy is a probability measure over policies. If $\Pi'$ is a set of base policies, we denote the set of probability measures over $\Pi'$ by $\Delta(\Pi')$.

###### Fact 1.

For any MDP $\mu$, there exists a deterministic, memoryless policy $\pi^* \in \Pi_{MD}$ that is optimal, i.e. $V^{\pi^*}_\mu(s) \ge V^{\pi}_\mu(s)$ for all $\pi \in \Pi$ and all states $s$.

###### Fact 2.

For any distribution $\xi$ over MDPs, there exists a deterministic, history-dependent policy $\pi^* \in \Pi_D$ that is optimal, i.e. $U(\xi, \pi^*) = U^*(\xi)$.

A well-known game-theoretic result is that an equalising policy can always be found among mixed policies when $M$ is finite. However, for a fixed, finite $M$, there may not exist a deterministic equalising policy. For a finite horizon, the number of possible deterministic policies is finite, and consequently $U^*$ is piecewise-linear.

###### Remark 1.

If $M$ is finite, there exists a policy in $\Pi$ achieving the value of the game.

###### Proof.

For any mixed policy $\sigma \in \Delta(\Pi)$, there exists an equivalent stochastic policy in $\Pi$. This can be constructed by augmenting the state space to include the outcomes of fair coins. Now note that there exists an optimal mixed strategy achieving the value of the game, as the number of finite-horizon policies is finite. But there also exists a stochastic policy $\pi \in \Pi$ with the same action distribution for all histories $h$. ∎

Grünwald and Dawid (2004) make some interesting connections between maximum entropy and robust Bayesian decisions. In particular, they define the generalised entropy of a distribution to be the minimum loss achievable in a game between Nature and a decision maker. In our setting, it is natural to consider the following two loss functions:

$$L_1(\xi, \pi) = \beta^\top (V^*_\xi - V^\pi_\xi), \qquad L_2(\xi, \pi) = \beta^\top \mathbb{E}_\xi (V^*_\mu - V^\pi_\mu), \tag{9}$$

where $\beta$ is a distribution on the states. These correspond to the regret of $\pi$ relative to the $\xi$-optimal policy and to the oracle policy respectively. Now, let $\Xi$ be a set of probability distributions on $M$. One idea is to try to guard against the worst-case prior in a restricted set $\Xi_\phi$, constructed by constraining the expectation of a statistic $\Phi$ under the prior to be equal to its observed value $\phi$:

$$\Xi_\phi = \{\xi \in \Xi \mid \mathbb{E}_\xi(\Phi) = \phi\}. \tag{10}$$

One possibility is to use the cumulative state distribution for a particular policy $\pi$, i.e.

$$\Phi(\mu) = (I - \gamma P^\pi_\mu)^{-1}, \tag{11}$$

where a common choice for $\pi$ is the optimal policy $\pi^*(\mu)$ for MDP $\mu$, used for example in Mannor and Shimkin (2003). In that case it is easy to see that $\Phi(\mu) r = V^{\pi^*(\mu)}_\mu$. A sufficient condition for a policy $\pi$ to be robust Bayes against $\Xi_\phi$ is for its loss to be linear in the statistic, i.e. $L(\xi, \pi) = \lambda_0 + \lambda^\top \mathbb{E}_\xi(\Phi)$ for all $\xi \in \Xi$. Then it is also true (see Grünwald and Dawid, 2004, Theorem 7.1) that $\pi$ is an equalising policy against $\Xi_\phi$.

How can $\Phi$ be calculated? When the set of MDPs is finite, it is defined through the linear equation $\Phi(\mu)(I - \gamma P^\pi_\mu) = I$. Given, then, a sequence of observations, a statistic, and a resulting set of priors $\Xi_\phi$, one important question is how we can efficiently calculate such policies. We explain this in the following section.

## 3 Calculating robust policies

One potential solution involves finding the minimax prior and then the policy that is Bayes-optimal with respect to it. In previous work, Koolen (2006) has shown that the minimax prior can be found via a concave-linear optimisation for the ‘truth-finding’ game; however, this does not hold in general. It has been shown by Freund and Schapire (1999) that the multiplicative weights algorithm can be used for zero-sum games, as long as an oracle that can compute best responses is available. In our setting, this would correspond to Nature having the ability to efficiently construct a worst-case MDP given a policy. On the other hand, Vrieze and Tijs (1982) show that the value of stochastic zero-sum games can be achieved asymptotically via fictitious play, even when the game matrix is only known approximately, as long as the approximation converges to the true game matrix.

One idea is to apply results from the experts literature, such as the weighted majority algorithm (WMA). We start by assuming that, in each round, the Decision Maker has full access to information about the rewards of the past round; that is, she can observe the outcomes of all the policies that were previously available.

Assume that $M$ is finite (if $M$ is infinite, a grid over the MDPs can be used to obtain a finite set; the approximation error for an MDP whose transition matrix and mean reward are $\epsilon$-close to those of an MDP on the grid can then be bounded in terms of $\epsilon$). Let $u_\pi$ be the vector of values of policy $\pi$ over the given set of Markov decision problems $M$:

$$u_\pi = (U(\mu, \pi))_{\mu \in M}.$$

Assume that $\Pi$ is also finite, and denote by $u_\xi$ the vector of values of each policy under the prior $\xi$:

$$u_\xi = (U(\xi, \pi))_{\pi \in \Pi}.$$

To apply WMA (Alg. 2) we execute policies in rounds. At each round $k$, we select a policy from a distribution $q_k$, and Nature chooses some prior $\xi^{(k)}$. Moreover, denote by $x_{\pi,k}$ the total realised reward obtained by following policy $\pi \in \Pi$ in the $k$-th round. Each $x_{\pi,k}$ is a random variable with expected value $U(\xi^{(k)}, \pi)$. Finally, denote by $x^{(k)}$ the vector of sampled utilities of all policies at round $k$:

$$x^{(k)} = (x_{\pi_1,k},\ x_{\pi_2,k},\ \ldots,\ x_{\pi_N,k}).$$
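The round structure just described can be sketched as follows. Here Nature is replaced by an assumed best-response oracle over a finite game matrix, as discussed earlier; the toy matrix, step size, and the gains-based variant of the multiplicative-weights update are hypothetical illustrations rather than the paper's exact algorithm:

```python
import numpy as np

def wma_game(U, K, ell):
    """Multiplicative-weights sketch for the policy player of Eq. (7).
    U[i, j] in [0, 1] is the utility of policy i under prior j; at each
    round Nature best-responds with the prior minimising the expected
    utility under the current mixture q_k.  Returns the average mixture,
    an estimate of a maximin policy."""
    N, M = U.shape
    w = np.ones(N)                    # one weight per policy
    avg_q = np.zeros(N)
    for _ in range(K):
        q = w / w.sum()               # current mixture q_k
        j = np.argmin(q @ U)          # Nature's best response xi^(k)
        x = U[:, j]                   # payoff vector u_{xi^(k)}
        w *= (1.0 + ell * x)          # gains version of the MW update
        avg_q += q / K
    return avg_q

U = np.array([[1.0, 0.0], [0.0, 1.0]])   # toy game with value 1/2
q = wma_game(U, K=2000, ell=0.05)
```

By the Freund and Schapire (1999) argument, the time-averaged mixture approaches a maximin strategy of the matrix game.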

### 3.1 Analysis

The main issue is the computation of the values $x_{\pi,k}$, which are used in steps 3 and 4. Depending on the sizes of the policy and MDP spaces, accurate or approximate values for this quantity can be obtained.

#### When N (number of policies) and M (number of MDPs) are small.

Then we can retrieve the expected value of each policy exactly via $V^\pi_\mu = (I - \gamma P^\pi_\mu)^{-1} r$, where $P^\pi_\mu$ is the induced kernel; the matrix inversion requires $O(|S|^3)$ time. The expected reward for sampling a policy from the distribution $q_k$ is

$$\mathbb{E}_{\pi \sim q_k}[x_{k,\pi}] = x^{(k)} \cdot q_k.$$

The total expected reward over all rounds is therefore

$$V^{(K)}_{\text{WMA}} \triangleq \sum_{k=1}^K x^{(k)} \cdot q_k.$$
###### Theorem 1 (Arora et al., 2012).

The Multiplicative Weights algorithm guarantees that, after $K$ rounds, for any distribution $Q$ on the decisions it holds that:

$$V^{(K)}_{\text{WMA}} \ge \sum_{k=1}^K \left(u_{\xi^{(k)}} - \ell\, |u_{\xi^{(k)}}|\right) \cdot Q - \frac{\ln N}{\ell},$$

where $|u_{\xi^{(k)}}|$ is the vector obtained by taking the coordinate-wise absolute value of the vector containing the expected values $U(\xi^{(k)}, \pi)$.

#### If M is small, but N is large

then computation of $V^\pi_\mu$ becomes difficult. We can, however, approximate the true expected values by Monte Carlo sampling over $S$ iterations. We substitute $U(\xi^{(k)}, \pi)$ with its estimator $\hat{U}(\xi^{(k)}, \pi, S)$ (lines 3, 4 & 5 of algorithm WMA) and call this modification of the algorithm WMA-SR. Moreover, for each round $k$, we need to introduce an estimation error term (which depends on the number of Monte Carlo iterations), since the weights are now updated using approximations rather than the true values.

The value earned by using WMA-SR over all rounds is

$$V^{(K)}_{\text{WMA-SR}} \triangleq \mathbb{E}\left(\sum_{k=1}^K x^{(k)} \cdot q_k\right) = \sum_{k=1}^K u_{\xi^{(k)}} \cdot q_k,$$

where $u_{\xi^{(k)}} = (U(\xi^{(k)}, \pi))_{\pi \in \Pi}$.
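The Monte Carlo estimator can be sketched as a plain average of realised returns. In the snippet below, `sample_return` is a hypothetical stand-in for one rollout of a policy under an MDP drawn from the prior, and the Bernoulli return is an assumed toy example:

```python
import numpy as np

def mc_utility_estimate(sample_return, S, rng):
    """Monte Carlo estimator hat-U(xi, pi, S) used by WMA-SR: the average
    of S independent realised returns.  `sample_return` is assumed to
    draw one return x_{pi,k} with E[x] = U(xi, pi)."""
    return np.mean([sample_return(rng) for _ in range(S)])

# Hypothetical rollout whose return is Bernoulli with mean 0.3.
rng = np.random.default_rng(0)
estimate = mc_utility_estimate(lambda r: float(r.random() < 0.3),
                               S=10_000, rng=rng)
```

Since the estimator is unbiased and the returns are bounded, the per-round error term $E^{(S)}(k)$ below concentrates at rate $O(1/\sqrt{S})$.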

First, we prove a lemma for the approximated expected values.

###### Lemma 1.

Assume that all policy rewards lie in $[-1, 1]$, and let $\ell \le 1/2$. Then after $K$ rounds it holds that:

$$V^{(K)}_{\text{WMA-SR}} \ge \sum_{k=1}^K \hat{U}(\xi^{(k)}, \pi_i, S) - \ell \sum_{k=1}^K |\hat{U}(\xi^{(k)}, \pi_i, S)| - \frac{\ln N}{\ell}$$

for all $\pi_i \in \Pi$.

Observe that if the rewards are not stochastic, then Lemma 1 reduces to the standard expert setting (cf. Arora et al., 2012; Freund and Schapire, 1999).

###### Theorem 2.

Assume that all policy rewards lie in $[-1, 1]$, and let $\ell \le 1/2$. Then after $K$ rounds, for the total expected reward, it holds that:

$$V^{(K)}_{\text{WMA-SR}} \ge \sum_{k=1}^K U(\xi^{(k)}, \pi_i) - \ell \sum_{k=1}^K |U(\xi^{(k)}, \pi_i)| - \frac{\ln N}{\ell} - \sum_{k=1}^K E^{(S)}(k)$$

for all $\pi_i \in \Pi$, where $E^{(S)}(k)$ is the error term of the $k$-th round (and $S$ denotes the number of Monte Carlo simulations), and

$$\mathbb{P}\left(\left|\sum_{k=1}^K E^{(S)}(k)\right| < \varepsilon\right) \ge 1 - 2\exp\left(-\frac{K\varepsilon^2}{2}\right).$$

The error bound is obtained by applying Azuma’s inequality, since $E^{(S)}(k)$ is a martingale difference sequence and all rewards lie in $[-1, 1]$. We can also obtain a result for a distribution $P$ over policies.

###### Corollary 1.

After $K$ rounds, for any distribution $P$ on the decisions, it holds that:

$$\sum_{k=1}^K u_{\xi^{(k)}} \cdot q_k \ge \sum_{k=1}^K \left(u_{\xi^{(k)}} - \ell\, |u_{\xi^{(k)}}|\right) \cdot P - \frac{\ln N}{\ell} - \sum_{k=1}^K E^{(S)}(k),$$

where $|u_{\xi^{(k)}}|$ is the vector obtained by taking the coordinate-wise absolute value of $u_{\xi^{(k)}}$.

###### Definition 6.

The regret of the learning algorithm against the optimal distribution $P^\star$ is

$$B(K) = \sum_{k=1}^K u_{\xi^{(k)}} \cdot P^\star - \sum_{k=1}^K u_{\xi^{(k)}} \cdot q_k.$$

Corollary 1 can be used in order to bound the regret. To that end, we first need the following theorem.

###### Theorem 3.

After $K$ rounds of applying the modified weighted majority algorithm WMA-SR, for any distribution $P$ it holds that:

$$\sum_{k=1}^K u_{\xi^{(k)}} \cdot P - \sum_{k=1}^K u_{\xi^{(k)}} \cdot q_k \le 2\sqrt{K \ln N} + \sum_{k=1}^K E^{(S)}(k).$$
###### Corollary 2.

When algorithm WMA-SR is run with parameter $\ell = \sqrt{\ln N / K}$, the regret $B(K)$ of the algorithm is bounded by $2\sqrt{K \ln N} + \sum_{k=1}^K E^{(S)}(k)$.

One can also show that the algorithm converges by dividing time into epochs. Choosing the epoch lengths appropriately gives convergence with probability one; one then proceeds similarly to Freund and Schapire (1999) to show the following.

###### Theorem 4.

Suppose we repeat the game for an unbounded number of rounds. Then, for every $\varepsilon > 0$, the average regret satisfies $B(K)/K < \varepsilon$ for all but a finite number of values of $K$, with probability one.

#### View as a bandit problem.

To avoid uniformly sampling all policies, we could cast this as a contextual bandit problem, mapping the prior selected by Nature at each round to the context. With a slight modification of the linear contextual bandit algorithm presented in Auer (2002), we then obtain a similar bound, which is however linear in the number of policies, making this approach impractical. Since our estimates converge to the values of the policies, we can recover the value of the game if Nature uses fictitious play with respect to our estimates (Vrieze and Tijs, 1982).
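Fictitious play on a finite game matrix, in the spirit of the Vrieze and Tijs (1982) scheme referenced above, can be sketched as follows; the toy payoff matrix is hypothetical, and the returned estimates bracket the game value:

```python
import numpy as np

def fictitious_play(U, K):
    """Fictitious-play sketch for the finite game U[i, j] (policy i vs
    prior j): each player best-responds to the opponent's empirical
    average of past plays.  Returns lower and upper estimates of the
    game value from the empirical mixtures."""
    N, M = U.shape
    counts_i = np.zeros(N)
    counts_j = np.zeros(M)
    i, j = 0, 0
    for _ in range(K):
        counts_i[i] += 1
        counts_j[j] += 1
        i = np.argmax(U @ (counts_j / counts_j.sum()))   # policy player
        j = np.argmin((counts_i / counts_i.sum()) @ U)   # Nature
    p = counts_i / K                                     # empirical mixtures
    q = counts_j / K
    return (p @ U).min(), (U @ q).max()

U = np.array([[1.0, 0.0], [0.0, 1.0]])   # toy game with value 1/2
lower, upper = fictitious_play(U, K=5000)
```

The lower estimate is the worst case of the empirical policy mixture, the upper estimate the best response against Nature's empirical mixture; both converge to the value of the game.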

## 4 Conclusion

We have discussed the links between robust reinforcement learning and maximin policies. In particular, an interesting idea to explore is the use of maximin policies against a constrained set of priors $\Xi_\phi$. Such a set is easy to define for a finite number of MDPs. However, to put the idea into practice we also need to establish computational procedures for approximately calculating such policies. Although this seems achievable for small problems, there does not appear to be a practical procedure for larger ones. One potential direction would be to choose statistics that inherently make the problem amenable to simple solutions.

## References

• Arora et al. [2012] Sanjeev Arora, Elad Hazan, and Satyen Kale. The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing, 8(1):121–164, 2012.
• Auer [2002] Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3:397–422, 2002.
• DeGroot [1970] Morris H. DeGroot. Optimal Statistical Decisions. John Wiley & Sons, 1970.
• Freund and Schapire [1999] Yoav Freund and Robert E Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29(1):79–103, 1999.
• Grünwald and Dawid [2004] Peter D. Grünwald and A. Philip Dawid. Game theory, maximum entropy, minimum discrepancy and robust Bayesian decision theory. Annals of Statistics, 2004.
• Koolen [2006] Wouter Michiel Koolen. Discovering the truth by conducting experiments. Master’s thesis, University of Amsterdam, 2006.
• Mannor and Shimkin [2003] Shie Mannor and Nahum Shimkin. The empirical Bayes envelope and regret minimization in competitive Markov decision processes. Mathematics of Operations Research, 28(2):327–345, 2003. ISSN 0364-765X.
• Vrieze and Tijs [1982] O.J. Vrieze and S.H. Tijs. Fictitious play applied to sequences of games and discounted stochastic games. International Journal of Game Theory, 11(2):71–85, 1982. ISSN 0020-7276. doi: 10.1007/BF01769064.