# Playing with and against Hedge

Hedge has been proposed as an adaptive scheme that guides an agent's decisions in resource selection and distribution problems which can be modeled as a full-information multi-armed bandit game. Such problems are encountered in the areas of computer and communication networks, e.g. network path selection, load distribution, and network interdiction, and also in problems in the area of transportation. We study Hedge under the assumption that the total loss that can be suffered by the player in each round is upper bounded, and in this paper we determine the worst performance of Hedge under this assumption.


## 1 Introduction

### 1.1 The game model

Firstly we summarize the generalized full information multi-armed bandit game: A machine offers $N$ betting options (or "arms"). In each round the player bets an amount of money $M$ by assigning a fraction $p_i$ of $M$ to option $i$. The machine responds by determining a penalty (or reward) factor $\ell_i$ for each option $i$; effectively the player's result in the current round is equal to

$$R=\sum_{i=1}^{N} p_i M \ell_i = M\sum_{i=1}^{N} p_i \ell_i \tag{1}$$

The game is played for a pre-determined total number of rounds. The player aims at minimizing the total loss or maximizing the total reward over the whole game.

This game is known as the full information version of the multi-armed bandit game [Rob85], since the full penalty vector $(\ell_1,\ldots,\ell_N)$ is announced to the player after each round, the other option being to announce only the outcome $R$. Also, the game is called generalized in the sense that the player is allowed to play mixed strategies, i.e. to distribute the total bet over different options in the same round.

An ordinary assumption is that the bet per round remains constant; in the following, we shall set $M=1$ without loss of generality. However, a less ordinary assumption that we use in this paper is that the loss that can be inflicted on the player in each round is upper bounded. We express this restriction by upper-bounding the sum of penalties. Again, if we normalize the penalties w.r.t. the upper bound, we can write

$$\sum_{i=1}^{N} \ell_i \le 1 \tag{2}$$

In other words the machine works within a limited penalty budget in each round. Effectively the normalization assumption together with (2) imply that $\sum_{i=1}^{N} p_i \ell_i \le 1$, thereby restricting the maximum loss per round to one unit. The $\ell_i$'s can become rewards by making them negative and restricting their sum to be greater than $-1$. In this paper we use losses.
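As a concrete illustration, the round mechanics of (1) together with the budget constraint (2) can be sketched as follows (a minimal sketch; the function name and sample numbers are ours, not the paper's):

```python
def round_loss(p, losses, M=1.0):
    """Player's loss for one round: R = M * sum_i p_i * l_i, as in Eq. (1)."""
    assert abs(sum(p) - 1.0) < 1e-9       # bet fractions sum to 1
    assert sum(losses) <= 1.0 + 1e-9      # limited penalty budget, Eq. (2)
    return M * sum(pi * li for pi, li in zip(p, losses))

# Equal bets against a machine that spends its whole budget on option 0:
print(round_loss([0.25] * 4, [1, 0, 0, 0]))  # 0.25
```

With equal bets the per round loss never exceeds $1/N$ times the spent budget, which previews the fair game interpretation below.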

### 1.2 A fair game interpretation

The loss based game seems unfair, since the player always suffers losses, but it can easily be converted to a fair game: Suppose that in each round the player is awarded an amount equal to $1/N$; therefore the gain of the player in a single round is

$$g=\frac{1}{N}-\sum_{i=1}^{N} p_i \ell_i.$$

Then the player could choose $p_i=1/N$ for every $i$, which implies $g=\frac{1}{N}\left(1-\sum_{i=1}^{N}\ell_i\right)\ge 0$, with $g=0$ whenever the machine exhausts its penalty budget. Effectively, if the player insists on bet equipartition, no positive gains can be achieved either. If the player can predict the option with the minimum loss that will appear in the next round, say option $j$, the gain will become $g=1/N-\ell_j\ge 0$ (since $\ell_j\le 1/N$) by choosing $p_j=1$. For an arbitrary bet distribution vector the gain can be either positive or negative. The worst possible gain outcome is $1/N-1$, and will appear if the player is extremely unlucky, so as to bet exclusively on the worst option.

It can also be shown that a uniformly random selection of weights and losses will produce an average loss equal to $1/N$, thus a neutral outcome of the gain. Any successful prediction can tilt the balance of the game in favor of the player. A well known fact in the literature is that if the arms of the bandit behave according to stationary processes, the Hedge algorithm (see Section 1.3) brings the per round gain asymptotically to the optimum, i.e. towards the gain of the statistically best arm. Effectively, Hedge will take advantage of any difference in the statistical behavior of the arms, and due to (2) it will produce positive gains.
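The neutrality claim is easy to check by simulation: with bet and penalty vectors drawn independently and uniformly from the probability simplex, the expected per round loss is $\sum_i \mathbb{E}[p_i]\,\mathbb{E}[\ell_i]=N\cdot\frac{1}{N}\cdot\frac{1}{N}=\frac{1}{N}$. A quick Monte Carlo sketch (parameter values are arbitrary):

```python
import random

def simplex_sample(n, rng):
    """Uniform sample from the probability simplex (normalized exponentials)."""
    x = [rng.expovariate(1.0) for _ in range(n)]
    s = sum(x)
    return [v / s for v in x]

rng = random.Random(0)
N, trials = 4, 200_000
avg = sum(sum(p * l for p, l in zip(simplex_sample(N, rng),
                                    simplex_sample(N, rng)))
          for _ in range(trials)) / trials
print(abs(avg - 1 / N) < 1e-2)  # True: the average loss concentrates near 1/N
```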

However, in this paper we explore the performance of Hedge when there is no stationarity or other assumption that restricts the multi-armed bandit behavior. In fact, we determine the behavior of the bandit that results in the worst possible outcome for the player, when the player has adopted Hedge as a decision making aid.

### 1.3 The player’s hand and Hedge algorithm

The player aims at minimizing the total loss over the whole game; this aim can only be achieved by suitably controlling the bet distribution in each round. Clearly, if the player could predict the result of the current round, she would choose the option with the minimum loss and bet the available unit of money on this option. However, the player can only rely on past outcomes.

This is where the Hedge or multiplicative updates algorithm [SE95] enters in order to possibly guide the player's hand. Hedge maintains a vector of weights $w^t=(w^t_1,\ldots,w^t_N)$, such that $w^t_i>0$ for all $i$, and $\sum_{i=1}^{N}w^0_i=1$. In each round Hedge chooses the bet allocation to be

$$p^t_i=\frac{w^t_i}{\sum_{j=1}^{N}w^t_j} \tag{3}$$

When the $t$'th round loss vector $\ell^t$ is revealed, $w^t$ is updated by using

$$w^{t+1}_i=w^t_i\,\beta^{\ell^t_i} \tag{4}$$

for some fixed $\beta$, such that $0<\beta<1$.
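A minimal sketch of the update rules (3) and (4), with an assumed $\beta=0.5$ and an adversary that alternates unit penalties between the two options:

```python
def hedge_bets(w):
    """Eq. (3): bet allocation proportional to the current weights."""
    s = sum(w)
    return [wi / s for wi in w]

def hedge_update(w, losses, beta):
    """Eq. (4): multiplicative update w_i <- w_i * beta**l_i, 0 < beta < 1."""
    return [wi * beta ** li for wi, li in zip(w, losses)]

w = [0.5, 0.5]
for losses in [(1, 0), (0, 1)]:      # adversary alternates unit penalties
    p = hedge_bets(w)
    w = hedge_update(w, losses, beta=0.5)
print(hedge_bets(w))  # [0.5, 0.5] -- a full alternation restores the bets
```

Note that after one full alternation both weights have been multiplied by the same factor $\beta$, so the normalized bets return to their starting values; this is the mechanism behind the rotating schemes of Section 4.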

Our objective is to find the worst possible performance of Hedge over all possible responses of the multi-armed bandit during a game of fixed duration.

### 1.4 Overview of previous results

In [ACBFS95] Auer et al. have proved that the performance of Hedge is asymptotically optimal for the non generalized game, with an error term that is sublinear in the number of rounds. Note that the theorem holds for players that do not use mixed strategies. In the "optimal" game the player is supposed to choose the same overall most favorable option in each round, i.e. the option with the lowest loss over the whole game.

The model used in [ACBFS95] effectively considers a machine with $N$ arms that is somehow imperfect and produces statistically unbalanced results. In a perfect machine all arms would be identical, but in an imperfect machine there is an arm that statistically produces the most favorable results. The intention of the player is to discover exactly this arm. Since the player cannot predict the outcome of any arm in a specific round, the optimal strategy of the player would be to always bet on the same most favorable arm. It could take a lot of time and money before a real player would be able to gather enough statistics to determine the best arm. The aforementioned theorem guarantees that Hedge somehow digs out this information for the player that has no previous experience with the machine.

Hedge has become quite popular recently and in different fields. For example, Chastain et al. have shown that Hedge is also applicable in coordination games and evolution theory and have noted that this algorithm “has surprised us time and again with its seemingly miraculous performance and applicability” [CLPV14].

There are also a number of Hedge variations and alternatives. Auer, Cesa-Bianchi et al. have proposed the Exp3 algorithm in [ACBFS02]. Allenberg-Neeman and Neeman proposed the GL (Gain-Loss) algorithm for the full information game [ANN04]. Dani, Hayes, and Kakade have proposed the GeometricHedge algorithm in [DKH08]. A modification was also proposed by Bartlett, Dani et al. in [BDH08]. Cesa-Bianchi and Lugosi have proposed the ComBand algorithm for the bandit version [CBL12]. A comparison of algorithms can be found in the paper by Uchiya et al. [UNK10].

In two previous works of ours [AL14, AL15] we have mentioned a number of application areas and problems that could make use of the multi-armed bandit game with upper-bounded loss in each round, i.e. as modified by (2). Possible applications include path selection in computer networking, and network interdiction with applications in the area of transportation and border control.

In the aforementioned works we have studied the worst possible performance of Hedge (when it guides the player's hand in the aforementioned multi-armed bandit game) by exploring the penalty sequence vectors that maximize the player's total loss. In particular, in [AL14] we have given a methodology for the calculation of the maximum total loss that is suffered by the player, and also for the calculation of the associated penalty vectors that are produced by the multi-armed bandit machine. We have also provided a formula for the total loss and the associated penalties that is a recursive function of the number of rounds and can be exploited for the numerical calculation of the performance of multiple round games. By using this recursive calculation we can get a fair idea of the "optimal" behavior of the machine (which is of course worst for the player). In general the resulting optimal penalty vectors are real valued, but it appears that they become binary after a number of rounds. In [AL14] we have shown that if the penalties are restricted to binary values, the machine's optimal plan (worst for the player) can be easily calculated by using a greedy algorithm. A consequence of greediness is that the penalty plan becomes periodic, and the losses suffered by the player also become periodic.

In [AL15] we have given details and examples of analytic calculation of the optimal plan from the machine's viewpoint, together with the resulting total loss. The results have shown that the optimal penalties are generally non binary, at least in the first few rounds. However, the calculation of the optimal penalty plan in terms of closed form formulas appears to be partly infeasible and partly too complicated as the number of rounds $T$ of the game increases. Complexity shows in the shape of the maximum total loss, which is a continuous curve with lots of kinks, i.e. points of sudden slope change.

### 1.5 New contributions

In summary, in our past work (a) we have provided numerical results and examples, (b) we have shown that the optimal binary penalty scheme is greedy, and (c) we have provided accurate results for games of short duration (up to three rounds).

In this paper we have more or less completed the analysis of the aforementioned problem:

1. We show how a game of any number of rounds can be analyzed accurately, by reducing its solution to a single variable optimization problem.

2. We calculate the distance between the optimal solution and a greedy binary solution, and we show that the binary solution is actually a very good approximation.

3. We describe the general "nature" of the optimal solution, which is greedy if the number of rounds does not exceed a well defined limit, but past this limit the optimal scheme evolves into an optimized periodic scheme. Effectively, the complexity of the problem is polynomial.

4. We give an explicit solution if the player starts Hedge with equal arm weights.

In summary, although some minor details can still be filled in, we give a full solution to the problem of the worst performance of Hedge when the multi-armed bandit functions under a limited penalty budget.

Finally a note on the methodology used in this paper: In order to calculate the worst performance of Hedge we try to determine the bandit machine behavior that is most unfavorable to the player. This approach is consistent with the so called "adversarial analysis" of online algorithms. According to S. Irani and A. Karlin (in section 13.3.1 of [Hoc96]) a technique for finding bounds is to use an "adversary" who plays against an algorithm and concocts an input which forces the algorithm to incur a high cost. In our analysis the adversary tries to maximize Hedge's total loss by controlling the penalty vector.

Although an “adversary” is a fictional methodological aid for the aforementioned analysis, there are applications and problem interpretations that make use of a real adversary. In [AL14] we have described a collection of such applications, e.g. network interdiction and border control.

## 2 The loss maximization problem

### 2.1 Problem formulation

The "adversary" controls the bandit machine so as to maximize the player's losses. A naive adversary would probably use the greedy approach, i.e. in round $t$ the adversary would maximally penalize the option with the highest bet by setting $\ell^t_j=1$ for the option $j$ such that $p^t_j=\max_i p^t_i$. We shall see later that this approach is less naive than expected. However, in principle a more sophisticated adversary would be expected to solve an optimization problem that takes into account the whole duration of the game. We assume that the "adversary" is aware of the fact that the player is playing Hedge and is also aware of the value of the adjustment speed parameter $\beta$. Then the adversary has to solve the following optimization problem [AL14]:

###### Problem 1

Given a number of options $N$, an initial normalized weight vector $w^0$, and a Hedge parameter $\beta$, find the sequence $\ell^t$, $t=0,\ldots,T-1$, that maximizes the player's total cumulative loss

$$L_H(\beta)=\sum_{t=0}^{T-1}\sum_{i=1}^{N}p^t_i\ell^t_i \tag{5}$$

where $\ell^t$ is the penalty vector in round $t$ ($t=0,\ldots,T-1$), such that $\sum_{i=1}^{N}\ell^t_i\le 1$, and the weights in round $t$ are updated according to Hedge, i.e.

$$w^t_i=w^{t-1}_i\beta^{\ell^{t-1}_i},\qquad p^t_i=\frac{w^t_i}{\sum_{j=1}^{N}w^t_j}\quad(t\ge 1) \tag{6}$$

for $i=1,\ldots,N$ and $t=1,\ldots,T-1$.

Clearly the objective function (5) is a function of the initial weights $w^0_i$, the penalty variables $\ell^t_i$, and $\beta$. Due to the normalization of weights and penalties, the number of independent variables is reduced accordingly. The player chooses the independent initial weights and lets Hedge make all future decisions. The independent penalty variables are under the control of the bandit machine, i.e. the adversary. In the following we write $L(w^0)$ or $L(w^0;\ell^0,\ldots,\ell^{T-1})$ instead of $L_H(\beta)$ whenever it is necessary to refer to these variables.

For the adversary, Problem 1 is a well defined maximization problem. Analysis should aim at finding the maximum value of $L_H$, as given by (5), together with the penalty vectors $\ell^t$, $t=0,\ldots,T-1$, that achieve the maximum. We show that an iterative solution is possible, but it hardly produces any closed form solutions. We shall also present numerical results for the maximum total loss as a function of the initial weights. The analysis of short games, i.e. games with a small number of rounds $T$, as given in [AL15], has shown that the optimal strategy of the adversary soon ends up with extreme penalty values, which in this case become binary due to normalization. This means that the optimal penalties, from the adversary's viewpoint, may take values between zero and one in the first few rounds, but soon the adversary is obliged to place all the available penalty (i.e. one unit) into a single option in each round. On the other hand, we have shown in [AL14] that the optimal strategy, under the assumption of binary penalties, is greedy (and simple to calculate). A straightforward conclusion that follows from the greediness is that the optimal binary strategy is also periodic. This means that the adversary assigns a unitary penalty to a different option in each round in a cyclic manner, the cycle consisting of $N$ rounds, where $N$ is the number of options (provided of course that the game duration is long enough to accommodate more than one cycle).

## 3 First results

In this section we reiterate some of the so far known results. Firstly, we give an outline of the recursive properties of the optimal solution to Problem 1. Then we use these properties in producing an initial set of numerical results.

### 3.1 Iterative properties of the optimal solution

If in the current round the player (guided by Hedge) bets according to a weight vector $v$, and the adversary generates penalties $\ell$, then in the next round the player will use weights $W(v,\ell)$, where

$$W_i(v,\ell)=\frac{v_i\beta^{\ell_i}}{\sum_{j=1}^{N}v_j\beta^{\ell_j}}\quad(i=1,2,\ldots,N) \tag{7}$$

The total loss of a $T$ round game, which starts with weights $w$, can be written as the sum of the losses of a single round game, which starts with weights $w$, and a $(T-1)$ round game, which starts with weights $W(w,\ell^0)$, i.e.

$$L^{T-1}(w;\ell^0,\ell^1,\ldots,\ell^{T-1})=L^0(w;\ell^0)+L^{T-2}(W(w,\ell^0);\ell^1,\ldots,\ell^{T-1}) \tag{8}$$

Assuming that the solution to Problem 1 is

$$L^{T-1}_{\max}(w)=\max_{\ell^0,\ldots,\ell^{T-1}}L^{T-1}(w;\ell^0,\ldots,\ell^{T-1})$$

the following iterative formula for $L^{T-1}_{\max}$ can be derived from (8):

$$L^{T-1}_{\max}(w)=\max_{\ell}\left[L^0(w;\ell)+L^{T-2}_{\max}(W(w,\ell))\right] \tag{9}$$

where $\ell$ is the penalty vector chosen by the adversary in the initial round.

The associated optimal penalties can also be computed iteratively. Let $\lambda^{T-1;t}(w)=(\lambda^{T-1;t}_1(w),\ldots,\lambda^{T-1;t}_N(w))$, where $\lambda^{T-1;t}_i(w)$ denotes the optimal penalty of the $i$'th option in the $t$'th round of a $T$ round game (starting with weights $w$). Then, the optimal penalties are given by the following formulas:

$$\lambda^{T-1;0}(w)=\arg\max_{\ell}\left[L^0(w;\ell)+L^{T-2}_{\max}(W(w,\ell))\right] \tag{10}$$
$$\lambda^{T-1;t+1}(w)=\lambda^{T-2;t}(W(w,\lambda^{T-1;0}(w)))\quad(t=0,1,\ldots,T-2) \tag{11}$$

More details of the derivation together with sample calculations can be found in [AL14].

### 3.2 Numerical results based on the iterative solution

The numerical calculation is based on sampling the initial weight vector $w^0$. For example, if $N=2$, then the initial weights are, say, $w$ and $1-w$, with $w$ sampled at resolution $\delta$, i.e. $w\in\{0,\delta,2\delta,\ldots,1\}$. The calculation starts with $L^0_{\max}$ for all samples of $w$, and continues for $T=2,3,\ldots$ The accuracy of the outcome is conditional on the sampling resolution, i.e. on $\delta$, and a sufficiently accurate computation for a game with more rounds requires an increased resolution. We give an example of the outcome of such a computation in Fig. 1, in which we have used a rather exaggerated value of $\beta$ in order to expose the shapes of the curves more clearly. The curves are non linear and exhibit lots of sharp points. An example of the calculation of the associated penalties is given in Section 6.

Note that formula (9) does not imply that an optimal adversarial strategy for a $(T-1)$ round game is the same as the strategy of the first $T-1$ rounds in a $T$ round game, due to the fact that, given an initial vector of weights, $L^{T-2}_{\max}$ enters the computation of $L^{T-1}_{\max}$ with a different vector of weights than the initial vector. In other words, the total number of rounds affects the choice of penalties in the initial rounds too.

### 3.3 Initial observations on the expected solution

The obvious attack on the problem is to directly use (9) (and (11) for the penalties). In [AL15] we have provided detailed results for two option games with 1, 2, and 3 rounds ($N=2$, $T\le 3$). The results clearly illustrate the complexity of the analytic approach, mainly due to the fact that there is a population of sharp points of the function $L_{\max}$ that grows with the number of rounds, and requires very long formulas for the calculation of the values of $w$ that produce the sharp points. Also, these results can possibly give some initial indications on the nature of the expected solution to the general problem.

###### Example 1

Two and three round games starting with equal weights: Consider the case $w^0=(1/2,1/2)$, which means that the player begins the first round by equally distributing the bet to the two available options. Note that this choice is optimal for the player, as indicated by the results in [AL15] and also by the numerical results of Fig. 1. In a single round game the loss outcome for the player is $1/2$, i.e. constant regardless of the adversary's choice of penalties $\ell_1$ and $\ell_2$. In a two round game the analysis in [AL15] shows that the adversary should use penalties $(1,0)$ in the first round and $(0,1)$ in the second round (i.e. to put all the available penalty to the first option in the first round, and to the second option in the second round), or vice versa. By using this strategy the adversary will inflict on the player a total loss equal to $1/2+1/(1+\beta)$. So far the optimal penalties are binary.

However, in a three round game the optimal policy for the adversary is $\ell^0=(3/4,1/4)$, $\ell^1=(0,1)$, $\ell^2=(1,0)$, or vice versa, i.e. $\ell^0=(1/4,3/4)$, $\ell^1=(1,0)$, $\ell^2=(0,1)$. Both choices give a total loss equal to $1/2+2/(1+\sqrt{\beta})$. This time the first round penalties are fractional, while the 2nd and 3rd round penalties remain binary as before. This example shows (a) that the optimal penalties can be fractional and (b) that the optimal penalties of any round depend on the total number of rounds.

###### Example 2

Periodic extension of the previous game: At this point let us make a few observations on the possible extension of the aforementioned two and three round schemes, which can serve as an introduction to periodic schemes. In Table 1 we show the loss that results from the periodic extension of the two round scheme, i.e. from the cyclic repetition of the penalties chosen by the adversary in the first two rounds. We present the next two rounds only, but the scheme could be cyclically extended to an arbitrary number of rounds. If this game is repeated for an even number of rounds, the average loss per round is equal to

$$\frac{1}{2}\left(\frac{1}{2}+\frac{1}{1+\beta}\right)$$

Note that the choice of penalties could also be the outcome of a greedy algorithm for the adversary. Next, in Table 2, we give the periodic extension of the second and third round of the three round example, i.e. we exclude the very first round from the periodic scheme. Note, again, that this could be a greedy scheme for the adversary. The loss per round in this second example (excluding the very first round) is

$$\frac{1}{1+\sqrt{\beta}}$$

and it is easy to prove that $\frac{1}{1+\sqrt{\beta}}>\frac{1}{2}\left(\frac{1}{2}+\frac{1}{1+\beta}\right)$ for $0<\beta<1$. Thus, the cyclic strategy of Table 2 is superior to the cyclic strategy of Table 1 (from the adversary's viewpoint). Effectively, by resorting to a fractional penalty in the first round, the adversary is able to improve the long term behavior.
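The superiority of the second cyclic scheme is easy to check numerically; a small sketch comparing the two per round losses over a range of $\beta$ values:

```python
import math

def two_round_avg(beta):
    """Per-round loss of the Table 1 scheme: (1/2)(1/2 + 1/(1+beta))."""
    return 0.5 * (0.5 + 1 / (1 + beta))

def rotation_avg(beta):
    """Per-round loss of the Table 2 scheme: 1/(1 + sqrt(beta))."""
    return 1 / (1 + math.sqrt(beta))

for beta in (0.1, 0.5, 0.9):
    assert rotation_avg(beta) > two_round_avg(beta)
print("the Table 2 scheme dominates for 0 < beta < 1")
```

The margin shrinks as $\beta\to 1$, where both expressions tend to $1/2$; this is consistent with the slower weight adjustment of Hedge for large $\beta$.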

### 3.4 The binary case

Analytic results from short games and numerical results from arbitrary length games have provided strong indications that the optimal penalties (from the adversary’s viewpoint) quickly become binary and remain so for the rest of the game. However, given the assumption that the sum of penalties equals one, an additional binary penalty assumption implies that only one penalty component is equal to one, while all other components remain equal to zero. Effectively, the adversary is able to (maximally) punish exactly one option in each round.

We have shown in [AL14] that a binary penalty assumption results in (a) a greedy optimal penalty scheme for the adversary, which further implies that (b) the optimal scheme ends up being a rotating scheme (for all rounds $t\ge t_0$, for some $t_0$). The optimal scheme for the adversary is to assign a penalty equal to one to the option with the highest bet. In a set of initial rounds this strategy results in a near equalization of the bet distribution (over the options), since in each round the adversary always inflicts a unitary penalty on the option with the maximum weight, effectively driving Hedge to push this option's weight downwards. Due to the quantized nature of the weight adjustments, the bet components (weights) never become exactly equal, but they become almost equal to each other and to $1/N$ within a margin. Then in each round the currently maximum weight is pushed below all other weights. Moreover, in this second phase the greedy adversarial algorithm effectively becomes a rotating scheme.
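The two phase behavior (equalization followed by rotation) can be observed in a small simulation of the greedy binary adversary (a sketch; the initial weights and $\beta$ are arbitrary choices of ours):

```python
def greedy_adversary(w, beta, rounds):
    """Penalize the currently heaviest option with the full unit budget;
    returns the total loss and the sequence of penalized options."""
    total, hits = 0.0, []
    for _ in range(rounds):
        s = sum(w)
        j = max(range(len(w)), key=lambda i: w[i])   # heaviest option
        total += w[j] / s                            # player's round loss
        hits.append(j)
        w[j] *= beta                                 # Hedge pushes it down
    return total, hits

total, hits = greedy_adversary([0.9, 0.05, 0.05], beta=0.5, rounds=12)
print(hits)  # [0, 0, 0, 0, 0, 1, 2, 0, 1, 2, 0, 1]
```

The heavy first option is hammered until the weights nearly equalize; from then on the penalized option rotates cyclically over the $N=3$ options.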

## 4 Rotating schemes

This is a brief section on rotating schemes, which serves as a preparation for the analysis of games that are long enough so as to include a rotational phase in their optimal adversarial scheme.

### 4.1 Periodic penalties will produce periodic weights and periodic losses

While the Hedge algorithm aids the player to adapt the bet mixture so as to avoid the penalties imposed by the adversary, the latter may try to rotate the penalties, so as to make adaptation difficult. In this section we explore the dynamics of Hedge under rotating penalty schemes. Hedge responds to a penalty rotation with weight rotation, as shown by the following lemma, which is valid for an $N$ option game:

###### Lemma 1

Assume a rotating penalty scheme

$$\ell^t_{i+1}=\ell^{t-1}_i\quad(i=1,\ldots,N-1),\qquad \ell^t_1=\ell^{t-1}_N \tag{12}$$

i.e. option $i+1$ at time $t$ inherits the penalty of the previous option $i$ in the previous round $t-1$. Then the response of Hedge generates a loss in each round that is also periodic, i.e. it repeats itself after $N$ rounds.

Note that (12) implies that penalties repeat themselves after $N$ rounds:

$$\ell^{N+t}_i=\ell^{t}_i\quad(i=1,\ldots,N)\quad(t=0,\ldots,T-N) \tag{13}$$

and in general

$$\ell^{\tau}_i=\ell^{0}_{(\tau+i)\bmod N} \tag{14}$$

The proof of the above lemma is straightforward: Since

$$w^{t+N}_i=w^t_i\beta^{\sum_{k=0}^{N-1}\ell^{t+k}_i}=w^t_i\beta^{\sum_{k=0}^{N-1}\ell^0_{(t+i+k)\bmod N}}=w^t_i\beta^{\sum_{k=0}^{N-1}\ell^0_k}=w^t_i\beta^{\Lambda}$$

where $\Lambda$ is the sum of all the penalties in the initial or, due to the rotation, in any other round; and due to the normalization

$$p^{t+N}_i=p^t_i$$

In fact, a periodic scheme would prevent Hedge from focusing on a single option. In Example 9 (see Appendix A) the net loss performance is calculated for binary penalties.
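Lemma 1 is easy to verify numerically; the sketch below rotates an arbitrary penalty row over $N=3$ options (the numerical values are assumptions of ours) and checks that the bet distribution has period $N$:

```python
def normalize(w):
    s = sum(w)
    return [wi / s for wi in w]

N, beta = 3, 0.7
ell0 = [0.6, 0.3, 0.1]                 # penalty row of round 0, sum <= 1
w, bets = [0.2, 0.5, 0.3], []
for t in range(N + 1):
    bets.append(normalize(w))
    ell = [ell0[(t + i) % N] for i in range(N)]   # rotated penalties
    w = [wi * beta ** li for wi, li in zip(w, ell)]

# Over a full cycle every weight is multiplied by beta**Lambda, so the
# normalized bets repeat: p^{t+N} = p^t.
assert all(abs(a - b) < 1e-12 for a, b in zip(bets[0], bets[N]))
print("bets are periodic with period N")
```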

### 4.2 Periodic weights for maximum loss

As explained above, the adversary may use proper penalties with an aim to drive Hedge towards weights that will maximize loss in the rotation phase. Here we calculate the optimal weights without exploring how the adversary might possibly force Hedge to adopt these weights. The latter issue is examined later in this paper.

We simplify our calculations by exploring a two option game. Let $L_p$ denote the total loss per cycle, i.e. the cumulative loss in two consecutive rounds. Assuming that periodic behavior will start at some round $t_0$, which we set equal to 0 without loss of generality, let $(w,1-w)$ be the pair of weights at $t=0$. Assuming (also without loss of generality) that $w\ge 1/2$ (which also implies that $w\ge 1-w$), the adversary will choose penalties $(1,0)$, therefore the new pair of weights will be

$$\left(\frac{w\beta}{w\beta+1-w},\;\frac{1-w}{w\beta+1-w}\right)$$

In the next round the adversary will choose penalties $(0,1)$, and the new pair of weights will be $(w,1-w)$ again. The cumulative gain of the adversary (loss of the player) in both rounds will be

$$L_p(w)=w+\frac{1-w}{w\beta+1-w}$$

which is maximized w.r.t. $w$ for

$$w=\frac{1}{1+\sqrt{\beta}}$$

i.e. for the pair of weights

$$\left(\frac{1}{1+\sqrt{\beta}},\;\frac{\sqrt{\beta}}{1+\sqrt{\beta}}\right) \tag{15}$$

By imposing a pair of penalties $(1,0)$ on (15), the pair of weights

$$\left(\frac{\sqrt{\beta}}{1+\sqrt{\beta}},\;\frac{1}{1+\sqrt{\beta}}\right)$$

will appear in the next round, and again the adversary will return to the original pair by using penalties $(0,1)$. The cumulative loss in these two rounds will be

$$L_p=\frac{2}{1+\sqrt{\beta}}$$

which for the adversary amounts to a constant gain of $1/(1+\sqrt{\beta})$ per round.

The pair (15) is for the adversary an attractive pair of weights given a sufficient time horizon $T$, so that this rotational steady state can be reached. However, these values can in general be achieved by the adversary only by using fractional (i.e. non binary) penalties in at least one round.

A straightforward generalization for $N$ options gives the optimal weights as

$$w^{*}_i=\frac{(1-\beta^{1/N})\,\beta^{\frac{i-1}{N}}}{1-\beta},\quad i=1,\ldots,N$$

and the loss per period of $N$ rounds is equal to

$$\frac{N(1-\beta^{1/N})}{1-\beta}$$
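The $N$ option generalization is consistent with the two option analysis above; a quick numerical check (the value of $\beta$ is arbitrary):

```python
import math

def rotation_weights(N, beta):
    """Candidate optimal weights w*_i = (1 - beta^(1/N)) beta^((i-1)/N) / (1 - beta)."""
    r = beta ** (1 / N)
    return [(1 - r) * r ** (i - 1) / (1 - beta) for i in range(1, N + 1)]

def loss_per_cycle(N, beta):
    """Loss per period of N rounds: N (1 - beta^(1/N)) / (1 - beta)."""
    return N * (1 - beta ** (1 / N)) / (1 - beta)

beta = 0.25
w = rotation_weights(2, beta)
assert abs(sum(w) - 1) < 1e-12                        # a valid distribution
assert abs(w[0] - 1 / (1 + math.sqrt(beta))) < 1e-12  # matches (15) for N = 2
assert abs(loss_per_cycle(2, beta) - 2 / (1 + math.sqrt(beta))) < 1e-12
print("consistent with the two-option analysis")
```

The weights form a geometric progression with ratio $\beta^{1/N}$, which telescopes to a total of one, and for $N=2$ the loss per cycle reduces to $2/(1+\sqrt{\beta})$.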

## 5 Solutions with binary penalties

It is possible to find a suboptimal solution by constraining the solution to binary penalties and using a greedy algorithm. In this section we summarize certain results of previous work [AL14].

First, we reiterate the main idea in the analysis of binary penalties: If the current weights are $(w,1-w)$, a pair of penalties $(1,0)$ would transform the weights to

$$\left(\frac{w\beta}{w\beta+1-w},\;\frac{1-w}{w\beta+1-w}\right),$$

while penalties $(0,1)$ would produce weights equal to

$$\left(\frac{w}{w+(1-w)\beta},\;\frac{(1-w)\beta}{w+(1-w)\beta}\right)=\left(\frac{w\beta^{-1}}{w\beta^{-1}+1-w},\;\frac{1-w}{w\beta^{-1}+1-w}\right)$$

Let us define

$$f(w,x)\equiv\frac{w\beta^{x}}{w\beta^{x}+1-w}, \tag{16}$$

but we drop $w$ whenever it is obvious and write $f(x)$. By using this notation, in one round the first option weight moves from $f(x)$ either to $f(x+1)$, or to $f(x-1)$. In $n$ rounds, if the number of first option penalties that are equal to 1 is $k$, the weights before normalization are $(w\beta^{k},(1-w)\beta^{n-k})$, therefore the first option weight is

$$\frac{w\beta^{k}}{w\beta^{k}+(1-w)\beta^{n-k}}=\frac{w\beta^{2k-n}}{w\beta^{2k-n}+1-w}$$

Effectively, each new round brings a move on $f$ such that the argument $x$ increases or decreases by one (penalty) unit. In each move the adversary has the option to increase the player's loss (i.e. the adversary's gain) by $f(x)$ by moving to $f(x+1)$, or increase it by $1-f(x)$ and move to $f(x-1)$. The rest of the analysis in [AL14] explains why this walk on the curve can be optimal for the adversary by making greedy choices (i.e. move to $f(x+1)$ if $f(x)\ge 1/2$, otherwise move to $f(x-1)$). The main idea in this proof is that each forward move (towards $f(x+1)$) is for the adversary a high gain move if $f(x)\ge 1/2$, but it decreases the weight, thus it also decreases the possible gain in the next moves. However, we have proven that any attempt of the adversary to "invest in the future", by currently accepting a lower gain (equal to $1-f(x)$ instead of $f(x)$) and by moving backwards, will never be able to produce adequate future gains that will justify the current sacrifice. This result is summarized in the following lemma:

###### Lemma 2

The optimal solution with binary penalties is greedy.
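The walk on $f$ and the greedy rule can be sketched as follows; for two rounds from equal weights it reproduces the total loss $1/2+1/(1+\beta)$ of the two round scheme (our computation; the value of $\beta$ is arbitrary):

```python
def f(w, beta, x):
    """Eq. (16): first-option weight after net first-option penalty x."""
    a = w * beta ** x
    return a / (a + 1 - w)

def greedy_walk(w, beta, rounds):
    """Move to f(x+1) (gain f(x)) when f(x) >= 1/2, else to f(x-1) (gain 1-f(x))."""
    x, total = 0, 0.0
    for _ in range(rounds):
        cur = f(w, beta, x)
        if cur >= 0.5:
            total, x = total + cur, x + 1        # penalize option 1
        else:
            total, x = total + (1 - cur), x - 1  # penalize option 2
    return total

print(round(greedy_walk(0.5, 0.25, 2), 3))  # 1.3 = 1/2 + 1/(1 + 0.25)
```

From equal weights the walk simply alternates around $x=0$, which is exactly the rotating behavior described in Section 4.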

## 6 How to calculate the optimal penalties

### 6.1 Introductory remarks and an example

Firstly we give some numerical results using expressions (9), (10), and (11).

###### Example 3

We set values for $\beta$ and $T$ and produce the total cumulative loss numerically by using formulas (9), (10), and (11). We have quantized the possible values of the (first option) initial weight $w$ to values equal to $k\cdot 10^{-4}$, where $k=0,1,\ldots,10^4$; therefore the total cumulative losses, the penalties etc. as functions of $w$ are vectors of 10001 components. The resulting total cumulative loss vs $w$ is shown in Fig. 2, while Fig. 3 shows the (first option) penalties of different rounds that have led to this result. Let us summarize a few initial observations:

1. As expected, there is an even symmetry for the total cumulative loss, i.e. $L_{\max}(w)=L_{\max}(1-w)$, and an odd symmetry for the associated penalties, i.e. $\lambda^t_1(w)=1-\lambda^t_1(1-w)$.

2. Non binary penalty values occur mostly in the first round. There is also an occurrence in the second round, but not in both rounds in the same game. The non binary values are limited to distinct small areas of $w$ in the first round. Both the lengths of these areas and the aberrations from the binary values are quite limited. Some of these areas have been marked with vertical lines at both ends.

3. For $w$ very close to 0 or 1 all (first option) penalties are equal to 0 or 1 respectively, while for $w$ very close to $1/2$ there is a pattern of alternating binary penalties. Clearly, there is a first phase of high gains per round, in which the adversary tries to take maximum advantage of the largest weight by always penalizing the same option, and a second phase, in which the weights oscillate between two values that are close to $1/2$. Therefore, in this second phase the adversary cannot extract an average per round gain significantly higher than $1/2$. The length of the first phase depends on the initial weight (and on $\beta$). The closer the value of $w$ is to $1/2$, the sooner the game enters the phase of rotation.

We name the first phase “the transitional phase”, and the second phase “the rotational phase”, and we define the precise limit between the two phases later on.

Assuming a weight $w$ (for the first option) at some round, if the adversary chooses a penalty $x$ for the first option (and $1-x$ for the second option), the player's loss in this round will be equal to $wx+(1-w)(1-x)$, and the new (first option) weight that will appear in the next round is equal to

$$\frac{w\beta^{x}}{w\beta^{x}+(1-w)\beta^{1-x}}=\frac{w\beta^{2x-1}}{w\beta^{2x-1}+1-w}$$

In each round the adversary can choose any value of $x$ between 0 and 1, but (assuming that $0<\beta<1$) the new weight will be smaller or greater than $w$ depending on whether $x>1/2$ or $x<1/2$ respectively, and will remain the same if $x=1/2$.

In general, if the sum of first option penalties is equal to $x$ in the next $n$ rounds, the pre-normalized weights will be $(w\beta^{x},(1-w)\beta^{n-x})$, and the final first option weight will be

$$\frac{w\beta^{x}}{w\beta^{x}+(1-w)\beta^{n-x}}=\frac{w\beta^{2x-n}}{w\beta^{2x-n}+1-w}=f(2x-n),$$

where $f$ is defined by (16).
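The identity above (the final weight depends on the penalties only through $2x-n$) can be checked directly (the sample values are arbitrary):

```python
def f(w, beta, x):
    """Eq. (16): f(w, x) = w beta^x / (w beta^x + 1 - w)."""
    a = w * beta ** x
    return a / (a + 1 - w)

# n rounds with total first-option penalty x (the second option gets n - x):
w, beta, n, x = 0.3, 0.6, 5, 1.75
direct = w * beta ** x / (w * beta ** x + (1 - w) * beta ** (n - x))
assert abs(direct - f(w, beta, 2 * x - n)) < 1e-12
print("the weight depends only on 2x - n")
```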

### 6.2 The transitional phase

We use the above observation to prove some useful minor theorems. The first one describes a situation in which the initial weights are unequal, and their relative size cannot be reversed in the remaining rounds, i.e. the first option weight stays above 1/2 even if the first option is maximally penalized in every round. Under this assumption we show that the optimal strategy of the adversary is the greedy strategy, i.e. to set all first option penalties to 1, instead of using any fractional penalties. First, we give the following lemma:

###### Lemma 3

If 0 < β < 1, 0 < ε < 1, the initial weight w is such that f(n+1) > 1/2, and n ≥ 1, then for f (as defined by (16)) the following inequality is valid:

$$(1-\epsilon)f(0)+\epsilon[1-f(0)]+\sum_{k=1}^{n}f(k-2\epsilon)<\sum_{k=0}^{n}f(k).$$

Effectively, in a game of n+1 rounds starting with a weight such that f(n+1) > 1/2 (i.e. the number of rounds is not sufficient to reverse the balance of weights between the two options, even if the first option is always maximally penalized), the optimal policy of the adversary is the greedy one, i.e. to use unitary penalties in all rounds. Although the first part of this lemma strictly states that a non-binary penalty should not be chosen in the initial round, it can also be used to exclude a non-binary step in any intermediate round, if we assume that the game starts exactly at this round. In addition, it will be shown in a later lemma that the earlier a “sacrifice” (i.e. a deviation from the greedy policy, which gives a maximum short term gain to the adversary) is made, the more effective it is for the adversary. Therefore any sacrifice should be undertaken in the very first round. The above lemma can be proved by induction; we have omitted the proof.
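The lemma can be illustrated numerically. The sketch below compares the greedy total loss with the first-round-sacrifice total for one choice of parameters satisfying the lemma's assumption; w = 0.9, β = 0.9, n = 3 and ε = 0.3 are illustrative values only, not values from the paper.

```python
def f(z, w, beta):
    # f as defined by (16): first option weight after a cumulative shift z
    return w * beta**z / (w * beta**z + 1 - w)

w, beta, n, eps = 0.9, 0.9, 3, 0.3     # illustrative; here f(n+1) > 1/2 holds

# greedy: unit penalties in all n+1 rounds
greedy = sum(f(k, w, beta) for k in range(n + 1))

# sacrifice: fractional penalty 1 - eps in round one, unit penalties afterwards
sacrifice = (1 - eps) * f(0, w, beta) + eps * (1 - f(0, w, beta)) \
    + sum(f(k - 2 * eps, w, beta) for k in range(1, n + 1))
# the greedy strategy yields the higher total loss
```

The later-round gains f(k−2ε) − f(k) are small compared with the first-round loss of the sacrifice, which is the content of the lemma.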

We shall give an interpretation of the above lemma and a few additional details. Consider Fig. 4.

The initial weights are w and 1−w. If the adversary implements the greedy strategy continuously for the next rounds, the first option weights are f(0), f(1), f(2), …, where f(0) = w and f is given by (16). Therefore, under the greedy strategy the first option weights are the white dots, and the player’s total loss (in n+1 rounds) is the sum of f(0), …, f(n). On the other hand, the adversary could choose to make a sacrifice in the first round, so as to increase the loss in the remaining rounds, as shown by the black dots. Effectively, Lemma 3 states that the sacrifice in the first round cannot be counterbalanced by the improved weights in all subsequent rounds. Note, however, that Lemma 3 holds if the initial weight is such that f(n+1) > 1/2, i.e. if the weight that would appear in an (n+2)’th round (that does not exist) would still be above 1/2.

If the game is long enough for the first option weight to fall below 1/2, the area in which rotational schemes are possible is reachable by the adversary by continuously using penalties equal to 1 until the weight of option 1 falls below 1/2. In the rest of the game the adversary could use an optimized rotation scheme. In order to reach this optimized rotation the adversary may need to make a fractional penalty sacrifice at some point. Assuming that such a non-greedy step should be taken, the question of its timing emerges. Should the non-greedy step be taken just before entering rotation or perhaps earlier? A lemma that will soon be introduced states that this step should be executed as early as possible.

Before dealing with the problem of optimally entering the rotation phase, we shall briefly examine a marginal case, in which the weight of the final round gets so close to 1/2 that it would fall below 1/2 if an additional round existed. Effectively, this implies for the initial weight that f(n) > 1/2, but f(n+1) < 1/2, therefore Lemma 3 would not be valid. The optimal policy of the adversary is to make a non-binary adjustment in the first round, which will improve the gains of the greedy moves in all subsequent rounds. Therefore, in this marginal situation it can be proved that the optimal set of penalties is (1−ε, 1, …, 1). The correct value of ε can be found by direct optimization of the resulting single-variable objective function for the total cumulative loss, and it approaches zero as the initial weight moves towards the upper end of the marginal interval. In fact, extensive numerical tests provide strong indication that there is a marginal value of the initial weight such that the optimal ε is zero above it, becomes non-zero below it, and grows linearly as the initial weight approaches the lower end of this interval. Numerical tests also show that the largest optimal ε remains small regardless of the other parameters of the game. Effectively this means that, given an initial weight in this marginal interval, the adversary will need to maximize the total cumulative loss w.r.t. ε, and the expected outcome lies between zero and this small maximal value. However, the analysis given in the above few lines is probably too detailed, and all that is necessary is the following lemma, which is given without proof:
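The direct single-variable optimization mentioned above can be sketched as a crude grid search. The parameters below (w = 0.59, β = 0.9, n = 3) are illustrative values for which f(n) > 1/2 but f(n+1) < 1/2, not values used in the paper.

```python
def f(z, w, beta):
    # f as defined by (16)
    return w * beta**z / (w * beta**z + 1 - w)

def total_loss(eps, w, beta, n):
    # penalty set (1 - eps, 1, ..., 1): fractional first round, then n unit rounds
    loss = (1 - eps) * f(0, w, beta) + eps * (1 - f(0, w, beta))
    return loss + sum(f(k - 2 * eps, w, beta) for k in range(1, n + 1))

w, beta, n = 0.59, 0.9, 3         # marginal case: f(3) > 1/2 > f(4)
grid = [e / 1000 for e in range(501)]
best_eps = max(grid, key=lambda e: total_loss(e, w, beta, n))
```

In practice a finer grid (or a standard scalar optimizer) would be used; the point is only that the objective is a simple function of the single variable ε.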

###### Lemma 4

If the initial weight is such that f(n) > 1/2 but f(n+1) < 1/2, there is an ε with 0 < ε < 1 for which the optimal policy of the adversary is achieved by setting the first round penalty equal to 1−ε, and unit penalties in the remaining rounds.

Effectively, if the assumptions of Lemma 4 are true, the inequality (17) may be reversed for some ε. Note, however, that the weight that enters the last round is still above 1/2, therefore Lemma 3 and Lemma 4 are separated by a very thin line. Both lemmas assume that the (first option) weight will never fall below 1/2 at any round, but Lemma 4 states that special care is needed in the first round if the final round weight is likely to come very close to 1/2.

The last lemma of this section states (as already promised) that if a non-binary penalty must be used, it should appear as early as possible. The lemma compares two scenarios: in the first scenario the fractional step precedes a sequence of successive unit penalty steps, while in the second scenario the fractional step follows the unit steps. The loss in the first scenario (with an analysis similar to the analysis of Lemma 3) is f(0)(1−ε)+(1−f(0))ε in the first round, and f(k−ε) (k = 1, …, n) in the following rounds. This gives a total loss equal to

$$f(0)(1-\epsilon)+(1-f(0))\epsilon+\sum_{k=1}^{n}f(k-\epsilon).$$

In the second scenario the fractional penalty is used in the last round, thereby giving a total loss equal to

$$\sum_{k=0}^{n-1}f(k)+f(n)(1-\epsilon)+(1-f(n))\epsilon.$$

The proof of the lemma can be produced by induction, starting from the case of two rounds.

###### Lemma 5

If 0 < β < 1, 0 < ε < 1, the initial weight w is such that the first option weight remains above 1/2 throughout the game, and n ≥ 1, then for f (as defined by (16)) the following inequality is valid:

$$\sum_{k=0}^{n-1}f(k)+f(n)(1-\epsilon)+(1-f(n))\epsilon<f(0)(1-\epsilon)+(1-f(0))\epsilon+\sum_{k=1}^{n}f(k-\epsilon).$$

Note that both scenarios of Lemma 5 produce the same final weight; this is not true for the previous lemmas in this section. The final weight is greater than 1/2 due to the assumptions of the lemma. The final weight could serve as a target weight; for example it could be set equal to the ideal weight or some other suitable value. Lemma 5 states that the non-binary penalty step should be taken as early as possible.
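The early-versus-late comparison can also be checked by directly simulating the weight updates. In the sketch below (illustrative parameters w = 0.8, β = 0.7, ε = 0.3, three rounds, none taken from the paper) both orderings end at the same final weight, while the early fractional step yields the higher total loss.

```python
def hedge_round(w, x, beta):
    # one Hedge round with penalty pair (x, 1 - x)
    loss = w * x + (1 - w) * (1 - x)
    w = w * beta**x / (w * beta**x + (1 - w) * beta**(1 - x))
    return loss, w

def play(w, penalties, beta):
    # total player loss and final first option weight over a penalty sequence
    total = 0.0
    for x in penalties:
        loss, w = hedge_round(w, x, beta)
        total += loss
    return total, w

w0, beta, eps = 0.8, 0.7, 0.3
early, w_early = play(w0, [1 - eps, 1, 1], beta)   # fractional step first
late,  w_late  = play(w0, [1, 1, 1 - eps], beta)   # fractional step last
```

The final weights agree because the update only depends on the sum of the penalties, so the two orderings are directly comparable.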

We are now about to examine games that are long enough to produce (first option) weights lower than 1/2. Before stating any additional lemmas we define the “intersection area”: this is an area of weights around 1/2, which includes any weight from which 1/2 is reachable in a single round. Effectively this implies a penalty equal to 1, if the starting weight is greater than 1/2, or a penalty equal to 0, if the starting weight is lower than 1/2. Therefore the upper limit of this area is a weight w such that wβ/(wβ+1−w) = 1/2, which gives w = 1/(1+β), and the lower limit is a weight w such that w/(w+(1−w)β) = 1/2, which gives w = β/(1+β). Thus the “intersection area” is the interval [β/(1+β), 1/(1+β)]. This area obviously includes the optimal rotational weights, which are reachable from one another with binary penalties.
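The two limits can be verified directly from the update rule; β = 0.6 below is an arbitrary illustrative value.

```python
beta = 0.6
upper = 1 / (1 + beta)       # a unit penalty maps `upper` exactly to 1/2
lower = beta / (1 + beta)    # a zero penalty maps `lower` exactly to 1/2

def hedge_round_weight(w, x, beta):
    # first option weight after one round with penalty pair (x, 1 - x)
    return w * beta**x / (w * beta**x + (1 - w) * beta**(1 - x))
```

Note that the interval is symmetric around 1/2, since 1/(1+β) − 1/2 = 1/2 − β/(1+β).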

### 6.3 Entering the rotational phase

If f(n+1) > 1/2, the adversary can never reach a weight below 1/2 in the game. According to the previous lemmas, in this case the optimal policy of the adversary is to use a sequence of penalties equal to 1. If, however, this condition holds only marginally, so that f(n) > 1/2 but f(n+1) < 1/2, the adversary might benefit from using a non-binary first round penalty. Finally, if the game is longer, a penalty sequence equal to 1 will eventually bring the weight below 1/2. In the rest of this paper we shall see that the adversary’s optimal plan is roughly to bring the current weight as fast as possible from the initial weight to a value close to 1/2, and then produce weights that rotate around 1/2 (by using a sequence of alternating 0 and 1 penalties). However, as the devil is in the details, we shall also see that there is the issue of optimally approaching the most suitable rotating weights, while both the optimal approach and the most suitable weight values depend on the total length of the game. All these issues will be explored by using a number of lemmas.
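The rotational phase itself is easy to simulate: starting from the upper limit of the intersection area and alternating penalties 1 and 0 makes the weight cycle between two values, with an average per-round gain only modestly above 1/2 (β = 0.6 is an arbitrary illustrative choice, and the starting weight is a convenient rotational pair, not necessarily the optimal one).

```python
def hedge_round(w, x, beta):
    # one Hedge round with penalty pair (x, 1 - x)
    loss = w * x + (1 - w) * (1 - x)
    w = w * beta**x / (w * beta**x + (1 - w) * beta**(1 - x))
    return loss, w

beta = 0.6
w = 1 / (1 + beta)               # upper limit of the intersection area
losses = []
for k in range(10):              # alternating binary penalties 1, 0, 1, 0, ...
    loss, w = hedge_round(w, 1 if k % 2 == 0 else 0, beta)
    losses.append(loss)
avg = sum(losses) / len(losses)  # modestly above 1/2, never much more
```

Here the weight cycles between 1/(1+β) and 1/2, and the losses alternate accordingly, which is the per-round-gain ceiling described above.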

The optimal weights are not necessarily those given by (15). Assume that the latest weight w achieved by a so far greedy adversary is about to enter the intersection area, in the sense that in two rounds the weight could plunge below 1/2, although one round is not enough to reach 1/2. Suppose also that somehow the adversary has computed an ideal weight w* for the future, and intends to approach w* as soon as possible. For example, under certain conditions the ideal weight could be equal to an optimal rotational weight. If by any chance a penalty equal to 1 leads exactly to w*, then the adversary can (still be considered “lucky” and) set the next penalty equal to 1 and reach w* in one round. If a penalty equal to 1 overshoots w* only slightly, the adversary can still consider himself “lucky”: he can use a penalty 1−ε for some small suitable ε, and reach the desired weight by sacrificing a small part of the gains. If, however, neither is the case, the weight w* is unreachable in just one round. The adversary could possibly use fractional penalties in the next two rounds in order to reach w*, but a clearly fractional penalty implies a significant sacrifice (as compared to a penalty equal to 1 applied on the greatest weight). A better solution is to let the first option weight fall below 1/2 in the next rounds, and then approach the target weight from below by penalizing the now stronger second option weight. The next two lemmas state the appropriate policy when the adversary is about to approach a target weight w*.

###### Lemma 6

Assuming that a game starts with w > 1/2, that w* is a target weight in the intersection area with w* > 1/2, that the adversary’s intention is to achieve w* within two rounds, and that the penalties x and (x₁, x₂) lead to w* after two rounds in their respective scenarios, then the adversary should prefer the penalty vector (x, 0) over the (less greedy) penalty vector (x₁, x₂). In other words

$$wx+(1-w)(1-x)+\frac{(1-w)\beta^{1-x}}{w\beta^x+(1-w)\beta^{1-x}}>wx_1+(1-w)(1-x_1)+x_2\frac{w\beta^{x_1}}{w\beta^{x_1}+(1-w)\beta^{1-x_1}}+(1-x_2)\frac{(1-w)\beta^{1-x_1}}{w\beta^{x_1}+(1-w)\beta^{1-x_1}}.\quad\square$$

This situation is shown in Fig. 5a. Both weights w and w* are above 1/2, and both are in the intersection area. The lemma states that the adversary should go for the maximum possible penalties in each round; this implies fully penalizing the (by then stronger) second option in the second round, and the maximum possible penalty in the first round, such that w* will be achieved after two rounds. The proof is trivial, since in both rounds the preferred policy brings a higher gain than the alternative policy. Next comes the lemma complementary to the previous one, for target weights below 1/2, as shown in Fig. 5b.

###### Lemma 7

Assuming that a game starts with w > 1/2, that w* is a target weight in the intersection area with w* < 1/2, that the adversary’s intention is to achieve w* within two rounds, and that the penalties x and (x₁, x₂) lead to w* after two rounds in their respective scenarios, then the adversary should prefer the penalty vector (1, x) to the (less greedy) penalty vector (x₁, x₂). In other words

$$w+x\frac{w\beta}{w\beta+1-w}+(1-x)\frac{1-w}{w\beta+1-w}>wx_1+(1-w)(1-x_1)+x_2\frac{w\beta^{x_1}}{w\beta^{x_1}+(1-w)\beta^{1-x_1}}+(1-x_2)\frac{(1-w)\beta^{1-x_1}}{w\beta^{x_1}+(1-w)\beta^{1-x_1}}.\quad\square$$

However, the value of the aforementioned sacrifice (or investment) also has to be weighed against the expected gains from the target weights, whose values depend on the number of remaining rounds. The more rounds remain, the better justified the approach to the ideal rotational weights is.

###### Example 4

Assume a given β and a very large number of rounds. In this case the adversary would pursue the ideal pair of weights for the rotational phase. The penalty brings the weight to