## I Introduction

Markov decision processes (MDPs) represent a mathematical framework for dealing with decision problems in many fields. It is usually hard, however, to predict how changes in the underlying MDP might affect the performance of a given policy, especially if changes happen online. For example, when the observation model is not accurate due to corrupt/noisy observations from damaged sensors, or the dynamics model becomes suddenly inaccurate due to underlying changes of the system, it is challenging to know whether the current policy will perform in a satisfactory manner. Trying to re-learn a policy online using reinforcement learning may be an option, even though the large amount of data required to update the parameters of the model might be prohibitive

[Minh_2013, Schulman_15_1, Schulman_15_2]. Other ways to circumvent this problem may be to use noise at training time [basu2017learning, Dodge2016] or use some filtering procedure [duan2010highly, jassim2013image]. In this work we address this problem through the use of multi-armed bandits.The multi-armed bandit problem represents one the most simple settings to study the exploration vs. exploitation trade-off. The literature on multi-armed bandits is rich and contains a myriad of strategies for dealing with the problem, along with guarantees of performance in the form of regret bounds [lai1985, Robbins1952, abbasi2011, Bianchi2012, jin2018qlearning, TorCsabaBandits]. In this work, we show that the task of choosing among a set of expert policies during the execution of a MDP, can be likened to a multi-armed bandit problem. In contrast to classical multi-armed bandits, there is strong coupling between the states and the rewards of the MDP, so one cannot freely ‘switch’ between expert policies without repercussions.

The primary focus for this paper will be to present the experimental results of the algorithm on a low-dimensional and a high-dimensional MDP setting, while deferring most of the mathematical details to [Mazumdar2017]. For the low-dimensional scenario, we will use a gridworld environment where different agents are trained under various dynamics models to reach a set of high-reward states. For the high-dimensional case, we will use the Seaquest video game environment, shown in Fig. 1, where each expert policy is trained under different observation models. Unlike the gridworld, agents trained in this environment receive state information through raw images from the game. Despite the high-dimensional observations, the multi-armed bandit algorithm is capable of selecting the best expert from our set at run-time to minimize regret and outperform experts specifically trained to be robust.

## Ii Background

The multi-armed bandit problem was introduced in [Robbins1952], and has been an active field of research since. In [lai1985] the classical Upper Confidence Bound (UCB) algorithm was introduced as a solution with logarithmically growing regret. More recently, multi-armed bandit (MAB) algorithms and the notion of regret have been used for the control of systems with dynamics. For example, in [Kakade2004] a MAB formulation was used to train reinforcement learning agents, and [Auer2009] constructed bounds on the regret of RL algorithms. Many other algorithms have guarantees on the growth of the regret; for an overview of some of the work see [abbasi2011, Bianchi2012].

In [Cesa-Bianchi2006], an overview is given of various bandit approaches for choosing an expert from a set of experts online under many different assumptions on the underlying structure. The idea of using pre-trained experts for control has also been investigated in various settings. [Hamrick2017] used the notion of experts for meta-control of MDPs when there is a fixed computational cost and [Laskey2015] used experts in conjunction with recommendation algorithms for choosing which actions to take in an uncertain environment.

The idea of expert selection in MDPs is also similar to the idea of restless bandits, where the underlying distribution of each arm evolves according to separate Markov processes that are independent of the pulls [Bubeck2012]. Recent work has been conducted in deriving bounds on the regret of varying algorithms including the UCB algorithm in this setting [Ortner2012, Tekin2012]. The critical difference in this case lies in the fact that in our setting, each pull of an arm (equivalently choice of an expert) effects the underlying dynamics which are also shared across all arms. Another related line of work seeks to use multi-armed bandit-inspired algorithms for reinforcement learning problems, such as the Q-Learning UCB approach analyzed in [jin2018qlearning].

Thus, while MAB approaches for expert selection in MDPs tend to focus on isolated expert executions as the metric of performance for each bandit, in this work we define a sequence of time horizons and equate the action of pulling the arm of one of the bandits, to executing a specific expert on the MDP for steps. In our work we devise a version of the UCB algorithm and show that this online expert selection paradigm, despite violating the assumption of independence between successive pulls, can be seen as an “almost bandit” problem. We then proceed to test the algorithm on a low-dimensional gridworld environment and a high-dimensional video game environment.

## Iii Problem Formulation and Analysis

In our formulation, the controller is an algorithm that selects an expert policy to execute during the evolution of the underlying MDP. The chosen expert is fed observations of the state of the system, and outputs a distribution over the actions to take. The controller then samples the action, takes the action, and receives a reward from the environment.

The controller is agnostic to the true state of the environment as well as the underlying dynamics. The only information available to the controller is the history of rewards it has received for each expert. This problem lends itself to a multi-armed bandit formulation, although, as we will shortly see, it is significantly different from a classical multi-armed bandit problem.

In this section, we introduce an Upper Confidence Bound (UCB) algorithm applied to expert selection in MDPs. We then provide some intuition for the algorithm and discuss its performance in terms of regret.

### Iii-a Setup

Before discussing the particular algorithm used to solve the problem, we introduce some notation. The environment of interest is modeled as an MDP, which is represented as a 6-tuple , where denotes the state space, , the action space, the state transition kernel, , the reward kernel, the discount factor, and the initial state distribution. Note that for each action , and , where

denotes the set of all probability distributions over the set

. Further, we assume that there is an unknown observation kernel that, for a given state, returns a probability distribution over the observation space

. Lastly, we are given a set of experts , where .### Iii-B Upper confidence bounds in MDPs

In this section we provide the upper confidence bounds for expert selection in MDPs. For further details regarding the derivations we direct the reader to [Mazumdar2017].

At each round the controller chooses an expert and follows its policy for an episode of length . The controller then observes the average reward over the episode. Letting , we define the average reward for expert over the

-th episode as the random variable given by:

(1) |

where . The controller keeps track of the average across all the times an expert has been chosen, which for simplicity we denote as . Thus, if the algorithm has been run for rounds and expert has been chosen times, is given by

(2) |

where is the indicator function. After having chosen an expert , times, can be shown to be close to

(3) |

which represents the average reward under the stationary distribution for expert . Given these definitions, the following relation holds between and :

(4) |

Here, the constant depends on the expert’s policy and the underlying dynamics of the MDP, and is a chosen parameter. We note that adds a constant offset to each experts confidence bound, and as such, this term can be ignored when choosing between confidence bounds. We introduce the following notation to denote this approximate confidence bound for expert :

(5) |

When we implement our algorithm, this upper confidence bound will be treated as if it were the expected reward; this gives rise to an optimism-based approach.

### Iii-C Modified upper confidence bound algorithm

The classic UCB algorithm was introduced in [lai1985] as a solution to the famous multi-armed bandit problem. Here we introduce the UCB algorithm variant for expert selection in MDPs.

Algorithm 1 is initialized with the confidence level , a sequence of episode lengths , and a set of expert policies . The expert policies are pre-trained and considered to be optimized for different settings of the environment. We note, that all policies must map from the same observation space to the action space of the MDP. These policies can be optimized for different noise conditions (modeled as different observation kernels), different tasks (modeled as different reward kernels), or even different dynamics (modeled as different state transition kernels).

We begin by initializing the confidence bound of each expert to infinity to ensure that each expert is chosen at least once. The algorithm works by choosing the expert policy for which it believes it can receive the highest steady-state reward, which as per the optimism-based approach will be the expert with the highest upper confidence bound. At each iteration , the algorithm chooses an expert policy to execute for time steps. It collects the reward across the

steps, updates its estimate of the expert’s steady-state reward, and then updates the confidence bound of the expert,

accordingly. The confidence bound is the maximum distance that the estimate can be from the true value of the average steady-state reward of the expert. This bound holds with probability at least . Thus in practice, is often chosen to be small.As the algorithm tries out different experts and gathers more data, it tightens the confidence bound. By choosing the expert that has the maximum upper confidence-bound (Line 9 in Algorithm 1), we get an optimism-based algorithm that picks the best guess at each time-step. The rationale for an optimism-based approach is that if the expert is suboptimal, the algorithm will eventually pick a different expert whose upper bound is better.

In contrast to the classical literature on multi-armed bandits, when we switch between experts, there is a strong coupling in the trajectory of the MDP: when an expert first takes control, the current state of the system is a consequence of actions taken by the previous (possibly different) expert. As shown in [Mazumdar2017]

, in this framework, if an expert is used for a sufficiently long period of time, ergodicity implies that the average reward seen will be close to the average steady-state reward of that expert. This is made explicit in the following corollary, which follows from basic properties of finite Markov chains.

###### Proposition 1.

Suppose the induced Markov chains for each expert are irreducible and aperiodic. Pick any , and let be the expert chosen at the th iteration. We have that:

(6) |

We remark that

is a constant that depends on the second largest eigenvalue of the induced Markov matrix of the expert, but not on the dimension of the state or action space. Given this property, it can be shown that this UCB-based algorithm for expert selection enjoys strong theoretical performance properties measured in terms of regret, which we define next.

In order to gauge the efficacy of the UCB algorithm it is necessary to introduce the idea of regret. We first point out that for regret to be practical, we need to calculate it with respect to a reasonable baseline. Let . Thus is the best steady state average expected reward among all of our experts. We define the regret of the algorithm until iteration to be

(7) |

where , as before is given by . By virtue of corollary 1, algorithm 1 has the property that its regret is logarithmic in the time horizon for a properly chosen sequence length of episodes such that for all . Indeed, let us define the expected regret of expert as . Note that . Then it can be shown that the UCB algorithm for expert selection in MDPs has the following regret guarantee:

###### Proposition 2 (Regret bound for algorithm 1).

For a set of experts and initial time horizon such that for all , and for a choice of , the expected regret of the UCB algorithm after iteration steps, , satisfies:

(8) |

where and . For brevity we direct the reader to [Mazumdar2017] for the proof of the above proposition which follows from standard proof techniques in the multi-armed bandits literature and makes use of corollary 1 to bound the bias between the finite-time approximation to the steady-state reward of an expert.

## Iv Experiments

In this section we present two numerical experiments to validate the proposed method. The first is a simple low-dimensional gridworld where each expert was trained with a different dynamics model, and the second is a game environment with an extremely high-dimensional obervation space where the experts were trained under different observation kernels in the form of occlusions.

### Iv-a Low-dimensional Gridworld

We first test the UCB algorithm outlined in Section III on a gridworld example. The goal of the “agent” in the gridworld MDP is to maximize the cumulative reward by reaching a set of goal states. We first describe the dynamics, and then the reward structure of the grid.

The grid and setup used are shown and described in Fig. 2A. Each state, in this MDP, is a tile of the grid, and the agent begins in the white tile. The action space is . Each action corresponds to a move in one of the four cardinal directions. The dark blue states are almost-trapping states, where with probability 0.98 the agent is stuck in that state, and with probability 0.02, the agent moves in the direction of their chosen action. For all other states, the agent follows its desired action with probability 0.97, and goes in a random other direction with probability 0.03. In states along the edge of the grid, an action going out of the grid will result in a movement in a random direction back into the grid.

The rewards in the grid are as follows: the green squares both give a reward of 1, while the gray squares give rewards of 0.1. Dark blue squares give a reward of 0. Given this setup, the optimal policy would have the agent move around the edge of the grid, avoiding the dark-blue states, and then go back and forth between the two green states.

The agent, in this problem, is supplied with a collection of expert policies , and no knowledge of the underlying true dynamics. Each expert policy was trained on the same MDP structure but with different dynamics models. For example, for experts and , the action corresponds to a movement ‘up’ in the grid, whereas for experts and , action is ‘up’. The map from actions to movement that each expert was trained under is shown in Fig. 2B.

Given this grid-world MDP and collection of experts , we now run the UCB algorithm with episode lengths , where we set for all the experiments and keep as a variable for now.

We note that the MDP is simple by construction, and we use it more for illustrative purposes than to show the algorithm at work on a difficult task. Further, we note that the confidence bounds and sequence of time horizons we use for this experiment are those used in the derivation of the regret bound in [Mazumdar2017]. There are choices that can possibly yield better qualitative results, but for the sake of consistency we stay with the same choices as in the derivation.

Given the setup in the prequel, we now show the expected regret of the UCB algorithm. We run the algorithm 10 times for various values of and plot the average regret in Fig. 2(a). The values of , , and for each expert in can be found in [Mazumdar2017]. We note that expert is the best expert.

We can clearly see in Fig. 2(a), the logarithmic rate of growth of the regret. Further, we can see how long time horizons lead to lower regret. This most likely occurs because longer time horizons allow the average cumulative reward from each choice of expert to better approximate the average steady-state reward of that expert. In Fig. 2(b), however, from a cumulative reward standpoint, longer initial time horizons lead to a much slower rate of convergence to the optimal average cumulative reward. This is due to the fact that querying the wrong expert results in a higher corresponding loss of reward at each time step.

We again note that the iteration step of the algorithm is an artificial time-scale imposed over the problem to be able to formulate the problem as an MAB problem. Thus, the trend that a longer time horizon has lower regret after a fixed number of algorithm iterations is not necessarily a valid comparison, since correspondingly more time has elapsed in terms of the MDP. Indeed, given the results in Fig. 3, and with our stated goal of maximizing the cumulative reward, we would, perhaps counter-intuitively, choose the initial time horizon

despite the fact that it incurs a larger regret in terms of the algorithm. We note however, that lower time horizons inherently give samples with higher variance, meaning that this trend may not hold true for all MDPs. Further, we note that choosing a larger

is not asymptotically penalized since it does, eventually, achieve the same average cumulative reward as the lower . Finally we note that all choices of have logarithmically growing regret and the average cumulative reward tends to the average steady-state reward of the best expert in the set.### Iv-B High-dimensional Atari Game

For our second set of experiments we use the Seaquest domain from OpenAI gym [brockman2016openai] shown in Fig. 1. Given the high-dimensional observations of this environment, we chose our experts to be a set of Q-networks as defined in [Minh_2013]

. These action-value networks are functions which take a sequence of images as inputs and output a vector of length

corresponding to the value of taking each action. In this work, each Q-network was trained under a fixed occlusion from our set (occlusions 1 though 4, see Fig. 4) using the algorithm presented in [Minh_2013]. In total, four experts were trained, each against a unique occlusion. After training each expert, the UCB algorithm can be used in conjunction with any occlusion to minimize the regret and select at run-time the Q-network in our set of experts which performs best overall.All high-dimensional experiments were performed on a MacBook Pro 2013 with 2 GHz Intel Core i7 processors. For simplicity, the time horizon for which each expert is executed was held at 100 environmental (MDP) steps and all experiments were performed for a total of 1500 algorithm iterations. With these parameters each run of algorithm 1 took approximately 12 minutes.

In Sec. IV-B1, as a sanity check, we show the performance of the algorithm against the occlusions used at training time. In Sec. IV-B2, we run the algorithm against the test occlusions 5 through 8. Finally, in Sec. IV-B3, again using occlusions 5 through 8, we compare the accrued rewards of the algorithm against experts specifically trained to be robust.

#### Iv-B1 Expert Selection vs. Training Occlusions

In our experiments, we used four different observation models (in the form of occlusions of the input images) to train each of the experts. The set of occlusions used for training are 1 through 4, shown in Fig. 4. These were chosen arbitrarily to cover different areas of the input image. Note that in Fig. 1 the Seaquest domain looks different than in Fig. 4: this is due to a preprocessing step on the raw images.

This first test of the algorithm is performed against all the occlusions from our training set. The expected result is that given our set of four experts and one of the occlusions (1 through 4), the UCB algorithm will pick the expert trained against that occlusion most often. Figure 5 shows the performance of the UCB algorithm for the first four types of occlusions. As expected, given an occlusion from our set, the expert that was trained under that type of occlusion performed better than any of the other experts and was picked most often.

It is interesting to note that for the second occlusion the rate at which the correct expert was picked was slower than the rest, despite the fact that the regret normalized by the iteration step did tend towards zero in all cases. This might be caused by a mix of two factors: the generalization capabilities of the Q-networks, as well as the inherent performance differences between experts. While expert 2 was trained with occlusion two as shown in Fig. 4, expert 1 might have learned to generalize better to other occlusions, thus bringing its 100-step average reward to levels similar to expert 2. The other explanation is that since both experts had full image information for the bottom half of the screen, expert 1’s performance might simply be significantly better than expert 2’s when the submarine is in the lower half, thus making the 100-step average reward similar for both experts and making it difficult to distinguish which one of the two is best.

#### Iv-B2 Expert Selection vs. Test Occlusions

A useful feature of this multi-armed bandit algorithm is that it makes decisions solely based on the accrued rewards, and it is therefore agnostic to the input images. In practical terms this means that the types of occlusions we can use at run-time do not need to be those from our training set. The second row in Fig. 4 shows the set of test occlusions.

Figure 6 shows that for most of the new occlusions, one expert was favored above the rest. For the test occlusions 5 and 7, the algorithm picked expert 3 more often than any other, and for occlusion 8, the algorithm seemed to favor expert 1. An interesting aspect to note is that the experts that were picked more often are also the ones whose occlusions were most similar or had most overlap to the ones they were exposed to at training time. For occlusion 6, the algorithm did not seem to favor any one expert more than the rest within the first 1500 iterations. However, as expected, the two experts that were picked most often were the ones from our set with the highest 100-step average reward denoted with the dashed lines in Fig. 6.

#### Iv-B3 Expert Selection vs. Baseline Experts

For the last set of high-dimensional experiments we test the algorithm against a three Q-network baselines whose training images included occlusions. At training time, every environmental steps, a new random occlusion is produced to overlay the input images. By including these random occlusions during training, the network becomes more robust. Figure 7 confirms this: the Q-networks trained with random occlusions do better on average than Q-networks trained with no occlusions at all. In addition, in three out of the four test occlusions the UCB algorithm outperforms both baselines within 1500 iterations.

Note that the performance of the algorithm is heavily dependent on our set of experts and their ability to generalize. While UCB performed best overall, none of our trained experts had a truly good policy for occlusion 6. The performance could be improved with additional experts trained to deal with occlusions in the right-hand side of the images. In general, having a large and diverse set of experts will perform better than a smaller one. However, having too many experts can also slowdown the algorithm considerably. Finding the right balance between diversity of experts and set size is a very important issue that we will explore in future work.

## V Conclusion

In this work we presented a multi-armed bandit algorithm for the purpose of expert selection in (high-dimensional) MDPs. Given a set of expert policies trained under various dynamics models or observation kernels, we show that one can use a variant of the UCB algorithm from the multi-armed bandit literature to choose at run-time the expert that will do best for the current MDP. In our experiments, we test the UCB algorithm on a simple low-dimensional gridworld and a more complicated, high-dimensional Atari game. Despite the high-dimensionality of the game, our algorithm is able to pick the best expert from a set of experts trained under various observation models. Even when the observation model is different from our original one, our algorithm is still able to select the best expert. We finally compare the algorithm to a baseline of experts specifically trained to be robust to occlusions and find that in general the UCB algorithm still outperforms them.

Comments

There are no comments yet.