ε-BMC: A Bayesian Ensemble Approach to Epsilon-Greedy Exploration in Model-Free Reinforcement Learning

07/02/2020, by Michael Gimelfarb et al.

Resolving the exploration-exploitation trade-off remains a fundamental problem in the design and implementation of reinforcement learning (RL) algorithms. In this paper, we focus on model-free RL using the epsilon-greedy exploration policy, which, despite its simplicity, remains one of the most frequently used forms of exploration. However, a key limitation of this policy is the specification of ε. In this paper, we provide a novel Bayesian perspective of ε as a measure of the uniformity of the Q-value function. Based on this new perspective, we introduce a closed-form Bayesian update rule derived from Bayesian model combination (BMC), which allows us to adapt ε using experiences from the environment, in constant time and with monotone convergence guarantees. We demonstrate that our proposed algorithm, ε-BMC, efficiently balances exploration and exploitation on different problems, performing comparably to or outperforming the best tuned fixed annealing schedules and an alternative data-dependent ε adaptation scheme proposed in the literature.

1 Introduction

Balancing exploration with exploitation is a well-known and important problem in reinforcement learning (Sutton and Barto, 2018). If the behaviour policy focuses too much on exploration rather than exploitation, then this can hurt performance in an on-line setting. Furthermore, on-policy algorithms such as SARSA or TD(λ) might not converge to a good policy. On the other hand, if the exploration policy focuses too much on exploitation rather than exploration, then the state space might not be explored sufficiently and an optimal policy will not be found.

Historically, numerous exploration policies have been proposed for addressing the exploration-exploitation trade-off in model-free reinforcement learning, including Boltzmann exploration and epsilon-greedy (McFarlane, 2018). In these approaches, the trade-off is often controlled by one or more tuning parameters, such as ε in epsilon-greedy or the temperature parameter in Boltzmann exploration. However, these parameters typically have to be handcrafted or tuned for each task in order to obtain good performance. This motivates the design of exploration algorithms that adapt their behaviour according to some measure of the learning progress. The simplest approaches adapt the tuning parameters of a fixed class of exploration policies such as epsilon-greedy (Tokic, 2010). Other methods, such as count-based exploration (Thrun, 1992; Bellemare et al., 2016; Ostrovski et al., 2017) and Bayesian Q-learning (Dearden et al., 1998), use specialized techniques to develop new classes of exploration policies.

However, despite the recent developments in exploration strategies, epsilon-greedy is still often the exploration approach of choice (Vermorel and Mohri, 2005; Heidrich-Meisner, 2009; Mnih et al., 2015; Van Hasselt et al., 2016). Epsilon-greedy is both intuitive and simpler to tune than other approaches, since it is completely parameterized by one parameter, ε. Another benefit of this policy is that it can be easily combined with more sophisticated frameworks, such as options (Bacon et al., 2017). Unfortunately, the performance of epsilon-greedy in practice is highly sensitive to the choice of ε, and existing methods for adapting ε from data are ad-hoc and offer little theoretical justification.

In this paper, we take a fully Bayesian perspective on adapting ε based on return data. Recent work has demonstrated the strong potential of a Bayesian approach for parameter tuning in model-free reinforcement learning (Downey and Sanner, 2010). Another key advantage of a fully Bayesian approach over heuristics is the ability to specify priors on parameters, such as the predictive inverse variance of returns in this work, which makes the adaptation more robust to noise or temporary digressions in the learning process. In addition, our approach can be combined with other exploration policies such as Boltzmann exploration (Tokic and Palm, 2011). Specifically, we contribute:

  1. A new Bayesian perspective of expected SARSA as an ε-weighted mixture of two models, the greedy (Q-learning) bootstrap and one which averages uniformly over all Q-values (Section 4.1);

  2. A Bayesian algorithm, ε-BMC (Algorithm 1), that is robust (Section 4.2), general, and adapts ε efficiently (Section 4.3);

  3. A theoretical convergence guarantee of our proposed algorithm (Theorem 1).

Empirically, we evaluate the performance of ε-BMC on domains with discrete and continuous state spaces, using tabular and approximate RL methods. We show empirically that our algorithm can outperform exploration strategies that fix or anneal ε based on time, and even existing adaptive algorithms. In summary, ε-BMC is a novel, efficient and general approach to adapting the exploration parameter in epsilon-greedy policies that empirically outperforms a variety of fixed annealing schedules and other ad-hoc approaches.

2 Related Work

Our paper falls within the scope of adaptive epsilon-greedy algorithms. Perhaps the most similar approach to our work is the Value Differences Based Exploration (VDBE) algorithm of Tokic (2010), in which ε was modelled using a moving average and updated according to the Bellman (TD) error. However, that algorithm was presented for stationary-reward multi-armed bandits. Tokic and Palm (2011) combined the ideas of VDBE with Boltzmann exploration to create the VDBE-Softmax algorithm. dos Santos Mignon and da Rocha (2017) later developed a similar heuristic algorithm that works on non-stationary multi-armed bandit problems. However, all these approaches are heuristic in nature; our paper approaches the problem from a Bayesian perspective.

Modelling Q-values using normal-gamma priors, as done in our paper, is a cornerstone of Bayesian Q-learning (Dearden et al., 1998). However, that paper is fundamentally different from ours, in that it addresses the problem of exploration by adding a bonus to the Q-values that estimates the myopic value of perfect information. Our paper, on the other hand, applies the normal-gamma prior only to model the variance of the returns, while the exploration is handled using the epsilon-greedy policy with ε modelled using a beta distribution.

3 Preliminaries

3.1 Markov Decision Processes

In this paper, we denote a Markov decision process (MDP) as a tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$, where $\mathcal{S}$ is a set of states, $\mathcal{A}$ is a finite set of actions, $P(s' \mid s, a)$ is a transition function for the system state, $R(s, a)$ is a bounded reward function, and $\gamma \in [0, 1)$ is a discount factor.

Randomized exploration policies are sequences of mappings from states to probability distributions over actions. Given an MDP, for each state-action pair $(s, a)$ and policy $\pi$, we define the expected return
$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \;\Big|\; s_{0} = s,\; a_{0} = a\right],$$
where $a_t$ is sampled from $\pi$, $s_{t+1}$ is sampled from $P(\cdot \mid s_t, a_t)$, and $r_t = R(s_t, a_t)$. The associated value function is $V^{\pi}(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\!\left[Q^{\pi}(s, a)\right]$, and the objective is to learn an optimal policy $\pi^{*}$ that attains the supremum of $V^{\pi}$ over all policies. Puterman (2014) contains a more detailed treatment of this subject.

3.2 Reinforcement Learning

In the reinforcement learning setting, neither $P$ nor $R$ are known, so optimal policies are learned from experience, defined as sequences of transitions $(s_t, a_t, r_t, s_{t+1})$ broken up into episodes. Here, states and rewards are sampled from the environment, and actions follow some exploration policy $\pi$. Given an estimate $G_t$ of the expected return at time $t$ starting from state $s_t$ and taking action $a_t$, temporal difference (TD) learning updates the Q-values as follows:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \eta_t \left( G_t - Q(s_t, a_t) \right),$$
where $\eta_t$ is a problem-dependent learning rate parameter.

Typically, $G_t$ is computed by bootstrapping from the current Q-values, in order to reduce variance. Two of the most popular bootstrapping algorithms are Q-learning and SARSA, whose targets are given respectively as:
$$G_t^{Q} = r_t + \gamma \max_{a'} Q(s_{t+1}, a'), \tag{1}$$
$$G_t^{S} = r_t + \gamma\, Q(s_{t+1}, a_{t+1}). \tag{2}$$

Q-learning is an example of an off-policy algorithm, whereas SARSA is on-policy. Under relatively mild conditions and in tabular settings, Q-learning has been shown to converge to the optimal policy with probability one (Watkins and Dayan, 1992).

One additional algorithm that is important in this work is expected SARSA, whose target is
$$G_t^{ES} = r_t + \gamma \sum_{a'} \pi(a' \mid s_{t+1})\, Q(s_{t+1}, a'), \tag{3}$$
which is similar to SARSA, but in which the uncertainty of the next action with respect to $\pi$ is averaged out. This results in considerable variance reduction as compared to SARSA, and the theoretical properties of this algorithm are detailed in Van Seijen et al. (2009). Sutton and Barto (2018) provide a comprehensive treatment of reinforcement learning methods.
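For concreteness, a minimal sketch (function and argument names are ours, not from the paper) that computes the three bootstrap targets (1)-(3) from a tabular Q-function:

```python
import numpy as np

def bootstrap_targets(q, r, s_next, a_next, policy_probs, gamma):
    """Illustrative computation of the bootstrapped targets (1)-(3) from a
    tabular Q-function `q` (a 2-D array indexed by [state, action])."""
    g_q_learning = r + gamma * q[s_next].max()                 # Q-learning (1)
    g_sarsa = r + gamma * q[s_next, a_next]                    # SARSA (2)
    g_expected_sarsa = r + gamma * policy_probs @ q[s_next]    # expected SARSA (3)
    return g_q_learning, g_sarsa, g_expected_sarsa
```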

3.3 Epsilon-Greedy Policy

In this paper, exploration is carried out using ε-greedy policies, defined formally as
$$\pi_{\epsilon}(a \mid s) = \frac{\epsilon}{|\mathcal{A}|} + (1 - \epsilon)\, \mathbb{1}\!\left[a = \arg\max_{a'} Q(s, a')\right]. \tag{4}$$
In other words, $\pi_{\epsilon}$ samples a random action from $\mathcal{A}$ with probability ε, and otherwise selects the greedy action according to $Q$. As a result, ε can be interpreted as the relative importance placed on exploration.
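A minimal sketch of this action-selection rule (names are ours):

```python
import numpy as np

def epsilon_greedy_action(q_row, epsilon, rng):
    """Sample an action from the epsilon-greedy policy (4), given the row of
    Q-values `q_row` for the current state."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))   # explore: uniform random action
    return int(np.argmax(q_row))               # exploit: greedy action

# Example usage with hypothetical Q-values:
rng = np.random.default_rng(0)
action = epsilon_greedy_action(np.array([0.1, 0.5, 0.2]), epsilon=0.1, rng=rng)
```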

The optimal value of the parameter ε is typically problem-dependent, and found through experimentation. Often, ε is annealed over time in order to favor exploration at the beginning, and exploitation closer to convergence (Sutton and Barto, 2018). However, such approaches are not adaptive since they do not take into account the learning progress of the agent. In this paper, our main objective is to derive a data-driven tuning strategy for ε that depends on the current learning progress rather than trial and error.

4 Adaptive Epsilon-Greedy

In this section, we show how the expected return under epsilon-greedy policies can be written as an average of two return models weighted by ε. There are two relevant Bayesian methods for combining multiple models based on evidence: Bayesian model averaging (BMA) and Bayesian model combination (BMC). Generally, the Bayesian model combination approach is preferred to model averaging, since it provides a richer space of hypotheses and reduced variance (Minka, 2002). By interpreting ε as a random variable whose posterior distribution can be updated on the basis of observed data, BMC naturally leads to a method for ε adaptation.

4.1 A Bayesian Interpretation of Expected Sarsa With the Epsilon-Greedy Policy

We begin by combining the definition of expected SARSA (3) with the ε-greedy policy (4). For a state $s_t$, action $a_t$, and reward $r_t$, we have
$$G_t^{ES} = r_t + \gamma \sum_{a'} \pi_{\epsilon}(a' \mid s_{t+1})\, Q(s_{t+1}, a') = (1 - \epsilon)\, G_t^{Q} + \epsilon\, G_t^{U}, \tag{5}$$
where $G_t^{Q}$ is the Q-learning bootstrap (1) and
$$G_t^{U} = r_t + \gamma\, \frac{1}{|\mathcal{A}|} \sum_{a'} Q(s_{t+1}, a') \tag{6}$$
is an estimate that uniformly averages over all the action-values, which we call the uniform model. This leads to the following important observation: expected SARSA can be viewed as a probability-weighted average of two models, the greedy model that trusts the current Q-value estimates and acts optimally with respect to them, and the uniform model that completely distrusts the current Q-value estimates and consequently places a uniform belief over them. Under this interpretation, $1 - \epsilon$ and $\epsilon$ are the posterior beliefs assigned to the two aforementioned models, respectively. In the following subsections, we verify this simple fact algebraically in the context of Bayesian model combination. We also develop a method for maintaining the (approximate) posterior belief state efficiently, with a computational cost that is constant in both space and time, and with provable convergence guarantees.
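As a quick numerical sanity check of this decomposition (a sketch with arbitrary illustrative numbers and our own variable names):

```python
import numpy as np

# Check (5): under an epsilon-greedy policy, the expected SARSA target equals
# (1 - eps) * G^Q + eps * G^U.  All numbers below are arbitrary.
rng = np.random.default_rng(1)
q_next = rng.normal(size=4)        # Q(s', a') for four actions
r, gamma, eps = 1.0, 0.95, 0.3

probs = np.full(4, eps / 4)        # epsilon-greedy probabilities (4)
probs[np.argmax(q_next)] += 1.0 - eps

g_es = r + gamma * probs @ q_next                       # expected SARSA (3)
g_q = r + gamma * q_next.max()                          # greedy model (1)
g_u = r + gamma * q_next.mean()                         # uniform model (6)
assert np.isclose(g_es, (1 - eps) * g_q + eps * g_u)    # decomposition (5)
```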

4.2 Bayesian Q-Learning

In order to facilitate tractable learning and inference, we assume that the return observation $G_t$ at time $t$, given the model $m \in \{Q, U\}$, is normally distributed:
$$G_t \mid m, \tau \;\sim\; \mathcal{N}\!\left(G_t^{m},\, \tau^{-1}\right), \tag{7}$$
where the means $G_t^{Q}$ and $G_t^{U}$ are given in (1) and (6), respectively, and $\tau$ is the inverse of the variance, or precision. This assumption can be justified by viewing the return as a discounted sum of future (random) reward observations, and appealing to the central limit theorem when $\gamma$ is close to 1 and the MDP is ergodic (Dearden et al., 1998).

There are two special cases of interest in this work. In the first case, τ is allowed to be constant across all state-action pairs and models, which naturally leads to a state-independent ε adaptation. This approach is particularly advantageous when it is costly or impossible to maintain independent statistics per state, such as when the state space is very large or continuous in nature. In the second case, independent statistics are maintained per state and lead to state-dependent exploration.

In order to update τ, we consider the standard normal-gamma model:
$$\tau \sim \mathrm{Gamma}(\alpha_0, \beta_0), \qquad \mu \mid \tau \sim \mathcal{N}\!\left(\mu_0,\, (\lambda_0 \tau)^{-1}\right), \qquad G_t \mid \mu, \tau \sim \mathcal{N}\!\left(\mu,\, \tau^{-1}\right), \tag{8}$$
where the returns $G_t$ are i.i.d. given $\mu$ and $\tau$. Since the returns in different state-action pairs are dependent, this assumption is likely to be violated in practice. However, it leads to a compact learning representation necessary for tractable Bayesian inference, and has been used effectively in similar forms in the existing literature (Dearden et al., 1998). Furthermore, (8) is not used to model the Q-values directly in our paper, but rather to facilitate robust estimation of τ, as we now show.

Given data $\mathcal{D}_t$ of previously observed returns, the joint posterior distribution of $\mu$ and $\tau$ with likelihood (7) and prior (8) is also normal-gamma distributed, and so the marginal posterior distribution of $\tau$, $p(\tau \mid \mathcal{D}_t)$, is gamma distributed with parameters:
$$\alpha_t = \alpha_0 + \frac{t}{2}, \qquad \beta_t = \beta_0 + \frac{1}{2}\left( t\, s_t^{2} + \frac{\lambda_0\, t\, (\bar{G}_t - \mu_0)^{2}}{\lambda_0 + t} \right), \tag{9}$$
where $\bar{G}_t$ and $s_t^{2}$ are the sample mean and variance of the returns in $\mathcal{D}_t$, respectively (Bishop, 2006). These quantities can be updated online after each new observation in constant time (Welford, 1962).
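A minimal sketch of this online bookkeeping (our own class and variable names; the prior defaults are placeholders rather than the values used in our experiments):

```python
class ReturnPrecisionPosterior:
    """Online tracking of the gamma posterior (9) over the return precision,
    using Welford's constant-time update of the running mean and the running
    sum of squared deviations."""

    def __init__(self, mu0=0.0, lam0=1.0, alpha0=1.0, beta0=1.0):
        self.mu0, self.lam0, self.alpha0, self.beta0 = mu0, lam0, alpha0, beta0
        self.n, self.mean, self.sq_dev = 0, 0.0, 0.0

    def observe(self, g):
        # Welford (1962): update running mean and sum of squared deviations.
        self.n += 1
        delta = g - self.mean
        self.mean += delta / self.n
        self.sq_dev += delta * (g - self.mean)

    def gamma_params(self):
        # Marginal posterior of the precision is Gamma(alpha_t, beta_t), as in (9).
        alpha_t = self.alpha0 + 0.5 * self.n
        beta_t = (self.beta0 + 0.5 * self.sq_dev
                  + 0.5 * self.lam0 * self.n * (self.mean - self.mu0) ** 2
                  / (self.lam0 + self.n))
        return alpha_t, beta_t
```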

For each model $m \in \{Q, U\}$, we then marginalize over the uncertainty in $\tau$, using (7) and (9), as follows:
$$p(G_t \mid m, \mathcal{D}_t) = \int_{0}^{\infty} \mathcal{N}\!\left(G_t;\, G_t^{m},\, \tau^{-1}\right) \mathrm{Gamma}(\tau;\, \alpha_t, \beta_t)\, \mathrm{d}\tau,$$
which yields
$$p(G_t \mid m, \mathcal{D}_t) = \mathrm{St}\!\left(G_t;\, G_t^{m},\, \frac{\alpha_t}{\beta_t},\, 2\alpha_t\right), \tag{10}$$
where $\mathrm{St}(\cdot\,; \mu, \lambda, \nu)$ is the three-parameter Student t-distribution with mean $\mu$, precision $\lambda$ and $\nu$ degrees of freedom (Bishop, 2006). Therefore, marginalizing over the unknown precision leads to a t-distributed likelihood function. Alternatively, one could simply use the Gaussian likelihood in (7) and treat τ as a problem-dependent tuning parameter. However, the heavy-tail property of the t-distribution is advantageous in the non-stationary setting typically encountered in reinforcement learning applications, where Q-values change during the learning phase. We now show how to link this update with the expected SARSA decomposition (5) to derive an adaptive epsilon-greedy policy.
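If one wishes to evaluate this likelihood with an off-the-shelf library, the following sketch shows one possible mapping onto SciPy's parameterization (assuming SciPy is available; the function and argument names are ours):

```python
from math import sqrt
from scipy.stats import t as student_t

def model_likelihood(g, g_model, alpha_t, beta_t):
    """Marginal likelihood (10) of an observed return `g` under one model
    (greedy or uniform) with mean `g_model`.  The three-parameter Student t
    with precision alpha_t / beta_t and 2 * alpha_t degrees of freedom maps
    onto scipy's loc/scale form via scale = sqrt(beta_t / alpha_t)."""
    return student_t.pdf(g, df=2.0 * alpha_t, loc=g_model,
                         scale=sqrt(beta_t / alpha_t))
```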

4.3 Epsilon-BMC: Adapting Epsilon Using Bayesian Model Combination

In the general setting of Bayesian model combination, we model the uncertainty in the Q-values for each state-action pair by placing a posterior distribution over the weight used to combine the candidate models. The expected posterior return can be written as an average over all possible combinations of the greedy and uniform models,
$$\mathbb{E}\!\left[G_t \mid \mathcal{D}_t\right] = \int_{0}^{1} \left( w\, G_t^{U} + (1 - w)\, G_t^{Q} \right) p(w \mid \mathcal{D}_t)\, \mathrm{d}w, \tag{11}$$
where $w$ is the weight assigned to the uniform model and $1 - w$ is the weight assigned to the greedy model (Monteith et al., 2011). As will be verified shortly, the expectation of this weight given the past return data turns out to be a Bayesian interpretation of ε.

The belief over $w$ is maintained as a posterior distribution $p(w \mid \mathcal{D}_t)$. Continuing from (11) and using the means of the model likelihoods (10),
$$\mathbb{E}\!\left[G_t \mid \mathcal{D}_t\right] = \mathbb{E}\!\left[w \mid \mathcal{D}_t\right] G_t^{U} + \left(1 - \mathbb{E}\!\left[w \mid \mathcal{D}_t\right]\right) G_t^{Q},$$
which is exactly (5) except that now $\epsilon = \mathbb{E}\!\left[w \mid \mathcal{D}_t\right]$. We have thus shown that the expected SARSA bootstrap with a data-driven ε can be viewed in terms of Bayesian model combination; we refer to the resulting target as the ε-BMC return estimate.

The posterior distribution is updated recursively by Bayes' rule,
$$p(w \mid \mathcal{D}_{t+1}) \;\propto\; \Big( w\, p(G_t \mid U, \mathcal{D}_t) + (1 - w)\, p(G_t \mid Q, \mathcal{D}_t) \Big)\, p(w \mid \mathcal{D}_t),$$
for every new return observation $G_t$, where $\mathcal{D}_{t+1} = \mathcal{D}_t \cup \{G_t\}$. Since the number of terms in the exact posterior grows exponentially in $t$, it is necessary to use posterior approximation techniques to compute $p(w \mid \mathcal{D}_t)$.

One approach to address the intractability of computing the exact posterior is to sample directly from the distribution. However, such an approach is inherently noisy and inefficient in practice. Instead, we apply the Dirichlet moment-matching technique (Hsu and Poupart, 2016; Gimelfarb et al., 2018), which was shown to be effective at differentiating between good and bad models from data, easy to implement, and efficient. In particular, we apply the approach in Gimelfarb et al. (2018) with the models $Q$ and $U$, by matching the first and second moments of $p(w \mid \mathcal{D}_t)$ to those of a beta distribution and solving the resulting system of equations for the two beta parameters. The closed-form solution is:

(12)
(13)
(14)
(15)
(16)

where the two likelihood values are the respective probabilities of observing the return $G_t$ under the uniform-model and greedy-model distributions in (10). It follows that the adapted exploration parameter is the posterior mean of the uniform-model weight,
$$\epsilon_t = \mathbb{E}\!\left[w \mid \mathcal{D}_t\right]. \tag{17}$$
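Since the exact closed form is given by (12)-(16), we only sketch here how a beta moment-matching step of this kind can be implemented; the algebra below is a generic instantiation under the mixture likelihood described above, with our own names, and need not coincide term for term with (12)-(16):

```python
def beta_moment_match(a, b, e_u, e_q):
    """One approximate BMC update of the Beta(a, b) posterior over the
    uniform-model weight w, after observing a return with likelihoods e_u
    (uniform model) and e_q (greedy model) from (10)."""
    n = a + b
    # Moments of w under the Beta(a, b) prior.
    m1p = a / n
    m2p = a * (a + 1) / (n * (n + 1))
    m3p = a * (a + 1) * (a + 2) / (n * (n + 1) * (n + 2))
    # Normalizer and first two moments of the exact one-step posterior,
    # whose likelihood is the mixture w * e_u + (1 - w) * e_q.
    z = e_u * m1p + e_q * (1.0 - m1p)
    if z <= 0.0:
        return a, b                    # numerical underflow: keep the posterior
    m1 = (e_u * m2p + e_q * (m1p - m2p)) / z
    m2 = (e_u * m3p + e_q * (m2p - m3p)) / z
    # Match a beta distribution to the posterior mean m1 and variance m2 - m1^2.
    var = m2 - m1 ** 2
    if var <= 0.0:
        return a, b                    # degenerate case: keep the posterior
    k = m1 * (1.0 - m1) / var - 1.0
    return m1 * k, (1.0 - m1) * k

# The adapted exploration parameter (17) is then the posterior mean of w,
# i.e. a / (a + b) after each update.
```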

All quantities, including the matched beta parameters and the running return statistics, can be computed online in constant time and with minimal storage overhead, without caching $\mathcal{D}_t$. We call this approach ε-BMC and present the corresponding pseudo-code in Algorithm 1. Therein, lines marked with a * indicate additions to the ordinary expected SARSA algorithm.

1:  * initialize the prior parameters of (8) and the beta prior over $w$
2:  for each episode do
3:     initialize state $s$
4:     for each step in the episode do
5:        * $\epsilon \leftarrow \mathbb{E}[w \mid \mathcal{D}]$ using (17)
6:        choose action $a$ using the $\epsilon$-greedy policy (4)
7:        take action $a$, observe $r$ and $s'$
8:        compute the greedy return estimate $G^{Q}$ using (1)
9:        compute the uniform return estimate $G^{U}$ using (6)
10:        $G \leftarrow (1 - \epsilon)\, G^{Q} + \epsilon\, G^{U}$ {note that this is the expected SARSA return (5)}
11:        $Q(s, a) \leftarrow Q(s, a) + \eta\, (G - Q(s, a))$
12:        * update the running return mean and variance using the observation $G$
13:        * compute $\alpha$ and $\beta$ using (9)
14:        * compute the likelihoods of $G$ under the greedy and uniform models using (10)
15:        * update the beta posterior over $w$ using (12)-(16)
16:        $s \leftarrow s'$
17:     end for
18:  end for
Algorithm 1 ε-BMC with Expected SARSA

In Algorithm 1, the expected SARSA return was used to update both the posterior on $w$ and the Q-values. However, it is possible to update the Q-values in line 11 using a different estimator of the future return, including Q-learning (1), SARSA (2), or other approaches. The Q-values could also be approximated using a deep neural network or other function approximator. The algorithm can also be run off-line by caching the return observations for an entire episode and then updating the posterior. State-dependent exploration can be easily implemented by maintaining the posterior statistics independently for each state, or approximating them using a neural network.
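To make these variants concrete, the following self-contained sketch gathers the statistics maintained by the starred lines of Algorithm 1 into a single object that can be paired with any Q-value representation. The class name, prior defaults, and moment-matching algebra are ours (the latter stands in for the closed form (12)-(16)), so this is an illustration rather than a reference implementation.

```python
from math import sqrt
from scipy.stats import t as student_t

class EpsilonBMC:
    """Bookkeeping for the starred lines of Algorithm 1: running return
    statistics, gamma posterior (9), Student-t likelihoods (10), and a beta
    moment-matching update of the posterior over w."""

    def __init__(self, mu0=0.0, lam0=1.0, alpha0=1.0, beta0=1.0, a0=1.0, b0=1.01):
        self.mu0, self.lam0, self.alpha0, self.beta0 = mu0, lam0, alpha0, beta0
        self.a, self.b = a0, b0                     # beta posterior over w
        self.n, self.mean, self.sq_dev = 0, 0.0, 0.0

    @property
    def epsilon(self):
        return self.a / (self.a + self.b)           # posterior mean of w, (17)

    def update(self, g, g_q, g_u):
        # 1. Running return statistics (Welford, 1962).
        self.n += 1
        delta = g - self.mean
        self.mean += delta / self.n
        self.sq_dev += delta * (g - self.mean)
        # 2. Gamma posterior over the return precision, as in (9).
        alpha_t = self.alpha0 + 0.5 * self.n
        beta_t = (self.beta0 + 0.5 * self.sq_dev
                  + 0.5 * self.lam0 * self.n * (self.mean - self.mu0) ** 2
                  / (self.lam0 + self.n))
        # 3. Student-t likelihoods (10) of g under the two models.
        scale = sqrt(beta_t / alpha_t)
        e_q = student_t.pdf(g, df=2.0 * alpha_t, loc=g_q, scale=scale)
        e_u = student_t.pdf(g, df=2.0 * alpha_t, loc=g_u, scale=scale)
        # 4. Beta moment-matching update of the posterior over w.
        a, b, s = self.a, self.b, self.a + self.b
        m1p = a / s
        m2p = a * (a + 1) / (s * (s + 1))
        m3p = a * (a + 1) * (a + 2) / (s * (s + 1) * (s + 2))
        z = e_u * m1p + e_q * (1.0 - m1p)
        if z <= 0.0:
            return                                   # numerical underflow: skip
        m1 = (e_u * m2p + e_q * (m1p - m2p)) / z
        m2 = (e_u * m3p + e_q * (m2p - m3p)) / z
        var = m2 - m1 ** 2
        if var > 0.0:
            k = m1 * (1.0 - m1) / var - 1.0
            self.a, self.b = m1 * k, (1.0 - m1) * k
```

For example, an agent would read ε from this object before action selection (line 5 of Algorithm 1) and call its update with the combined return and the two model means after each transition (lines 12-15), regardless of whether the Q-values are stored in a table or predicted by a network.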

A final advantage of our algorithm is a provable convergence guarantee under fairly general assumptions.

Theorem 1 (Monotone Convergence of ε-BMC).

Suppose ε_0 ≤ 1/2. Then ε_{t+1} ≤ ε_t for all t ≥ 0, and therefore ε_t converges as t → ∞.

Proof.

Let and observe that:

(18)

where the first line uses (15)-(17), the second uses (12) and the third uses (15) and (16). Then, since :

Now, if ε_t ≤ 1/2, then combining this with the inequality above yields ε_{t+1} ≤ ε_t. The first statement of the theorem follows from the assumption ε_0 ≤ 1/2 and a standard induction argument. The second statement follows from the monotone convergence theorem (see, e.g., Rudin (1976), p. 56). ∎

The convergence of ε-BMC holds using any value function representation, including neural networks. It is important to note that convergence of ε-BMC can only be guaranteed when ε is initialized in (0, 1/2]. However, this is not a concern in practice, since it has been found that there is no significant gain in using values of ε larger than 0.5 (dos Santos Mignon and da Rocha, 2017).

5 Empirical Evaluation

To demonstrate the ability of Algorithm 1 to adapt ε in a variety of environments, we consider a deterministic, finite-state grid-world domain, the continuous-state cart-pole control problem, and a stochastic, discrete-state supply-chain problem. The third domain was chosen to show how our algorithm performs when the action space is large and the problem is stochastic. We considered two different reinforcement learning algorithms: on-policy tabular expected SARSA (Sutton and Barto, 2018), and off-policy DQN with experience replay (Mnih et al., 2015). (As noted in the previous section, we only need to replace line 11 of Algorithm 1 with the DQN update.) The parameter settings are listed in Tables 1 and 2 in the supplementary materials; the code and supplementary materials can be found at https://github.com/mike-gimelfarb/bayesian-epsilon-greedy. All experiments were run independently 100 times, and mean curves with shaded standard error are reported.

In the empirical evaluation of Algorithm 1, our goal is to quantify the added value of adapting the parameter ε in epsilon-greedy policies using a Bayesian approach, rather than to compare the performance of epsilon-greedy policies against other approaches, which has been investigated in the literature in various settings (Vermorel and Mohri, 2005; Tijsma et al., 2016). Therefore, Algorithm 1 is compared against different annealing schedules for ε, broken down into the following categories (a sketch of these schedule families is given after the list):

  • constant: ε fixed at a constant value throughout training;

  • geometric: ε decayed geometrically in the episode number;

  • power: ε decayed as an inverse power of the episode number;

  • adaptive: VDBE (Tokic, 2010) with several settings of its inverse-sensitivity parameter.
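For clarity, the fixed schedule families can be sketched as follows; the constants shown are hypothetical placeholders rather than the values actually swept in our experiments:

```python
def constant_schedule(c):
    return lambda episode: c                       # epsilon fixed at c

def geometric_schedule(c, d):
    return lambda episode: c * d ** episode        # geometric decay in the episode number

def power_schedule(c, p):
    return lambda episode: c / (episode + 1) ** p  # inverse-power decay

# Hypothetical example settings, for illustration only:
baselines = {
    "constant": constant_schedule(0.1),
    "geometric": geometric_schedule(1.0, 0.99),
    "power": power_schedule(1.0, 0.5),
}
```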

We do not compare to Tokic and Palm (2011), since that paper falls outside the scope of epsilon-greedy policies. However, we reiterate that it is a trivial matter to interchange VDBE and ε-BMC in that framework.

5.1 Grid-World

The first benchmark problem is the discrete, deterministic 5-by-5 grid-world navigation problem with sub-goals presented in Ng et al. (1999). Valid moves incur a cost of 0.1, and invalid moves incur a cost of 0.2, in order to encourage the agent to solve the task in the least amount of time. We fix the discount factor γ for this domain. Testing consists of running a single episode, starting from the same initial state, using the greedy policy at the end of each training episode. The results are shown in Figures 1 and 2.

Figure 1: Average performance (steps to reach the final goal) on the grid-world domain using expected SARSA.
Figure 2: Average performance (steps to reach the final goal) on the grid-world domain using deep Q learning.

5.2 Cart-Pole

The second problem is the continuous, deterministic cart-pole control problem. A reward of 1.0 is provided at each time step until the pole falls, to encourage the agent to keep the pole upright. We again fix the discount factor γ for this domain. To run the tabular expected SARSA algorithm and VDBE, we discretize the four-dimensional state space into equally sized regions. Since the initial position is randomized, testing consists of evaluating the greedy policy on 10 independent episodes and averaging the returns. Since over-fitting was a significant concern for DQN, we stop training as soon as perfect test performance (the pole was not dropped) was observed over four consecutive episodes. The results are shown in Figures 3 and 4.

Figure 3: Average performance (time steps the pole is balanced) on the cart-pole domain using expected SARSA.
Figure 4: Average performance (time steps the pole is balanced) on the cart-pole domain using deep Q-learning.

5.3 Supply-Chain Management

This supply-chain management problem was described in Kemmer et al. (2018), and consists of a factory and a set of warehouses. The agent must decide how much inventory to produce at the factory and how much inventory to ship to the warehouse(s) to meet the demand. The cost and capacity parameters are fixed across all runs, and demand follows a Poisson distribution. We also impose a transportation limit on the number of items shipped per period, and assume that unfulfilled demand is not backlogged, but lost forever. The initial state is the same for training and testing (10 items initially at the factory and 0 at the warehouse), and we again fix the discount factor γ. As with cart-pole, testing is done by averaging the returns of 10 independent trials using the greedy policy. The results are illustrated in Figures 5 and 6.

Figure 5: Average performance (return) on the supply-chain domain using expected SARSA.
Figure 6: Average performance (return) on the supply-chain domain using deep Q-learning.

5.4 Discussion

Overall, we see that ε-BMC either consistently outperformed all other types of annealing strategies, including VDBE, or performed similarly. However, ε-BMC converged slightly later than VDBE on the grid-world domain, and later than the fixed annealing strategy on the supply-chain problem, when using tabular expected SARSA. In the former case, ε-BMC still outperformed all fixed tuning strategies, and in the latter case, it outperformed VDBE by a large margin. These observations concern only the speed of convergence; asymptotically, ε-BMC approached the performance of the best policy that was attained (for grid-world, this is indeed the optimal policy).

While it performed well on the simple grid-world domain, VDBE performed considerably worse than ε-BMC on the more complex supply-chain problem. We believe that the Bayesian approach of ε-BMC smooths out the noise in the return signals better than VDBE and other ad-hoc approaches for adapting ε. This also suggests why our algorithm performed better on DQN.

Furthermore, we see that no single family of annealing strategies worked consistently well across all domains and algorithms. For instance, geometric decay strategies worked well on the grid-world domain, while performing poorly on the supply-chain problem using tabular expected SARSA. The power decay strategies worked well on the supply-chain problem using tabular expected SARSA, but failed to match the performance of other strategies when switching to DQN. Also, the performance of VDBE was highly sensitive to the choice of its inverse-sensitivity parameter: a lower value worked well for grid-world and cart-pole, but higher values worked better for supply-chain. The performance of ε-BMC was relatively insensitive to the choice of the normal-gamma prior parameters in (8), so we were able to use the same values in all our experiments. However, unsurprisingly, it was more sensitive to the strength of the prior on the model weight w. Since we can always initialize ε just below one half, this effectively reduces to the problem of selecting a single parameter that controls the strength of the prior on w. This is considerably easier to do than to select both a good annealing method and the tuning parameter(s).

6 Conclusion

In this paper, we proposed a novel Bayesian approach to solving the exploration-exploitation problem in general model-free reinforcement learning, in the form of an adaptive epsilon-greedy policy. Our algorithm, ε-BMC, tunes the parameter ε automatically from return observations based on Bayesian model combination and approximate moment-matching inference. It was argued to be general, efficient, robust, and theoretically grounded, and was shown empirically to outperform fixed annealing schedules for ε and even a state-of-the-art adaptation scheme.

In future work, it would be interesting to evaluate the performance of ε-BMC combined with Boltzmann exploration (Tokic and Palm, 2011), as well as the state-dependent version. We believe that it is possible to obtain a Bayesian interpretation of VDBE by placing priors over the Bellman errors and updating them using data, but we have not investigated this approach. It would also be interesting to extend our approach to handle options.

Acknowledgements

We thank the reviewers for their insightful comments and suggestions.

References

  • P. Bacon, J. Harb, and D. Precup (2017) The option-critic architecture. In Thirty-First AAAI Conference on Artificial Intelligence.
  • M. Bellemare, S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton, and R. Munos (2016) Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, pp. 1471–1479.
  • C. M. Bishop (2006) Pattern recognition and machine learning. Springer.
  • R. Dearden, N. Friedman, and S. Russell (1998) Bayesian Q-learning. In AAAI/IAAI, pp. 761–768.
  • A. dos Santos Mignon and R. L. d. A. da Rocha (2017) An adaptive implementation of ε-greedy in reinforcement learning. Procedia Computer Science 109, pp. 1146–1151.
  • C. Downey and S. Sanner (2010) Temporal difference Bayesian model averaging: a Bayesian perspective on adapting lambda. In International Conference on Machine Learning, pp. 311–318.
  • M. Gimelfarb, S. Sanner, and C. Lee (2018) Reinforcement learning with multiple experts: a Bayesian model combination approach. In Advances in Neural Information Processing Systems, pp. 9549–9559.
  • V. Heidrich-Meisner (2009) Interview with Richard S. Sutton. KI 23 (3), pp. 41–43.
  • W. Hsu and P. Poupart (2016) Online Bayesian moment matching for topic modeling with unknown number of topics. In Advances in Neural Information Processing Systems, pp. 4536–4544.
  • L. Kemmer, H. von Kleist, D. de Rochebouet, N. Tziortziotis, and J. Read (2018) Reinforcement learning for supply chain optimization. In European Workshop on Reinforcement Learning 14.
  • R. McFarlane (2018) A survey of exploration strategies in reinforcement learning. Unpublished.
  • T. P. Minka (2002) Bayesian model averaging is not model combination. Unpublished.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529.
  • K. Monteith, J. L. Carroll, K. Seppi, and T. Martinez (2011) Turning Bayesian model averaging into Bayesian model combination. In Neural Networks (IJCNN), pp. 2657–2663.
  • A. Y. Ng, D. Harada, and S. Russell (1999) Policy invariance under reward transformations: theory and application to reward shaping. In International Conference on Machine Learning, Vol. 99, pp. 278–287.
  • G. Ostrovski, M. G. Bellemare, A. van den Oord, and R. Munos (2017) Count-based exploration with neural density models. In Proceedings of the 34th International Conference on Machine Learning, pp. 2721–2730.
  • M. L. Puterman (2014) Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons.
  • W. Rudin (1976) Principles of mathematical analysis. Third edition, McGraw-Hill Book Co., New York.
  • R. S. Sutton and A. G. Barto (2018) Reinforcement learning: an introduction. MIT Press.
  • S. B. Thrun (1992) Efficient exploration in reinforcement learning. Technical report CMU-CS-92-102, School of Computer Science, Carnegie Mellon University.
  • A. D. Tijsma, M. M. Drugan, and M. A. Wiering (2016) Comparing exploration strategies for Q-learning in random stochastic mazes. In 2016 IEEE Symposium Series on Computational Intelligence (SSCI), pp. 1–8.
  • M. Tokic and G. Palm (2011) Value-difference based exploration: adaptive control between epsilon-greedy and softmax. In Annual Conference on Artificial Intelligence, pp. 335–346.
  • M. Tokic (2010) Adaptive ε-greedy exploration in reinforcement learning based on value differences. In Annual Conference on Artificial Intelligence, pp. 203–210.
  • H. Van Hasselt, A. Guez, and D. Silver (2016) Deep reinforcement learning with double Q-learning. In Thirtieth AAAI Conference on Artificial Intelligence.
  • H. Van Seijen, H. Van Hasselt, S. Whiteson, and M. Wiering (2009) A theoretical and empirical analysis of expected SARSA. In IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), pp. 177–184.
  • J. Vermorel and M. Mohri (2005) Multi-armed bandit algorithms and empirical evaluation. In European Conference on Machine Learning, pp. 437–448.
  • C. J. Watkins and P. Dayan (1992) Q-learning. Machine Learning 8 (3-4), pp. 279–292.
  • B. Welford (1962) Note on a method for calculating corrected sums of squares and products. Technometrics 4 (3), pp. 419–420.

Supplementary Material

Tables 1 and 2 list the parameter settings used in the experiments for tabular expected SARSA and DQN, respectively.

PARAMETER DESCRIPTION GRID-WORLD CART-POLE SUPPLY-CHAIN
Table initialization uniform on [0, 0.1] zeros uniform on [0, 0.1]
Learning rate ( episode #) 0.7 0.6
Max. episode length 200 200 200
Prior parameter in (8) 0 0 0
Prior parameter in (8) 1 1 1
Prior parameter in (8) 500 500 500
Prior parameter in (8) 500 500 500
Prior parameter for 1 10 1000
Prior parameter for 1 + 0.01 10 + 0.01 1000 + 0.01
Table 1: Parameter settings for the tabular expected SARSA algorithm.
PARAMETER DESCRIPTION GRID-WORLD CART-POLE SUPPLY-CHAIN
Network initialization Glorot uniform Glorot uniform Glorot uniform
Network topology 16-25-25-4 4-12-12-2 102-100-100-100
Hidden activation ReLU ReLU ReLU
Regularization none L2() none
State encoding one-hot none one-hot
Learning rate 0.001 0.0005 0.001
Replay buffer size 2000 2000 3000
Batch size 24 32 64
Training epochs per batch 5 3 2
Max. episode length 200 200 200
Prior parameter in (8) 0 0 0
Prior parameter in (8) 1 1 1
Prior parameter in (8) 500 500 500
Prior parameter in (8) 500 500 500
Prior parameter for 1 5 25
Prior parameter for 1 + 0.01 5 + 0.01 25 + 0.01
Table 2: Parameter settings for the deep Q-learning algorithm.