Abstract
Bayesian approaches provide a principled solution to the exploration-exploitation tradeoff in Reinforcement Learning. Typical approaches, however, either assume a fully observable environment or scale poorly. This work introduces the Factored Bayes-Adaptive POMDP model, a framework that is able to exploit the underlying structure while learning the dynamics in partially observable systems. We also present a belief tracking method to approximate the joint posterior over state and model variables, and an adaptation of Monte-Carlo Tree Search, which together are capable of solving the underlying problem near-optimally. Our method can learn efficiently given a known factorization, or learn the factorization and the model parameters simultaneously. We demonstrate that this approach outperforms current methods and tackles problems that were previously infeasible.
1. Introduction
Robust decision-making agents in any non-trivial system must reason over uncertainty in various dimensions, such as action outcomes, the agent's current state, and the dynamics of the environment. Outcome and state uncertainty are elegantly captured by POMDPs [pomdp], which enable reasoning in stochastic, partially observable environments.
POMDP solution methods, however, assume complete knowledge of the system dynamics, which unfortunately is often not easily available. When such a model is not available, the problem turns into a Reinforcement Learning (RL) task, where one must weigh the potential benefit of learning against that of exploiting current knowledge. Bayesian RL addresses this exploration-exploitation tradeoff in a principled way by explicitly considering the uncertainty over the unknown parameters. While model-based Bayesian RL has been applied to partially observable models [ross2011bayesian], these approaches do not scale to problems with more than a handful of unknown parameters. Crucially, they model the dynamics of the environment in a tabular fashion, which cannot generalize over similar states and thus cannot exploit the structure of real-world applications.
Earlier work for fully observable environments tackles this issue by representing states with features and the dynamics as graphs [structuredMDP]. Their formulation for the MDP case, however, does not accommodate environments that are either partially hidden or where the perception of the state is noisy.
In this work we introduce the Factored Bayes-Adaptive POMDP (FBA-POMDP), which captures partially observable environments with unknown dynamics and does so by exploiting structure. Additionally, we describe a solution method based on the Monte-Carlo Tree Search family and a mechanism for maintaining a belief specifically for the FBA-POMDP. We show the favourable theoretical guarantees of this approach and demonstrate empirically that it outperforms the current state-of-the-art methods. In particular, our method outperforms previous work on three domains, of which one is too large to be tackled by solution methods based on the tabular BA-POMDP.
2. Background
We first provide a summary of the background literature. This section is divided into an introduction to the POMDP and BA-POMDP, typical solution methods for those models, and factored models.
2.1. The POMDP & BA-POMDP
The POMDP [pomdp] is a general model for decision-making in stochastic and partially observable domains, with execution unfolding over (discrete) time steps. At each step the agent selects an action that triggers a state transition in the system, which generates some reward and observation. The observation is perceived by the agent and the next time step commences. Formally, a POMDP is described by the tuple (S, A, O, D, R, γ, h), where S is the set of states of the environment; A is the set of actions; O is the set of observations; D(s′, o | s, a) is the 'dynamics function' that describes the dynamics of the system in the form of transition probabilities;¹ R(s, a) is the immediate reward function that describes the reward of selecting a in s; γ is the discount factor; and h is the horizon of an episode in the system. In this description of the POMDP, D captures the probability of transitioning from state s to the next state s′ and generating observation o in the process, for each action a. (¹This formulation generalizes the typical formulation with separate transition and observation functions T and O: D(s′, o | s, a) = T(s′ | s, a) O(o | s′, a). In our experiments, we do employ this typical factorization.) The goal of the agent in a POMDP is to maximize the expectation over the cumulative (discounted) reward, also called the return. The agent has no direct access to the system's state, so it can only rely on the action-observation history up to the current step, h_t = (a_0, o_1, …, a_{t-1}, o_t)
. The agent can use this history to maintain a probability distribution over the state, also called a belief, b(s). A solution to a POMDP is then a mapping from a belief to an action, which is called a policy. Solution methods aim to find an optimal policy: a mapping from a belief to an action with the highest possible expected return. The POMDP allows solution methods to compute the optimal policy given a complete description of the dynamics of the domain. In many real-world applications such a description is not readily available. The BA-POMDP [ross2011bayesian] is a model-based Bayesian Reinforcement Learning framework for applications where those descriptions are hard to get, allowing the agent to directly reason about its uncertainty over the POMDP model. Conceptually, if one observed both the states and observations, then one could count the occurrences of all transitions and store those in χ, where we write χ_{sa}^{s′o} for the number of times that (s, a) is followed by (s′, o). The belief over D could then compactly be described by Dirichlet distributions, supported by the counts χ. While the agent cannot observe the states and thus has uncertainty about the actual count vector, this uncertainty can be represented using the regular POMDP formalism. That is, the count vector is included as part of the hidden state of the POMDP.
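To make this concrete, the count representation and its update can be sketched as follows (a minimal illustration; the dictionary-of-arrays layout and the function names are our own assumptions, not the paper's implementation):

```python
import numpy as np

def expected_dynamics(chi, s, a):
    """E[D | chi](s', o | s, a): normalize the Dirichlet counts for (s, a)."""
    counts = chi[(s, a)]
    return counts / counts.sum()

def update_counts(chi, s, a, s_next, o):
    """Return chi + delta: increment the count of the observed (s, a, s', o)."""
    new_chi = {key: val.copy() for key, val in chi.items()}
    new_chi[(s, a)][s_next, o] += 1
    return new_chi
```

Here `chi[(s, a)]` holds one Dirichlet count matrix over (s′, o) pairs; normalizing it yields the expected dynamics used in eq. (1).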
Formally, the BA-POMDP is the POMDP with (hyper) state space S̄ = S × X, where X is the countably infinite space of assignments of χ. While the observation and action space remain unchanged, a state in the BA-POMDP now includes the Dirichlet parameters: s̄ = (s, χ). The reward function still only depends on the underlying POMDP state: R̄((s, χ), a) = R(s, a). The dynamics of the BA-POMDP, D̄, factorize to p(s′, o | s, χ, a) p(χ′ | χ, s, a, s′, o), where the first term corresponds to the expectation of D according to χ:

(1)  D_χ(s′, o | s, a) = E[D | χ](s′, o | s, a) = χ_{sa}^{s′o} / Σ_{s″,o″} χ_{sa}^{s″o″}

If we let δ_{sa}^{s′o} denote a vector of the length of χ containing all zeros except for the position corresponding to (s, a, s′, o) (where it is 1), and if we let 𝟙_{x=y} denote the Kronecker delta that indicates (is 1 when) x = y, then we denote χ′ = χ + δ_{sa}^{s′o} and can write D̄ as D̄((s′, χ′), o | (s, χ), a) = D_χ(s′, o | s, a) 𝟙_{χ′ = χ + δ_{sa}^{s′o}}. Lastly, just like any Bayesian method, the BA-POMDP requires a prior b̄_0, the initial joint belief over the domain state and dynamics. Typically the prior information about χ can be described with a single set of counts χ_0, and b̄_0
reduces to the joint distribution b̄_0(s, χ) = b_0(s) 𝟙_{χ = χ_0}, where b_0 is the distribution over the initial state of the underlying POMDP.

2.2. Learning by Planning in BA-POMDPs
The countably infinite state space of the Bayes-Adaptive model poses a challenge to offline solution methods due to the curse of dimensionality. Partially Observable Monte-Carlo Planning (POMCP) [POMCP], a Monte-Carlo Tree Search (MCTS) based algorithm, does not suffer from this curse, as its complexity is independent of the state space. As a result, the extension to the Bayes-Adaptive case, BA-POMCP [katt2017learning], has shown promising results. POMCP incrementally constructs a lookahead action-observation tree using Monte-Carlo simulations of the POMDP. The nodes in this tree contain statistics such as the number of times each node has been visited and the average (discounted) return that follows. Each simulation starts by sampling a state from the belief, and traverses the tree by picking actions according to UCB [auer2002finite] and simulating observations according to the POMDP model. Upon reaching a leaf node, the tree is extended with a node for that particular history, and an estimate of the expected utility of the node is generated. The algorithm then propagates the accumulated reward back up the tree and updates the statistics in each visited node. The action selection terminates by picking the action at the root of the tree that has the highest average return.
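The simulation loop described above can be sketched as follows (a minimal, illustrative POMCP; the generative interface `step(s, a) -> (next_state, observation, reward)` and all parameter choices are assumptions, not the authors' implementation):

```python
import math
from collections import defaultdict

class POMCP:
    """Minimal POMCP sketch over a generative model step(s, a) -> (s2, o, r)."""

    def __init__(self, actions, step, gamma=0.95, ucb_c=1.0, depth=10):
        self.actions, self.step = actions, step
        self.gamma, self.c, self.depth = gamma, ucb_c, depth
        self.N = defaultdict(int)    # visits per (history, action)
        self.Nh = defaultdict(int)   # visits per history
        self.Q = defaultdict(float)  # average return per (history, action)

    def ucb_action(self, h):
        # unvisited actions score infinity, so each is tried at least once
        def score(a):
            if self.N[(h, a)] == 0:
                return float('inf')
            return self.Q[(h, a)] + self.c * math.sqrt(
                math.log(self.Nh[h]) / self.N[(h, a)])
        return max(self.actions, key=score)

    def simulate(self, s, h, d):
        if d >= self.depth:
            return 0.0
        a = self.ucb_action(h)
        s2, o, r = self.step(s, a)
        ret = r + self.gamma * self.simulate(s2, h + ((a, o),), d + 1)
        # back up the statistics along the visited path
        self.Nh[h] += 1
        self.N[(h, a)] += 1
        self.Q[(h, a)] += (ret - self.Q[(h, a)]) / self.N[(h, a)]
        return ret

    def plan(self, sample_belief, n_sims=200):
        for _ in range(n_sims):
            self.simulate(sample_belief(), (), 0)
        return max(self.actions, key=lambda a: self.Q[((), a)])
```

`sample_belief` stands in for drawing a state from the current belief; the tree itself is stored implicitly in the statistics dictionaries keyed by history.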
The key modifications when applying POMCP to BA-POMDPs are twofold: (1) a simulation starts by sampling a hyper-state (s, χ) from the belief, and (2) the simulated step follows the dynamics of the BA-POMDP (algorithm 1). During this step, first the domain state transitions and an observation is generated according to D_χ (algorithm 1, line 6), which in turn are used to update the counts (algorithm 1, line 7).
Given enough simulations, BA-POMCP converges to the optimal solution with respect to the belief it is sampling states from [katt2017learning]. One can compute this belief naively in closed form in finite state spaces by iterating over all possible next states using the model's dynamics [ross2011bayesian]. This quickly becomes infeasible and is only practical for very small environments. More common approaches approximate the belief with particle filters [Thrun99]. There are numerous methods to update the particle filter after executing an action and receiving an observation, of which Rejection Sampling has traditionally been used for (BA-)POMCP. Importance Sampling [gordon1993novel] (outlined in algorithm 2), however, has been shown to be superior in terms of the chi-squared distance [chen2005another].
In Importance Sampling the belief is represented by a weighted particle filter, where each particle is associated with a weight that represents its probability. Importance Sampling recomputes the new belief given an action and observation with respect to the model's dynamics in two steps. First, each particle is updated using the transition dynamics and then weighted according to the observation dynamics. Note that the sum of weights of the belief after this step represents the likelihood of the belief update at time t. The likelihood of the entire belief given the observed history can be seen as the product of the likelihoods of each update step. In the second step, starting on line 13 of algorithm 2, the belief is resampled, as is the norm in sequential Importance Sampling.
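The two-step update can be sketched as follows (a minimal sequential Importance Sampling pass; `sample_transition` and `obs_likelihood` are assumed stand-ins for the model's transition and observation dynamics):

```python
import random

def importance_sampling_update(particles, weights, a, o,
                               sample_transition, obs_likelihood):
    """One belief update: propagate, re-weight by the observation, resample."""
    new_particles, new_weights = [], []
    for s, w in zip(particles, weights):
        s2 = sample_transition(s, a)                       # propagate particle
        new_particles.append(s2)
        new_weights.append(w * obs_likelihood(o, s2, a))   # observation weight
    eta = sum(new_weights)           # likelihood of this belief update step
    if eta == 0:
        raise ValueError("observation has zero likelihood under the belief")
    norm = [w / eta for w in new_weights]
    # multinomial resampling to combat particle degeneracy
    resampled = random.choices(new_particles, weights=norm, k=len(particles))
    return resampled, [1.0 / len(particles)] * len(particles), eta
```

The returned η is the per-step likelihood discussed above; multiplying the η's over time yields the likelihood of the entire history.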
2.3. Factored Models
Just like in most multivariate processes, the dynamics of the POMDP can often be represented more compactly with graphical models than with tables: conditional independence between variables reduces the parameter space, leading to simpler and more efficient models.
The Factored POMDP (FPOMDP) [boutilier1996computing] represents the states and observations with features and the dynamics as a collection of Bayes nets (BNs), one for each action.
Let the state space be factored into n features, S = ⟨S_1, …, S_n⟩, and the observation space into m features, O = ⟨O_1, …, O_m⟩. Then, more formally, a BN as a dynamics model for a particular action consists of an input node for each state feature S_i and an output node for each state and observation feature S′_i and O_j. The topology G describes the directed edges between the nodes, where the set of possible graphs is restricted such that input nodes only have outgoing edges and observation nodes only have incoming edges. For simplicity we also assume that the output state nodes are conditionally independent of each other. The Conditional Probability Tables (CPTs) describe the probability distribution over the values of the nodes given the values of their parents. The dynamics of an FPOMDP are then defined as follows:

D(s′, o | s, a) = Π_i Pr(s′_i | parents(s′_i), a) · Π_j Pr(o_j | parents(o_j), a)
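The per-node product over the Bayes net can be sketched as follows (illustrative only; we assume state-node parents come from the current state and observation-node parents from the next state, a common simplification, and the data layout is our own):

```python
def factored_step_probability(graphs, cpts, a, s, s_next, o):
    """P(s', o | s, a) as a product of per-feature conditionals.

    graphs[a]['state'][i] lists the parent feature indices of output state
    node i (indices into s); graphs[a]['obs'][j] lists the parents of
    observation node j (indices into s_next). cpts[...] maps a parent-value
    tuple to a distribution over the node's values.
    """
    p = 1.0
    for i, v in enumerate(s_next):                         # output state nodes
        parents = tuple(s[j] for j in graphs[a]['state'][i])
        p *= cpts[a]['state'][i][parents][v]
    for j, v in enumerate(o):                              # observation nodes
        parents = tuple(s_next[k] for k in graphs[a]['obs'][j])
        p *= cpts[a]['obs'][j][parents][v]
    return p
```

Only the CPT entries reachable through the graph's edges are consulted, which is exactly where the parameter-space reduction comes from.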
Some approaches are able to exploit the factorization of FPOMDPs, which typically leads to better solutions [boutilier1996computing]. These methods, however, operate under the assumption that the dynamics are known a priori and hence cannot be applied when this is not the case.
3. Bayesian RL in Factored POMDPs
The BA-POMDP provides a Bayesian framework for RL in POMDPs, but is unable to describe or exploit the structure that many real-world applications exhibit. It also scales poorly, as the number of parameters grows quadratically with the state space, while only one parameter (count) is updated after each observation. Here we introduce the Factored BA-POMDP (FBA-POMDP), the Bayes-Adaptive framework for the factored POMDP, which is able to model, learn and exploit such relations in the environment.
3.1. The Factored BA-POMDP
If we consider the case where the structure G of the underlying POMDP is known a priori, but its parameters are not, then it is clear that we could define a Bayes-Adaptive model where counts χ describe Dirichlet distributions over the CPTs (which we know how to maintain over time). However, this assumption is unrealistic, so we must also consider both the topology G and its parameters as part of the hidden parameters, in addition to the domain state s.
We define the FBA-POMDP as a POMDP with the (hyper) state space S̄ = S × G_A × X. Let us first consider its dynamics D̄((s′, G′, χ′), o | (s, G, χ), a). This joint probability can be factored using the same standard independence assumptions made in the BA-POMDP:

(2)  p(s′, o | s, G, χ, a)
(3)  p(χ′ | χ, G, s, a, s′, o)
(4)  p(G′ | G)
Term (eq. 2) corresponds to the expectation of D under the joint Dirichlet posterior over the CPTs given G and χ. Given the expected CPTs, that probability is described by the product governed by the topology: Π_i Pr(s′_i | parents(s′_i), a) Π_j Pr(o_j | parents(o_j), a). Throughout the rest of the paper we will refer to this probability (eq. 2) as D_{G,χ}(s′, o | s, a).
Term (eq. 3) describes the (deterministic) update operation on the counts that correspond to (s, a, s′, o): χ′ = χ + δ_{sa}^{s′o}. Note that this will update n + m counts, one for each feature. Lastly, we assume the topology G is static over time, which reduces (eq. 4) to the Kronecker delta function 𝟙_{G′ = G}. This leads to the following definition of the FBA-POMDP model, given tuple (S̄, A, O, D̄, R̄, γ, h):
- A, γ, h: identical to the underlying POMDP.
- R̄((s, G, χ), a) = R(s, a): ignores the counts and reduces to the reward function of the POMDP, just like in the BA-POMDP.
- O = O_1 × … × O_m: the set of possible observations defined by their features.
- S̄ = S × G_A × X: the cross product of the domain's factored state space, the set of possible topologies (one for each action a), and their respective Dirichlet distribution counts.
- D̄((s′, G′, χ′), o | (s, G, χ), a): as described above.
A prior for the FBA-POMDP is a joint distribution over the hyper-states b̄_0(s, G, χ). In many applications the influence relationships between features are known a priori for large parts of the domain. For the unknown parts, one could consider a uniform distribution, or distributions that favour few edges.
3.2. Solving FBA-POMDPs
Solution methods for the FBA-POMDP face challenges similar to those for BA-POMDPs with respect to uncountably large (hyper) state spaces, a result of the uncertainty over the current state and the dynamics. So it is only natural to turn to POMCP-based algorithms for inspiration.
BA-POMCP extends MCTS to the Bayes-Adaptive case by initiating simulations with a sample (from the belief) and applying the BA-POMDP dynamics to govern the transitions (recall algorithm 1). We propose a similar POMCP extension for the factored case, FBA-POMCP, where we sample a hyper-state (s, G, χ) at the start of each simulation and apply the FBA-POMDP dynamics to simulate steps. This is best illustrated in algorithm 3, which replaces the step function of algorithm 1. During a step the sampled topology G is used to sample a transition, after which the counts associated with that transition are updated.
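A single simulated FBA-POMCP step can be sketched as follows (illustrative; `sample_from_bn` and `update_counts` are assumed stand-ins for sampling a transition from the particle's Bayes nets and for the count update of eq. 3):

```python
def fba_pomcp_step(hyper_state, a, sample_from_bn, update_counts):
    """One simulated step on a hyper-state (s, G, chi).

    The topology G stays fixed during simulation; only the domain state
    and the particle's own Dirichlet counts change.
    """
    s, graphs, counts = hyper_state
    # sample (s', o, r) from the dynamics implied by this particle's BNs
    s2, o, r = sample_from_bn(graphs, counts, s, a)
    # chi' = chi + delta for the n + m affected feature counts
    counts2 = update_counts(counts, graphs, s, a, s2, o)
    return (s2, graphs, counts2), o, r
```

During planning this step is invoked at every node of the search tree, so each simulation carries its own evolving counts while sharing the sampled topology.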
3.3. Belief tracking & Particle Reinvigoration
Because structures in the particles are not updated over time, and due to particle degeneracy, traditional particle filter belief update schemes tend to converge to a single structure, which is inconsistent with the true posterior and leads to poor performance.
To tackle this issue, we propose an MCMC-based sampling scheme to occasionally reinvigorate the belief with new particles according to the (observed) history.
First we introduce the notation x_{t1:t2}, which describes the (sequence of) values of x from time step t1 to t2 of real interactions with the environment, with the special case of x_t, which corresponds to the value of x at time step t (where x can be a sequence of states, actions or observations). For brevity we also use 'model' and the tuple (G, χ) interchangeably in this section, as they represent the dynamics of a POMDP. Lastly, we refer to t as the last time step in our history.
On the highest level we apply Gibbs sampling, which approximates a joint distribution by sampling variables from their conditional distributions with the remaining variables fixed: we can sample p(x, y) by picking some initial x, and sampling y ~ p(y | x) followed by x ~ p(x | y). Here we take x = s_{0:t} and y = (G, χ), and alternately sample a model given a state sequence and a state sequence given a model:
Sample step (i) samples a state sequence given the observed history and current model. A very simple and naive approach is to use Rejection Sampling: sample state and observation sequences based on the action history and reject them based on observation equivalence, optionally exploiting the independence between episodes. Due to the high rejection rate, this approach is impractical for non-trivial domains. Alternatively, we model this task as sampling from a Hidden Markov Model (HMM), where the transition probabilities are determined by the model (G, χ) and the action history a_{0:t-1}. To sample a hidden state sequence from an HMM given some observations (also called smoothing), one typically uses forward-backward messages to compute the conditional probabilities of the hidden states efficiently [rabiner1986introduction]. We first compute the conditionals with backward messages and then sample s_{0:t} hierarchically in a single forward pass.
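The backward-then-forward sampling pass can be sketched for a discrete HMM as follows (illustrative; the T, O table layout and argument names are our own assumptions):

```python
import random

def sample_state_sequence(T, O, b0, actions, observations, rng=random):
    """Sample s_{0:t} ~ p(s_{0:t} | a_{0:t-1}, o_{1:t}) for a discrete HMM.

    T[a][s][s2] and O[a][s2][o] are the transition and observation
    probabilities; b0 is the initial state distribution.
    """
    n = len(b0)
    steps = len(actions)
    # backward messages: beta[k][s] = p(o_{k+1:t} | s_k = s, a_{k:t-1})
    beta = [[1.0] * n for _ in range(steps + 1)]
    for k in range(steps - 1, -1, -1):
        a, o = actions[k], observations[k]
        for s in range(n):
            beta[k][s] = sum(T[a][s][s2] * O[a][s2][o] * beta[k + 1][s2]
                             for s2 in range(n))
    # single forward pass, sampling hierarchically using the backward messages
    seq = []
    w = [b0[s] * beta[0][s] for s in range(n)]
    seq.append(rng.choices(range(n), weights=w, k=1)[0])
    for k in range(steps):
        a, o = actions[k], observations[k]
        s = seq[-1]
        w = [T[a][s][s2] * O[a][s2][o] * beta[k + 1][s2] for s2 in range(n)]
        seq.append(rng.choices(range(n), weights=w, k=1)[0])
    return seq
```

Each forward draw conditions on the previously sampled state and on the backward message, so the whole sequence is a joint sample from the smoothing distribution rather than a concatenation of marginals.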
The second conditional (ii) in the Gibbs sampling scheme is the distribution p(G, χ | s_{0:t}, a_{0:t-1}, o_{1:t}), which itself is split into two sample steps. We first (a) sample a topology G using Metropolis-Hastings. This is followed by (b) sampling a set of counts χ, which is a deterministic function that simply takes the prior counts and adds the transitions observed in the history (s_{0:t}, a_{0:t-1}, o_{1:t}). For the first sample step (ii a) we start from the general Metropolis-Hastings case:
Metropolis-Hastings samples from some distribution p using a proposal distribution q and an acceptance test. The acceptance probability of a proposal x′ given the current sample x is defined as min(1, [p(x′) q(x | x′)] / [p(x) q(x′ | x)]). More specifically, given some initial value x, Metropolis-Hastings consists of:
1. sample x′ ~ q(x′ | x)
2. with probability min(1, [p(x′) q(x | x′)] / [p(x) q(x′ | x)]): x ← x′
3. store x and go to (1)
If we take p to be the posterior over topologies p(G | s_{0:t}, a_{0:t-1}, o_{1:t}) and q to be domain specific and symmetrical, then we derive the following Metropolis-Hastings acceptance test for (ii a):
This leads to the likelihood ratio between the two graph structures. It has been shown that, given some mild assumptions (such as a Dirichlet prior), this likelihood is given by the BD-score metric [heckerman1995learning]. Given some initial set of prior counts N0_ijk for G, and a dataset of N_ijk occurrences of value k with parent values j for node i provided by (s_{0:t}, a_{0:t-1}, o_{1:t}), the score is computed as follows:

BD(G) = Π_i Π_j [Γ(N0_ij) / Γ(N0_ij + N_ij)] Π_k [Γ(N0_ijk + N_ijk) / Γ(N0_ijk)]

where we abuse notation and denote the total number of counts, Σ_k N_ijk, as N_ij (and similarly N0_ij = Σ_k N0_ijk).
Given this acceptance probability, Metropolis-Hastings can sample a new set of graph structures G with corresponding counts χ for the CPTs. This particular combination of MCMC methods, Metropolis-Hastings inside one of Gibbs's conditional sampling steps, is also referred to as MH-within-Gibbs and, perhaps surprisingly, has been shown to converge to the true distribution even if the Metropolis-Hastings part consists of only 1 sample per step [lee2002particle; martino2015independent; tierney1994markov; robert2013monte; liang2011advanced].
The overall particle reinvigoration procedure, assuming some initial (G, χ), is as follows:
1. sample a state sequence s_{0:t} from the HMM induced by (G, χ)
2. sample a topology G′ with Metropolis-Hastings (using BD-scores)
3. compute counts χ′ from the prior and the transitions in (s_{0:t}, a_{0:t-1}, o_{1:t})
4. add the particle (s_t, G′, χ′) to the belief and go to (1)
It is not necessary to perform this operation at every time step; instead, the likelihood of the current belief is a useful metric to determine when to resample. Fortunately, it is a by-product of Importance Sampling at line 9 of algorithm 2: the total accumulated weight, denoted η (the normalization constant), is the likelihood of the belief update. Starting with η = 1 at t = 0, we maintain the likelihood over time and update the posterior whenever it drops below some threshold.
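Putting the pieces together, the reinvigoration loop can be sketched as follows (the helper functions are assumed stand-ins for Gibbs steps (i), (ii a) and (ii b) described above):

```python
def reinvigorate(history, model, sample_states, sample_graph, count_from,
                 n_particles):
    """MH-within-Gibbs particle reinvigoration sketch.

    sample_states(history, model) draws s_{0:t} from the HMM smoothing
    distribution; sample_graph(history, states, graphs) performs one
    BD-score Metropolis-Hastings step; count_from(history, states, graphs)
    deterministically recounts the Dirichlet parameters from the prior
    plus the sampled transitions.
    """
    graphs, counts = model
    particles = []
    for _ in range(n_particles):
        states = sample_states(history, (graphs, counts))   # Gibbs step (i)
        graphs = sample_graph(history, states, graphs)      # step (ii a)
        counts = count_from(history, states, graphs)        # step (ii b)
        particles.append((states[-1], graphs, counts))      # new particle
    return particles
```

Each iteration extends the same Gibbs chain, so later particles are drawn closer to the joint posterior over state, topology, and counts.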
3.4. Theoretical guarantees
Here we consider theoretical aspects of our proposed solution method.
We first note that FBA-POMCP converges to the optimal solution with respect to the belief, and secondly point out that the proposed belief tracking scheme converges to the true belief.
The analysis in [POMCP] proves that the value function constructed by POMCP, given a suitable exploration constant, converges to the optimal value function with respect to the initial belief. Work on BA-POMCP [katt2017learning] extends the proofs to the BA-POMDP. Their proof relies on the fact that the BA-POMDP is a POMDP (that ultimately can be seen as a belief MDP), and that BA-POMCP simulates experiences with respect to the dynamics D̄. These notions also apply to the FBA-POMDP, and we can directly extend the proofs to our solution method.
Given that FBA-POMCP converges to the optimal value function with respect to the belief, it is important to consider whether the belief resulting from our particle reinvigoration approximates the true posterior (note that we are only concerned with the reinvigoration part of the belief update, as it is widely known that Importance Sampling with particle filters is unbiased). This follows directly from the convergence properties of Gibbs sampling, Metropolis-Hastings and MH-within-Gibbs, which we use to sample from the posterior. Since these methods are unbiased approximations and we use them directly to sample from the true posterior, our solution method converges to the true distribution (given the initial belief).
4. Experiments
Here we provide empirical support for our factored Bayes-Adaptive approach on three domains: the Factored Tiger, Collision Avoidance, and Gridworld problems. The Factored Tiger problem is an extension of the well-known Tiger problem [pomdp], the Collision Avoidance problem is taken from [luo2016importance], and the Gridworld is inspired by navigational tasks. In this section we first describe each domain at a high level, followed by the prior information we assume is given to the agent. For more details please refer to the appendix.
4.1. Setup
The Tiger domain describes a scenario where the agent is faced with the task of opening one of two doors. Behind one door lurks a tiger, a danger with a large negative reward that must be avoided, while the other door opens up to a bag of gold with a positive reward. The agent can choose to open either door (which ends the episode) or to listen, at a small cost, for a signal: a noisy observation that informs the agent of the location of the tiger with fixed accuracy. In the Factored Tiger domain we increase the state space artificially by adding uninformative binary state features. While these features increase the state space, they are stationary over time and do not affect the observation function. From a planning point of view the problem retains its complexity regardless of the number of features; the challenge for a learning agent, however, is to infer the underlying dynamics in the significantly larger domain.
In this particular case, the agent is unsure about the observation function. In particular, the agent's prior belief over the probability of hearing the tiger correctly is inaccurate. The prior belief over the structure of the observation model is uniform. This means that each edge from any of the state features to the observation feature has a 50% chance of being present in a particle in the initial belief.
In the Collision Avoidance problem the agent pilots a plane that flies from right to left (one cell at a time) in a 2-dimensional grid. The agent can choose to stay level for no cost, or move either diagonally up or down at a small cost. The episode ends when the plane reaches the column on the left, where it must avoid collision with a vertically moving obstacle (or face a large penalty). The obstacle movement is stochastic, and the agent observes its location at each time step with some noise. The optimal policy attempts to avoid the obstacle with as little vertical movement as possible.
While we assume the agent knows the observation and transition model of the plane, the agent initially underestimates how often the obstacle moves: it believes the obstacle mostly stays put and moves in either direction with small, equal probability, whereas in reality it moves more frequently. The agent knows a priori that the location of the obstacle in the next state depends on its previous location, but otherwise assumes no initial knowledge of the structure of the model with respect to the location of the obstacle.
The Gridworld is a 2-dimensional grid in which the agent starts in the bottom-left corner and must navigate to a goal cell. The goal cell is chosen from a set of candidates at the start of each episode and can be fully observed by the agent. The agent additionally observes its own location with a noisy localizer. The agent can move in all directions; moves are generally successful in most attempts. There are, however, specific cells that significantly decrease the chance of success, essentially trapping the agent. The objective of the agent is to reach the goal as fast as possible.
In this domain we assume no prior knowledge of the location or the number of 'trap' cells, and the prior assigns the same transition success probability to all cells. The observation model in this domain is considered known. Here we factor the state space into the location of the goal and the position of the agent, and assume the agent knows that its next location depends on its previous one, but is unsure whether the goal location is necessary to model its transition probabilities. This results in a prior belief where all particles contain models in which, for each action, the x and y values of the current location are used to predict the next location of the agent, and half the particles also include the value of the goal cell as an input edge.
(In fig. 1, the shaded areas indicate the confidence interval.)
4.2. Analysis
We compare our method to other solution methods (fig. 1). We consider the BA-POMCP agent as the baseline approach, which ignores factorization and attempts to learn in the problems framed as BA-POMDPs. A second approach, called knows-structure, acts with complete prior knowledge of the structure of the dynamics and can be considered a best-case scenario. This method requires additional knowledge of the domain and can thus be considered 'cheating' compared to the other methods (note, however, that it is more common to know whether domain features share dependencies than it is to know the true probabilities). Thirdly, we test an agent, no-reinvigoration, with the same prior knowledge as our method, except that it does not reinvigorate its belief. The comparison with this approach highlights the contribution of our proposed reinvigoration step to keeping a healthy distribution over the structure of the dynamics.
In order to produce statistically significant results we ran the experiments described above a large number of times. In these experiments, the parameters of the Monte-Carlo Tree Search planner and the Importance Sampling belief update were identical across all solution methods. We refer to the appendix for more details.
We first make the general observation that our method consistently either outperforms or is on par with the other methods, even compared to the knows-structure approach, whose prior knowledge is more accurate.
In the Collision Avoidance domain our method outperforms, with statistical significance, even the knows-structure agent with complete knowledge of the model topology. This implies that particle reinvigoration (with respect to the true, complete posterior) is beneficial even when structure degeneracy is not a problem, because it improves the belief over the model by better approximating the distribution over the counts χ. The dynamics of this problem are particularly subtle, causing the belief over the graph topologies in the no-reinvigoration agent to converge to something different from the true model. As a result, it is unable to exploit the compactness of the true underlying dynamics and its performance is similar to that of BA-POMCP. None of the agents in the Collision Avoidance problem (left graph) have converged yet, due to the limited learning time provided by the episodes (which we cut off in the interest of time).
The Gridworld has comparatively less subtle transitions, and all methods show a generally quicker learning pace than in the other domains (centre figure). Nevertheless, BA-POMCP has not converged to the true model after all episodes, whereas the factored approaches (in particular our method and knows-structure) do so far earlier.
The results on the Factored Tiger problem (right figure) show significantly different behaviour. Firstly, the initial performance of both our method and no-reinvigoration is worse than that of BA-POMCP and knows-structure. The reason becomes obvious once one realizes that, due to the uniform prior over the structure, half of the models in the initial belief contain topologies that cannot represent the intended prior counts. This changes the initial belief and thus the initial performance. With a uniform prior over the structure and without reinvigoration, the agent can accidentally end up converging to a model structure that is unable to express the true underlying dynamics. This is shown starkly by no-reinvigoration, where the average performance decreases over time. More detailed qualitative analysis showed that in most runs the agent actually performs similarly to the other FBA-POMDP agents; however, every now and then the belief converges to a model structure that does not include the tiger location as a parent feature in the observation model, leading to a policy that opens doors randomly with a poor expected return. Lastly, the lack of improvement of the BA-POMCP agent emphasises the need for factored representations most of all. Due to the large state space, the number of variables in the observation model grows too large to learn individually. As a result, even this many episodes are not enough for the agent to learn a model.
5. Related work
Much of the recent work on Reinforcement Learning in partially observable environments has applied Deep Reinforcement Learning to POMDPs. To tackle the issue of remembering past observations, researchers have employed recurrent networks [hausknecht2015deep; zhu2018improving]. Others have introduced inductive biases into the network in order to learn a generative model that imitates belief updates [igl2018deep]. While these approaches are able to tackle large-scale problems, they are not Bayesian and hence do not share the same theoretical guarantees.
More traditional approaches include the U-Tree algorithm [mccallum1996reinforcement] (and its modifications), EM-based algorithms such as [liu2013online], and policy gradient descent methods [baxter2000direct]. One of their main drawbacks is that they do not address the fundamental challenge of the exploration-exploitation tradeoff in POMDPs.
There are other approaches that directly address this issue. The Infinite-RPR [liu2011infinite] is an example of a model-free approach, while the Infinite-POMDP [doshi2015bayesian] tackles it in a model-based fashion. The latter is similar to ours in the sense that it learns a model in a Bayesian manner; however, its assumptions about prior knowledge and about what is being learned are different.
6. Conclusion
This paper addresses the void in Bayesian Reinforcement Learning models and methods at the intersection of factored models and partially observable domains. Our approach describes the dynamics of the POMDP in terms of graphical models and allows the agent to maintain a joint belief over the state, the graph structure, and the CPT parameters simultaneously. Alongside the framework we introduced FBA-POMCP, a solution method that consists of an extension of Monte-Carlo Tree Search to FBA-POMDPs, in addition to a particle-reinvigorating belief tracking algorithm. The method is guaranteed to converge to the optimal policy with respect to the initial belief, as both the planner and the belief update are unbiased. Lastly, we compared it to the current state-of-the-art approach on three different domains. The results show the significance of representing and recognizing independent features, as our method either outperforms BA-POMDP-based agents or is able to learn in scenarios where tabular methods are not feasible at all.
Acknowledgements
Christopher Amato and Sammie Katt are funded by NSF Grant #1734497, Frans Oliehoek is funded by EPSRC First Grant EP/R001227/1, and ERC Starting Grant #758824—INFLUENCE.
References
 (1) Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine Learning 47, 2-3 (2002), 235–256.
 (2) Baxter, J., and Bartlett, P. L. Direct gradient-based reinforcement learning. In Circuits and Systems, 2000. Proceedings. ISCAS 2000 Geneva. The 2000 IEEE International Symposium on (2000), vol. 3, IEEE, pp. 271–274.
 (3) Boutilier, C., and Poole, D. Computing optimal policies for partially observable decision processes using compact representations. In Proceedings of the National Conference on Artificial Intelligence (1996), pp. 1168–1175.
 (4) Chen, Y. Another look at rejection sampling through importance sampling. Statistics & Probability Letters 72, 4 (2005), 277–283.
 (5) Doshi-Velez, F., Pfau, D., Wood, F., and Roy, N. Bayesian nonparametric methods for partially observable reinforcement learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 37, 2 (2015), 394–407.
 (6) Gordon, N. J., Salmond, D. J., and Smith, A. F. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. In IEE Proceedings F (Radar and Signal Processing) (1993), vol. 140, IET, pp. 107–113.
 (7) Hausknecht, M., and Stone, P. Deep recurrent Q-learning for partially observable MDPs. arXiv preprint arXiv:1507.06527 (2015).
 (8) Heckerman, D., Geiger, D., and Chickering, D. M. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning 20, 3 (1995), 197–243.
 (9) Igl, M., Zintgraf, L., Le, T. A., Wood, F., and Whiteson, S. Deep variational reinforcement learning for POMDPs. arXiv preprint arXiv:1806.02426 (2018).
 (10) Kaelbling, L. P., Littman, M. L., and Cassandra, A. R. Planning and acting in partially observable stochastic domains. Artificial Intelligence 101, 1-2 (1998), 99–134.
 (11) Katt, S., Oliehoek, F. A., and Amato, C. Learning in POMDPs with Monte Carlo tree search. In International Conference on Machine Learning (2017), pp. 1819–1827.
 (12) Lee, D. S., and Chia, N. K. A particle algorithm for sequential Bayesian parameter estimation and model selection. IEEE Transactions on Signal Processing 50, 2 (2002), 326–336.
 (13) Liang, F., Liu, C., and Carroll, R. Advanced Markov chain Monte Carlo methods: learning from past samples, vol. 714. John Wiley & Sons, 2011.
 (14) Liu, M., Liao, X., and Carin, L. The infinite regionalized policy representation. In Proceedings of the 28th International Conference on Machine Learning (ICML-11) (2011), Citeseer, pp. 769–776.
 (15) Liu, M., Liao, X., and Carin, L. Online expectation maximization for reinforcement learning in POMDPs. In IJCAI (2013), pp. 1501–1507.
 (16) Luo, Y., Bai, H., Hsu, D., and Lee, W. S. Importance sampling for online planning under uncertainty. In Workshop on Algorithmic Foundations of Robotics (2016).
 (17) Martino, L., Read, J., and Luengo, D. Independent doubly adaptive rejection Metropolis sampling within Gibbs sampling. IEEE Transactions on Signal Processing 63, 12 (2015), 3123–3138.
 (18) McCallum, A. K., and Ballard, D. Reinforcement learning with selective perception and hidden state. PhD thesis, University of Rochester, Dept. of Computer Science, 1996.
 (19) Rabiner, L. R., and Juang, B.-H. An introduction to hidden Markov models. IEEE ASSP Magazine 3, 1 (1986), 4–16.
 (20) Robert, C., and Casella, G. Monte Carlo statistical methods. Springer Science & Business Media, 2013.
 (21) Ross, S., and Pineau, J. Model-based Bayesian reinforcement learning in large structured domains. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (2008), NIH Public Access, p. 476.
 (22) Ross, S., Pineau, J., Chaib-draa, B., and Kreitmann, P. A Bayesian approach for learning and planning in partially observable Markov decision processes. The Journal of Machine Learning Research 12 (2011), 1729–1770.
 (23) Silver, D., and Veness, J. Monte-Carlo planning in large POMDPs. In Advances in Neural Information Processing Systems (2010), pp. 2164–2172.
 (24) Thrun, S. Monte Carlo POMDPs. In NIPS (1999), vol. 12, pp. 1064–1070.
 (25) Tierney, L. Markov chains for exploring posterior distributions. The Annals of Statistics (1994), 1701–1728.
 (26) Zhu, P., Li, X., Poupart, P., and Miao, G. On improving deep reinforcement learning for POMDPs. arXiv preprint arXiv:1804.06309 (2018).