Sequential decision making under uncertainty is both critical and challenging. Partially Observable Markov Decision Processes (POMDPs) are a general and systematic framework for solving such decision-making problems. Although finding optimal strategies under the POMDP framework is computationally intractable, advances have been made in computing approximately optimal strategies [2, 3, 4, 5]. In fact, we now have algorithms that can find near-optimal strategies within reasonable time and that have been applied to various realistic robotics problems (e.g., [6, 7, 8]).
With POMDP solving becoming practical and POMDPs starting to be used in practice, the problem of generating a good POMDP model for a given problem becomes increasingly important. A POMDP model is defined by six components: The states the system might be in, the actions it can perform, the observations it can perceive, system dynamics which represent uncertainty in the effect of actions, an observation function which represents uncertainty in sensing, and a reward function from which the objective function is derived. While the first three components are easy to define, the last three are more difficult due to uncertainty in the system and imperfect or even non-existent measurements to assess them.
Many machine learning techniques have been proposed to alleviate the above difficulty [9, 10]. They can be divided into two broad classes. The first is model-free learning, where the system learns a direct mapping from environment information to strategies, bypassing model generation. The second is model-based learning, where the system first learns the model, and strategies are then generated by applying model-based planning techniques to the learned model.
Recently, deep neural networks have been proposed to combine model-free learning and model-based planning [1, 11]. These works learn a direct mapping from environment information to strategies. However, internally these methods learn a POMDP model (or in the case of [11], an MDP model, a sub-class of POMDP where states are fully observable) and use a planning module, embedded inside the neural network, to generate the strategy. The objective of the model learning component here is not to generate the most accurate model, but rather to generate a useful approximate model that will maximise policy performance when used together with the embedded planning algorithm. The results have been promising. However, they assume the system dynamics (i.e., the transition function) to be the same everywhere, regardless of the geometry of the underlying environment, which often limits the expressiveness of the model and restricts the effectiveness of planning.
To relax the above assumption, we propose a novel neural network architecture, called TransNet. Key to TransNet is a differentiable neural network module that learns non-uniform transition dynamics efficiently by assuming that states with similar local characteristics have similar dynamics. This module divides the state space into classes, where each class corresponds to a unique transition function. The transition probabilities for each class are then represented by the channels of a kernel in a convolution layer. This technique allows distinct transition dynamics to be applied to states with different local characteristics while still allowing the use of existing efficient implementations of convolutional network layers. TransNet uses this novel neural network module together with the overall architecture of state-of-the-art QMDP-Net to solve POMDPs with a priori unknown model and non-uniform transition dynamics.
Simulation experiments on various navigation benchmark problems with and without dynamic elements indicate that compared to QMDP-Net, TransNet requires substantially less training time and data to produce policies with better quality: In some cases, TransNet uses less than 20% of the training data used by QMDP-Net to generate policies with similar quality. Our results also indicate that TransNet provides substantially better generalization capability than QMDP-Net.
2.1 POMDP Framework
Formally, a POMDP is described by a 7-tuple $\langle S, A, O, T, Z, R, \gamma \rangle$, where $S$ is the set of states, $A$ is the set of actions, and $O$ is the set of observations. At each step, the agent is in some hidden state $s \in S$, takes an action $a \in A$, and moves from $s$ to another state $s' \in S$ according to a conditional probability distribution $T(s, a, s') = P(s' \mid s, a)$, called the transition probability. The current state $s'$ is then partially revealed via an observation $o \in O$ drawn from a conditional probability distribution $Z(s', a, o) = P(o \mid s', a)$ that represents uncertainty in sensing. After each step, the agent receives a reward $R(s, a)$ if it takes action $a$ from state $s$.
Due to the uncertainty in the effects of actions and in sensing, the agent never knows its exact state. Instead, it maintains an estimate of its current state in the form of a belief $b$, which is a probability distribution over $S$. At the end of each step, the agent updates its belief in a Bayesian manner, based on the belief at the beginning of the step along with the action performed and the observation perceived in this step.
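The objective of a POMDP agent is to maximize its expected total reward, called the value function, by following the best policy (or strategy) at each time step. A policy $\pi$ is a mapping from beliefs to actions. Each policy $\pi$ induces a value function $V_\pi(b)$ for any belief $b$, which is computed as:
$$V_\pi(b) = R(b, \pi(b)) + \gamma \sum_{o \in O} P(o \mid b, \pi(b)) \, V_\pi(\tau(b, \pi(b), o)),$$
where $R(b, a) = \sum_{s \in S} R(s, a) \, b(s)$ is the expected immediate reward. The notation $\tau(b, a, o)$ represents the new belief $b'$ of the agent after it performs action $a$ and perceives observation $o$ afterwards. It is computed as $b'(s') = \eta \, Z(s', a, o) \sum_{s \in S} T(s, a, s') \, b(s)$, where $T$ and $Z$ are the transition and observation probabilities ($\eta$ is a normalizing factor). When the planning horizon is infinite, to ensure the problem is well defined, rewards at subsequent time steps are discounted by a constant factor $\gamma \in (0, 1)$. The best policy $\pi^*$ is one that maximizes the value function at each belief $b$.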
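As an illustration of the Bayesian belief update described above, the following minimal Python sketch implements the standard discrete Bayes filter, in which the new belief in state s' is proportional to Z(s', a, o) multiplied by the sum over s of T(s, a, s') b(s). The nested-list layout `T[a][s][s']` and `Z[a][s'][o]`, the function name `belief_update`, and the two-state numbers are assumptions made for this example only; they are not taken from the experiments.

```python
def belief_update(belief, action, observation, T, Z):
    """Discrete Bayes filter: b'(s') = eta * Z[a][s'][o] * sum_s T[a][s][s'] * b(s).

    belief      -- list of probabilities over states
    T[a][s][s2] -- assumed layout: probability of moving from s to s2 under action a
    Z[a][s2][o] -- assumed layout: probability of observing o in s2 after action a
    """
    n = len(belief)
    # Prediction step: push the belief through the transition model.
    predicted = [
        sum(T[action][s][s2] * belief[s] for s in range(n)) for s2 in range(n)
    ]
    # Correction step: weight each state by the likelihood of the observation.
    unnormalized = [Z[action][s2][observation] * predicted[s2] for s2 in range(n)]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("observation has zero probability under the current belief")
    # Normalization (the factor eta in the text).
    return [p / total for p in unnormalized]


# Hypothetical two-state, one-action, two-observation model.
T = [[[0.9, 0.1], [0.2, 0.8]]]
Z = [[[0.8, 0.2], [0.3, 0.7]]]
b0 = [0.5, 0.5]
b1 = belief_update(b0, action=0, observation=0, T=T, Z=Z)
print(b1)
```

The prediction and correction steps are kept separate on purpose, since the belief update block of TransNet described later follows the same two-stage order.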
2.2 Related Work
Recently, there has been a growing body of work that applies model-free deep learning to solve large-scale POMDPs when the model is not fully known. For instance, [13] implemented a variation of DQN [14] which replaces the final fully connected layer with a recurrent LSTM layer to solve partially observable variants of Atari games. The work in [15] applied convolutional neural networks with multiple recurrent layers to the task of navigating within a partially observable maze environment. The learned policy is able to generalise to different goal positions within the learned maze, but not to previously unseen maze environments.
More recently, greater success has been achieved with methods that embed specific computational structures representing a model and an algorithm within a neural network and train the network end-to-end, a hybrid approach which has the potential to combine the benefits of both model-based and model-free methods. For instance, [11] developed a differentiable approximation of value iteration embedded within a convolutional neural network to solve fully observable Markov Decision Process (MDP) problems in discrete space, while [16] implemented a network with specific embedded computational structures to address the problem of path integral optimal control with continuous state and action spaces. These works focus only on cases where the underlying state is fully observable.
By combining the ideas in the above work with recent work on embedding Bayesian filters in deep neural networks [17, 18, 19], one can develop neural network architectures that combine model-free learning and model-based planning for POMDPs. For instance, [20] developed a network that implements an approximate POMDP algorithm based on [21] by combining an embedded value iteration module with an embedded Bayesian filter. Modules are trained separately, with a focus on learning transition and reward models rather than directly learning a policy.
More recently, [1] developed QMDP-Net, which implements an approximate POMDP algorithm to predict approximately optimal policies for tasks in a parameterised domain of environments. Policies are learned end-to-end, focusing on learning an “incorrect but useful” model that optimises policy performance rather than model accuracy. However, the embedded model is restricted to using a simple transition model which assumes all states have the same transition dynamics. The transition function is represented as a kernel whose depth is the same as the size of the action space. The same learned kernel is applied to each state in the state space. This representation of the transition function enables the dynamics learned for one state to be generalised to other states, reducing the amount of training data needed to learn transition dynamics for all states. But as a result, QMDP-Net cannot represent non-uniform transition dynamics. TransNet relaxes this restriction, while maintaining data efficiency.
TransNet learns a near-optimal policy end-to-end for acting in a parameterized set of partially observable scenarios, each identified by a parameter $\theta \in \Theta$, where $\Theta$ is the set of all possible parameter values. Each parameter $\theta$ describes properties of the scenarios such as obstacle geometry and materials, position of static and dynamic obstacles, goal location, and initial belief distribution for a given task and environment. TransNet assumes that the problems of deciding how to act in the various scenarios in this set are defined as POMDPs with a common state space $S$, action space $A$ and observation space $O$, but without a priori known transition, observation, and reward functions. TransNet learns the parameterized transition, observation, and reward functions suitable to generate a good policy for the set of scenarios, as it learns the policy.
Similar to QMDP-Net, TransNet’s overall structure is a Recurrent Neural Network with two interleaving blocks: Planning and Belief update. Figure 1(a) illustrates this network. However, unlike QMDP-Net, in each block TransNet uses a neural network module, described in the following subsection, to learn a transition function that depends on both actions and local characteristics of the states, rather than actions alone, thereby allowing more expressive POMDP models to be learnt while maintaining data efficiency.
3.1 Learning Non-Uniform Transition Dynamics
Key to TransNet is a neural network module for learning the transition function of a set of parameterized POMDPs. Consider the POMDP problem that corresponds to a given scenario parameter. To learn its transition function, the neural network module represents the function by a combination of a learned kernel and a classification function. The classification function is a surjection that maps each state to a class index, based on features of the scenario parameter. The kernel represents the probability of transitioning into each of the states in a local neighbourhood for each action and each class, with separate channels representing different pairs of actions and classes. The pair of action and class index is then used to select the suitable kernel channel.
Two properties are desirable for the classification function. First, states with similar local characteristics should map to the same class, and states with highly dissimilar characteristics should map to different classes. Second, the number of distinct state classes produced by the classification should be large enough to represent the important distinct modes of the transition dynamics, but small enough to ensure that information learned about the dynamics of one state is allowed to generalise to as many other appropriate states as possible.
To generate the above desirable properties, in this work, the classification function is constructed by selecting a number of features of the scenario parameter which correspond to the local features. The classification function then maps each state to a class index based on the combination of feature values of the state . Let be the number of features and … be the values of the features of state . The classification function is:
where is the maximum value of any feature of any state . The class index of state indicates the transition model to use at . We denote the image of this function, which represents the set of possible state classes, as .
As an example, in a 2D robot navigation problem where includes an image indicating whether each cell in the environment is an obstacle (represented by 1) or free space (represented by 0), the features can be selected to be the values of the cells to the north, south, east and west of the current cell based on this image. The function is then defined as . When a state is blocked by obstacles in only its north and east side for instance, . Of course, the image does not have to be binary. It may also represent information such as terrain types, obstacle types with different elasticity, areas of the environment which are subject to change over time, etc., allowing this representation to generalise to a wide range of scenarios.
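One encoding that is consistent with the description above is a positional (mixed-radix) code in which each feature value is treated as a digit in base m + 1, where m is the maximum feature value. The sketch below illustrates this for the four-neighbour obstacle features of the navigation example. The encoding itself, the feature order (north, south, east, west), the treatment of cells outside the map as obstacles, and all function names are assumptions of this sketch, so the class indices it produces need not coincide with those used by TransNet.

```python
def class_index(features, max_value):
    """Map a list of feature values in 0..max_value to a single class index.

    Assumed encoding: feature i contributes features[i] * (max_value + 1) ** i.
    """
    base = max_value + 1
    index = 0
    for i, value in enumerate(features):
        if not 0 <= value <= max_value:
            raise ValueError("feature value out of range")
        index += value * base ** i
    return index


def neighbour_features(grid, row, col, outside=1):
    """Values of the cells to the north, south, east and west of (row, col)."""
    def cell(r, c):
        if 0 <= r < len(grid) and 0 <= c < len(grid[0]):
            return grid[r][c]
        return outside  # assumption: the map is surrounded by obstacles

    return [cell(row - 1, col), cell(row + 1, col),
            cell(row, col + 1), cell(row, col - 1)]


def class_image(grid, max_value=1):
    """Class index of every cell, computed for the whole map at once."""
    return [
        [class_index(neighbour_features(grid, r, c), max_value)
         for c in range(len(grid[0]))]
        for r in range(len(grid))
    ]


# Hypothetical 3x3 obstacle map: 1 = obstacle, 0 = free space.
grid = [
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
]
classes = class_image(grid)
# The centre cell is blocked to the north and east only, so under the assumed
# feature order its features are [1, 0, 1, 0] and its class index is 1 + 4 = 5.
print(classes[1][1])
```

With binary features and four neighbours this encoding yields at most 16 classes, which keeps the number of distinct transition models small while still separating states with different local obstacle layouts.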
To avoid creating a bottleneck in the network, the classification function is implemented as a matrix operation in existing tensor libraries, allowing an image representing the class of every state in the state space to be computed efficiently for all states at once. Furthermore, a one-hot mapping is applied to the output of this function, which is then used to index into the channel corresponding to the local characteristics of each state using efficient matrix multiplication and summation operations. An illustration of TransNet for a problem where the state space consists of two state variables is shown in Figure 1(b).
The above manual selection of features and algorithmic classification could be replaced by an additional convolutional neural network, allowing important features which influence transition dynamics to be learned adaptively.
Now, one may question whether the same results can be achieved if we simply augment the state space with a state variable that indicates the class index of the original states, and then apply QMDP-Net to the augmented state space as is. The answer is negative. Let us set aside the computational complexity issue caused by the substantially enlarged state space. Since the transition function that QMDP-Net learns does not depend on states, this strategy of augmenting the state space will still “force” a single transition function to be used for all states, including states in different classes. In contrast, TransNet learns multiple transition functions, one for each class.
Note also that this module is general enough that it can be combined with any neural network architecture that embeds POMDP/MDP planning with an initially unknown transition function. However, TransNet combines this module with QMDP-Net and embeds the module within every planning and belief update block. The following two subsections provide more details on this embedding.
The planning component of TransNet consists of a repeating block structure in which each block represents a single step of value iteration, and blocks can be stacked to arbitrary depth to produce any desired planning horizon. Each block takes a value image as input and produces as output the values updated by one additional planning step. The input to the first block is taken from the predicted immediate rewards.
TransNet convolves the input with the neural network module for learning the transition function. This module has one output channel for each pair of action and class index. The result of the convolution is a layer that represents the Q-values for each combination of state, action and class index. Since, for any scenario, the classification function is a surjection that assigns each state a single class index, we only need to select the Q-values for the class that matches the class of the state. Therefore, the Q-values are multiplied with the one-hot representation of the state class image, before being summed over the axis corresponding to the class index. This has the effect of selecting the correct Q-values for each state, and discarding all other invalid Q-values. These corrected Q-values are re-weighted by the belief. The maximum of these corrected Q-values over all actions is then selected via a max-pooling layer to produce the updated value. The architecture of this block is illustrated in Figure 2(a).
This implementation is a compromise: it sacrifices space efficiency by computing and temporarily storing Q-values for classes that do not match the class of each state, in order to use existing highly optimised implementations of convolutional network layers, without which training the network is infeasible.
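The planning step described above can be sketched in plain Python for a small grid. The sketch computes values for every pair of action and class by convolution, uses a one-hot class image to keep only the entry that matches each state's class, and then takes the maximum over actions. Adding the immediate reward and a discount factor follows the standard value iteration backup and is an assumption here, as are the 3x3 kernel size, the zero padding, the data layout, and all names; the re-weighting by the belief is omitted for brevity.

```python
def conv_same(image, kernel):
    """Zero-padded 'same' correlation of a 2D image with a 3x3 kernel.

    kernel[dr + 1][dc + 1] is the assumed probability of moving by (dr, dc).
    """
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            total = 0.0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < h and 0 <= cc < w:
                        total += kernel[dr + 1][dc + 1] * image[rr][cc]
            out[r][c] = total
    return out


def planning_step(value, reward, kernels, classes, num_classes, gamma=0.99):
    """One value-iteration step with one transition kernel per (action, class)."""
    h, w = len(value), len(value[0])
    # One-hot representation of the state class image: one_hot[k][r][c].
    one_hot = [
        [[1.0 if classes[r][c] == k else 0.0 for c in range(w)] for r in range(h)]
        for k in range(num_classes)
    ]
    new_value = [[float("-inf")] * w for _ in range(h)]
    for a in range(len(kernels)):
        # Expected next values for every class, including non-matching ones.
        per_class = [conv_same(value, kernels[a][k]) for k in range(num_classes)]
        for r in range(h):
            for c in range(w):
                # Multiply by the one-hot mask and sum over the class axis.
                expected = sum(
                    one_hot[k][r][c] * per_class[k][r][c] for k in range(num_classes)
                )
                q = reward[a][r][c] + gamma * expected
                # Maximum over actions (the max-pooling step).
                new_value[r][c] = max(new_value[r][c], q)
    return new_value


# Hypothetical 1x3 corridor: class 0 = free to the east, class 1 = blocked.
STAY = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
EAST_FREE = [[0.0, 0.0, 0.0], [0.0, 0.2, 0.8], [0.0, 0.0, 0.0]]
kernels = [
    [STAY, STAY],       # action 0 (stay) for class 0 and class 1
    [EAST_FREE, STAY],  # action 1 (east): succeeds in class 0, blocked in class 1
]
value = [[0.0, 0.0, 1.0]]
reward = [[[0.0, 0.0, 0.0]], [[0.0, 0.0, 0.0]]]
classes = [[0, 0, 1]]
new_value = planning_step(value, reward, kernels, classes, num_classes=2, gamma=0.9)
print(new_value)
```

The inner loop makes the space trade-off visible: `per_class` holds a full value image for every class, although only one entry per state survives the masking.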
3.3 Belief Update
A POMDP agent maintains a belief, which is updated at each time step using a Bayesian filter. To this end, TransNet interleaves the planning block with the belief update block. The belief update block takes a prior belief, an action and an observation as input, and produces the updated belief as output, which is stored as the prior belief for the next action selection.
To compute the updated belief, TransNet convolves the prior belief with the neural network module for learning the transition function. The result of the convolution is an image with one channel for each pair of action and class index, representing the updated probability of being in each state for each combination of action and class index. The one-hot representation of the classes is used to select only the values for which the class matches the class of the state. A one-hot representation of the action applied at the current time step is then used to select the values for which the action matches the applied action. The resulting image represents the belief after accounting for the effect of the transition dynamics. A one-hot representation of the received observation is used to index into the predicted observation model image, producing an image that indicates the predicted probability of receiving that observation in each state. Finally, this image is used to weight the belief obtained after the transition step to produce the complete updated belief image. The architecture of a belief update block is shown in Figure 2(b).
4 Computational Complexity
A key challenge in allowing non-uniform transition dynamics to be represented in a neural network structure is the complexity in terms of the number of trainable weights. By using classification, TransNet significantly reduces this complexity.
The number of trainable weights of TransNet is , where is the size of the action space, is the number of distinct state classifications, and is a constant that represents the size of the convolution filter for a single action and class. This constant is in general much smaller than the size of the state space. In the experiments presented in Section 5, were found to be practical and provide a good trade-off between cost and performance.
The relatively low complexity would persist if the classification function based on manually selected features was replaced by an additional neural network. Suppose a network with a single convolutional layer with kernel width and a single fully connected layer with
hidden neurons is used to assign a classification to each state. The number of weights in this network will be . Combined with the learned transition probabilities for each class, the total number of weights used by the transition model is , giving complexity in terms of the number of trainable weights of , a reduction by a factor of approximately .
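To make the scale of these counts concrete, the following sketch compares three ways of parameterizing local transition kernels. It assumes that the number of weights is the product of the quantities named above (size of the action space, number of state classes, and filter size per action and class). The per-state baseline, the function names, and the example values (a 10x10 grid, 5 actions, a 3x3 filter, 16 classes) are hypothetical and chosen only for illustration.

```python
def shared_kernel_weights(num_actions, kernel_size):
    """One kernel per action, shared by all states (uniform dynamics)."""
    return num_actions * kernel_size


def class_kernel_weights(num_actions, num_classes, kernel_size):
    """One kernel per (action, class) pair, as in the class-based module."""
    return num_actions * num_classes * kernel_size


def per_state_kernel_weights(num_states, num_actions, kernel_size):
    """Hypothetical baseline: a separate kernel for every (state, action) pair."""
    return num_states * num_actions * kernel_size


# Hypothetical example values, not taken from the experiments.
num_states = 10 * 10   # 10x10 grid
num_actions = 5
kernel_size = 3 * 3    # 3x3 local neighbourhood
num_classes = 2 ** 4   # four binary neighbour features

shared = shared_kernel_weights(num_actions, kernel_size)
per_class = class_kernel_weights(num_actions, num_classes, kernel_size)
per_state = per_state_kernel_weights(num_states, num_actions, kernel_size)
print(shared, per_class, per_state)
```

Under these assumptions the class-based count does not grow with the size of the state space, whereas the per-state baseline grows linearly with it.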
5.1 Experimental Setup
To understand the practical performance of TransNet, we compared TransNet with state-of-the-art QMDP-Net. TransNet’s results are based on an implementation developed on top of the software released by the QMDP-Net authors, while QMDP-Net results are based on their released code.
Both networks are trained via imitation learning using the same set of expert trajectories, with the expert trajectories generated by applying the QMDP algorithm to manually constructed ground-truth POMDP models. Only trajectories where the expert was successful were included in the training set. The networks interact only with the expert trajectories and not with the ground-truth model. All hyper-parameters for both networks are set to match those used in the QMDP-Net experiments.
Training was conducted using CPU only on a machine with Intel Core-i7 7700 processor and 8GB RAM. We tested the networks on four domains:
Gridworld Navigation: A robot navigation problem in a general 2D grid setting with noisy state transitions and limited observations. The robot is given a map of obstacle positions, a specified goal location, and initial belief distribution. The robot must localise itself and navigate to the goal. At each time step, the robot selects a direction to move in, and receives a noisy observation indicating whether an obstacle is present in each of the “north”, “south”, “east” and “west” directions. The obstacle configuration is generated uniformly at random, with the constraint imposed that all non-obstacle cells are mutually reachable via some path.
Maze Navigation: Similar to the gridworld navigation task, but with obstacle configuration generated using randomized Prim’s algorithm. This results in expert trajectories typically being longer than in the general grid domain requiring longer term planning. This environment is also highly dependent on the planner’s ability to identify dead-end passages.
Dynamic Maze Environments: A navigation problem in a maze environment with structure that mutates during run-time in a way which qualitatively affects the optimum policy, designed to measure the robustness of a policy to dynamic environments.
A maze is initially constructed using randomized Prim’s algorithm. The maze is divided into 2 partitions, with 2 cells from the border selected to be gates. At each time step, exactly one gate is open and the gates will swap from open to closed and vice versa with certain probability. The start and goal position are selected such that a gate swap will cause the optimum solution to be qualitatively changed. Figure 3 illustrates an example. Two variations of this scenario are evaluated:
V1: The network is trained using only expert trajectories from the static maze navigation task. The environment image provided in shows only the positions of current free spaces and current obstacles, without special marking for open or closed gates.
V2: The network is trained using trajectories based on an expert which plans on a dynamic ground truth POMDP model, allowing the expert to decide whether to wait for a nearby closed gate to open. The environment image received by the agent denotes the position of the gate which is currently open. This may allow the agent to learn to intelligently decide whether to move or wait for the currently open gate to change. The closed gate is not represented in the image.
Large Scale Realistic Environments: A navigation problem in realistic environments modelled on the LIDAR maps from the Robotics Data Set Repository  with noisy actions and limited, unreliable observations. The network is trained on a set of randomly generated stochastic grid environments, with the resulting policy then applied to the realistic environments, which have dimensions in the order of .
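For readers unfamiliar with the maze generator mentioned in the maze domains above, the following is a minimal sketch of randomized Prim's algorithm on a grid. It is not the generator used in the experiments: the wall-and-cell layout (free cells at odd coordinates, a solid outer wall), the start cell, and all names are assumptions of this sketch.

```python
import random


def prim_maze(rows, cols, seed=None):
    """Generate a maze with randomized Prim's algorithm.

    rows and cols are assumed to be odd. Returns a grid with 1 = wall, 0 = free.
    """
    rng = random.Random(seed)
    grid = [[1] * cols for _ in range(rows)]
    grid[1][1] = 0
    frontier = []  # entries: (wall_row, wall_col, cell_row, cell_col)

    def add_frontier(r, c):
        # Candidate cells lie two steps away; the wall lies in between.
        for dr, dc in ((-2, 0), (2, 0), (0, -2), (0, 2)):
            nr, nc = r + dr, c + dc
            if 0 < nr < rows - 1 and 0 < nc < cols - 1 and grid[nr][nc] == 1:
                frontier.append((r + dr // 2, c + dc // 2, nr, nc))

    add_frontier(1, 1)
    while frontier:
        # Pick a random frontier wall and carve it if it leads to a new cell.
        wr, wc, nr, nc = frontier.pop(rng.randrange(len(frontier)))
        if grid[nr][nc] == 1:
            grid[wr][wc] = 0
            grid[nr][nc] = 0
            add_frontier(nr, nc)
    return grid


def reachable_free_cells(grid, start=(1, 1)):
    """Count free cells reachable from start using 4-connected moves."""
    seen = {start}
    stack = [start]
    while stack:
        r, c = stack.pop()
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < len(grid) and 0 <= nc < len(grid[0])
                    and grid[nr][nc] == 0 and (nr, nc) not in seen):
                seen.add((nr, nc))
                stack.append((nr, nc))
    return len(seen)


maze = prim_maze(9, 9, seed=0)
for row in maze:
    print("".join("#" if cell else "." for cell in row))
```

Because the carved passages form a spanning tree over the cells, every free cell is reachable from every other one, and the resulting corridors contain many dead ends, which is consistent with the longer expert trajectories noted for the maze domain.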
5.2 Results and Discussion
|Converged Policy||Policy after Similar Training Time|
|Domain|Policy|Success rate (%)|Avg. steps|Collision rate (%)|
|Grid 10x10 D|Expert|95.0|7.4|0.0|
|Grid 10x10 S|Expert|98.0|15.5|6.8|
|Maze 9x9 S|Expert|88.4|15.5|10.5|
|Dynamic Maze V1 9x9 S|Expert|85.2|23.3|13.1|
|Dynamic Maze V2 9x9 S|Expert|89.8|19.2|11.8|
Table 1 presents comparisons of the success rate, average number of steps, and collision rate of executing the policies generated by QMDP-Net and by TransNet. Training is conducted until convergence, but policies are output at a regular interval of 50 epochs. Training uses 10,000 different scenarios, comprising 2,000 different environments and 5 different trajectories per environment. Policy evaluation is conducted on 500 different scenarios, comprising 100 new environments and 5 different trajectories per environment.
The results indicate that TransNet consistently produces substantially better policies than QMDP-Net and outperforms the training expert trajectories more consistently than QMDP-Net. The left side of Table 1 presents the results when training is run until convergence, together with a comparison with the expert trajectories. In most cases, the number of epochs required to achieve convergence is lower for TransNet than for QMDP-Net. Moreover, compared to QMDP-Net, TransNet converges to policies of better quality. The right side of Table 1 presents the results when the training times are similar, with slightly more time given to QMDP-Net. They indicate that although TransNet requires more training time per epoch than QMDP-Net, TransNet uses less time to generate policies of better quality.
The results also demonstrate TransNet is significantly more robust than QMDP-Net in dynamic environments. The success rate and collision rate of TransNet are not substantially degraded by the introduction of dynamic environment elements, and performance remains at or above the level of the QMDP expert trajectories.
|Environment|Policy|Success rate (%)|Avg. steps|Collision rate (%)|
|Intel Lab 101x99 D|QMDP-Net|40.0|100.0|6.6|
|Intel Lab 101x99 S|QMDP-Net|4.0|90.0|37.2|
|Building 079 145x57 D|QMDP-Net|56.0|70.8|22.5|
|Building 079 145x57 S|QMDP-Net|24.0|122.3|43.0|
|Hospital 193x104 D|QMDP-Net|14.0|85.1|28.6|
|Hospital 193x104 S|QMDP-Net|24.0|119.5|28.6|
Table 3 presents a comparison of the performance of TransNet and QMDP-Net in a stochastic grid environment when trained on sets of expert trajectories of different sizes.
The results indicate TransNet significantly reduces data requirements. TransNet achieves a success rate after training with 2,000 scenarios. In contrast, QMDP-Net requires 50,000 scenarios to attain a comparable rate of success in this domain. The reduced data requirements enable TransNet to be more practical for applications where acquiring training data is difficult or costly, such as when training data must be collected through interaction with a physical system.
Table 3 presents the generalization capability of TransNet, compared to QMDP-Net. It compares the performance when networks trained on small artificially generated environments are evaluated on large-scale realistic environments: Intel Lab corresponds to the Intel Research Lab dataset, Building 079 corresponds to the Freiburg Building 079 dataset, and Hospital corresponds to the Freiburg University Hospital dataset. To evaluate scenarios, we ran 25 trials per environment. In the work of [1], QMDP-Net was demonstrated to produce high rates of success on deterministic large-scale environments when trained on expert trajectories in random grids. Here, we trained both TransNet and QMDP-Net on random grids and evaluated them in both deterministic and stochastic cases of realistic environments.
The results indicate TransNet substantially improves generalization capability. Local characteristics of states in the same class of problems (e.g., robot navigation in partially observed scenarios) tend to remain the same, even though the global complexity is totally different. Therefore, by learning separate transition functions based on local characteristics of the states, TransNet can generate policies that generalize well.
We present the learned transition function for Grid10X10S in Supplementary-1. In summary, the transition learned is as expected.
6 Conclusion and Future Work
TransNet is a deep recurrent neural network for computing near-optimal POMDP policies when the transition, observation, and reward functions are a priori unknown. The key novelty of TransNet is a relatively simple neural network module that can learn non-uniform transition functions efficiently. Experiments on navigation benchmarks indicate that TransNet consistently outperforms state-of-the-art QMDP-Net. Moreover, the results also indicate that TransNet can generalize better and substantially reduce the amount of training data and time required to reach a given level of performance.
This work suggests that a relatively simple neural network module can help embed more sophisticated models into deep neural networks, which then leads to substantial improvements for planning in stochastic domains. It would be interesting to understand further how more sophisticated planning and learning components could help further scale up our capability to compute near-optimal policies for decision making in stochastic domains.
- [1] Peter Karkus, David Hsu, and Wee Sun Lee. QMDP-net: Deep learning for planning under partial observability, 2017.
- [2] H. Kurniawati, D. Hsu, and W.S. Lee. SARSOP: Efficient point-based POMDP planning by approximating optimally reachable belief spaces. In Robotics: Science and Systems, 2008.
- [3] J. Pineau, G. Gordon, and S. Thrun. Point-based Value Iteration: An anytime algorithm for POMDPs. In IJCAI, 2003.
- [4] D. Silver and J. Veness. Monte-Carlo planning in large POMDPs. In Proc. Neural Information Processing Systems, 2010.
- [5] A. Somani, N. Ye, D. Hsu, and W.S. Lee. DESPOT: Online POMDP Planning with Regularization. In Proc. Neural Information Processing Systems, 2013.
- [6] Maxime Bouton, Akansel Cosgun, and Mykel J Kochenderfer. Belief state planning for autonomously navigating urban intersections. In 2017 IEEE Intelligent Vehicles Symposium (IV), pages 825–830. IEEE, 2017.
- [7] M. Chen, S. Nikolaidis, H. Soh, D. Hsu, and S. Srinivasa. Planning with trust for human-robot collaboration. In Proc. ACM/IEEE Int. Conf. on Human-Robot Interaction, 2018.
- [8] Marcus Hoerger, Hanna Kurniawati, and Alberto Elfes. POMDP-based candy server: Lessons learned from a seven day demo. In Proc. Int. Conference on Automated Planning and Scheduling (ICAPS), 2019.
- [9] Mohammad Ghavamzadeh, Shie Mannor, Joelle Pineau, Aviv Tamar, et al. Bayesian reinforcement learning: A survey. Foundations and Trends® in Machine Learning, 8(5-6):359–483, 2015.
- [10] Kai Arulkumaran, Marc Peter Deisenroth, Miles Brundage, and Anil Anthony Bharath. Deep reinforcement learning: A brief survey. IEEE Signal Processing Magazine, 34(6):26–38, 2017.
- [11] Aviv Tamar, Yi Wu, Garrett Thomas, Sergey Levine, and Pieter Abbeel. Value iteration networks. Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, Aug 2017.
- [12] Leslie Pack Kaelbling, Michael L Littman, and Anthony R Cassandra. Planning and acting in partially observable stochastic domains. Artificial intelligence, 101(1-2):99–134, 1998.
- [13] Matthew Hausknecht and Peter Stone. Deep recurrent Q-learning for partially observable MDPs, 2015.
- [14] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518:529–533, Feb 2015.
- [15] Piotr Mirowski, Razvan Pascanu, Fabio Viola, Hubert Soyer, Andrew J. Ballard, Andrea Banino, Misha Denil, Ross Goroshin, Laurent Sifre, Koray Kavukcuoglu, Dharshan Kumaran, and Raia Hadsell. Learning to navigate in complex environments, 2016.
- [16] Masashi Okada, Luca Rigazio, and Takenobu Aoshima. Path integral networks: End-to-end differentiable optimal control, 2017.
- [17] Rico Jonschkowski and Oliver Brock. End-to-end learnable histogram filters, 2017.
- [18] Tuomas Haarnoja, Anurag Ajay, Sergey Levine, and Pieter Abbeel. Backprop KF: Learning discriminative deterministic state estimators, 2016.
- [19] Peter Karkus, David Hsu, and Wee Sun Lee. Particle filter networks with application to visual localization, 2018.
- [20] T. Shankar, S. K. Dwivedy, and P. Guha. Reinforcement learning via recurrent convolutional neural networks. 2016 23rd International Conference on Pattern Recognition (ICPR), pages 2592–2597, Dec 2016.
- [21] Michael L. Littman, Anthony R. Cassandra, and Leslie Pack Kaelbling. Learning policies for partially observable environments: Scaling up. In ICML, 1995.
- [22] Andrew Howard and Nicholas Roy. The robotics data set repository (radish), 2003.