
Towards Data-Driven Offline Simulations for Online Reinforcement Learning

11/14/2022
by   Shengpu Tang, et al.
Microsoft

Modern decision-making systems, from robots to web recommendation engines, are expected to adapt: to user preferences, changing circumstances or even new tasks. Yet, it is still uncommon to deploy a dynamically learning agent (rather than a fixed policy) to a production system, as it's perceived as unsafe. Using historical data to reason about learning algorithms, similar to offline policy evaluation (OPE) applied to fixed policies, could help practitioners evaluate and ultimately deploy such adaptive agents to production. In this work, we formalize offline learner simulation (OLS) for reinforcement learning (RL) and propose a novel evaluation protocol that measures both fidelity and efficiency of the simulation. For environments with complex high-dimensional observations, we propose a semi-parametric approach that leverages recent advances in latent state discovery in order to achieve accurate and efficient offline simulations. In preliminary experiments, we show the advantage of our approach compared to fully non-parametric baselines. The code to reproduce these experiments will be made available at https://github.com/microsoft/rl-offline-simulation.


1 Introduction & Related Work

Exploration is one of the central problems of RL and has been studied primarily assuming access to an online environment (sutton2018). Offline RL, the problem of learning a policy from a previously collected experience history, has gained a lot of attention in recent years (offlineRLtutorial; fu2020d4rl). Yet the combination of the two, i.e., reasoning about exploration and the agent's learning process given just an offline dataset, has not been covered extensively by the RL research community, despite its tremendous potential value. Offline policy evaluation (OPE), a subproblem of offline RL, focuses on evaluating the performance of a fixed policy using an offline dataset (Figure 1). In many real-world scenarios, including recommender systems, personalized web services, and robots required to adapt to new tasks, instead of deploying fixed policies we would like the agent to continue learning after deployment. This requires the agent to explore and react to its experience in the environment by adapting its policy. OPE ignores these factors and is not the right framework to assess such agents. In this work, we propose offline learner simulation (OLS) as a way to evaluate non-stationary agents given just an offline dataset.

A natural approach to simulating learners is to leverage model-based RL (also used successfully as an OPE method; see fu2021benchmarks): learn a world model on the offline dataset, and then use that model to generate new rollouts. While simple to use, the learned model incurs bias that may be hard to measure and reason about. On the other side of the spectrum, non-parametric approaches replay the data in certain ways to match the true environment's distribution: john2011 discussed this approach in the contextual bandit setting, and mandel2016offline extended the idea to Markov decision processes (MDPs). These methods are provably unbiased and can simulate a learning process with provable accuracy, but they become inefficient for all but the simplest toy problems. More realistic environments, with rich observations and stochastic transitions, would lead to the simulation terminating after only a few steps, rendering these methods impractical.

In this work, we combine recent advances in latent state discovery (du2019provably; misra2020kinematic; acstate2022), which allow one to recover unobserved latent states from potentially high-dimensional rich observations, with the model-free data-driven approaches proposed in mandel2016offline, in order to improve their efficiency and practicality. In preliminary experiments, we evaluate the methods by comparing the resulting simulations to true online learners in terms of fidelity and efficiency. We show that the newly proposed methods are able to simulate a learning process with high fidelity, and are capable of producing longer simulations than fully non-parametric approaches.

2 Problem Setup

We consider reinforcement learning (RL) in block Markov decision processes (block MDPs), defined by a large (possibly infinite) observation space $\mathcal{X}$, a finite unobservable latent state space $\mathcal{S}$, a finite action space $\mathcal{A}$, transition function $T: \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$, emission function $q: \mathcal{S} \to \Delta(\mathcal{X})$, reward function $R: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, initial state distribution $\mu_0 \in \Delta(\mathcal{S})$, and discount factor $\gamma \in [0, 1)$. A policy $\pi: \mathcal{X} \to \Delta(\mathcal{A})$ specifies a distribution over actions for each observation. In RL, a learner (often realized by executing a learning algorithm), denoted by $\mathcal{L}$, defines a mapping from some history of interactions of arbitrary length to a policy. Here, a history of interactions of length $t$ is an ordered sequence of transition tuples $h_t = (x_i, a_i, r_i, x_i')_{i=1}^{t}$, and $\mathcal{H}$ is the set of all histories. Consider the interaction cycle between the learner and the environment: at step $t$, the learner has seen its interactions with the environment during steps $1, \dots, t-1$ and makes use of the history so far, $h_{t-1}$, to define its policy $\pi_t = \mathcal{L}(h_{t-1})$ for subsequent interaction(s). Overall, the learner follows a non-stationary policy $(\pi_1, \pi_2, \dots)$ where, importantly, the sequence of policies that constitute this non-stationary policy is not known in advance.
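To make the learner abstraction concrete, here is a minimal sketch (our own illustration, not the paper's code) of a learner object that maps a growing history to a policy, together with the online interaction cycle described above; the environment interface (reset/step returning a done flag) and the class names are assumptions for this sketch.

import random

class TabularQLearner:
    """A simple learner: consumes its history incrementally via update() and exposes an
    epsilon-greedy policy over tabular Q-values. Stands in for a black-box learner."""
    def __init__(self, n_actions, alpha=0.1, gamma=0.99, eps=0.1):
        self.q = {}  # Q-values keyed by (observation, action)
        self.n_actions, self.alpha, self.gamma, self.eps = n_actions, alpha, gamma, eps

    def update(self, transition):
        """Extend the history by one transition and update internal parameters."""
        x, a, r, x_next = transition
        best_next = max(self.q.get((x_next, b), 0.0) for b in range(self.n_actions))
        old = self.q.get((x, a), 0.0)
        self.q[(x, a)] = old + self.alpha * (r + self.gamma * best_next - old)

    def policy(self, x):
        """Return the current action distribution pi_t(.|x)."""
        qs = [self.q.get((x, a), 0.0) for a in range(self.n_actions)]
        probs = [self.eps / self.n_actions] * self.n_actions
        probs[qs.index(max(qs))] += 1.0 - self.eps
        return probs

def run_online(env, learner, n_steps):
    """Reference interaction cycle: the learner's current (non-stationary) policy acts,
    and each new transition extends the history it learns from."""
    history, x = [], env.reset()
    for _ in range(n_steps):
        probs = learner.policy(x)
        a = random.choices(range(len(probs)), weights=probs)[0]
        x_next, r, done = env.step(a)  # assumed env API: step(a) -> (next_obs, reward, done)
        history.append((x, a, r, x_next))
        learner.update((x, a, r, x_next))
        x = env.reset() if done else x_next
    return history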

Fig. 1: OPE vs OLS

In this work, we consider the problem of offline learner simulation (OLS), which is used to gain an understanding of how a learner would "perform" in the real environment; we might be interested in how the learner would gather data and explore the environment, or how quickly the learner would converge to the optimal policy. Given a logged dataset of past interactions $D$, in order to simulate a (black-box) learner up to step $t$, it is necessary to provide the learner with a history $h_t$ that is drawn from a distribution identical to the one that would be observed if the learner interacted with the real environment. This is in contrast to OPE, where it is usually sufficient to obtain a value function estimate (Figure 1).

2.1 Evaluation Protocol

A "good" offline simulation should run as accurately as possible for as long as possible. Therefore, we propose to quantify the success of offline simulations via two aspects: efficiency and fidelity.

Efficiency can be measured by the length of histories generated by the simulation before it terminates. While some simulation approaches allow the learner to run indefinitely, this often comes at the cost of large biases. Therefore, we also consider simulation approaches that have the option to "terminate". In Figure 2, sim2 is more efficient than sim1 because it terminates after more simulation steps.

Figure 2: Example of the ground-truth learning curve as well as two offline simulations (details of this experiment are in Section B.1).

To measure fidelity, in theory, we want to compare the distribution of histories generated by the simulation, $\hat{P}(h_t)$, to the real distribution $P(h_t)$. Since these distributions may be difficult to represent analytically, in practice, we instead use aggregate scalar statistics $g$ as proxy measures. (We caution that the simulated $g$ being close to the true $g$ is only a necessary condition for $\hat{P}(h_t)$ being close to $P(h_t)$, not a sufficient one.) For example, we may use $g(t) = \mathbb{E}_{h_t \sim P}\big[V^{\mathcal{L}(h_t)}\big]$, the expected policy performance after the learner has received a length-$t$ history $h_t$. Note that the specific choice of $g$ is dependent on the learner and the environment; some alternative choices include the state visitation distribution in the history $h_t$, or the learner's model parameters. The expectation over distributions of histories can be approximated empirically using averaged results from multiple simulation runs. Suppose we are interested in the fidelity of a simulation from training step $t_1$ to $t_2$. As shown in Figure 2, we can visualize $g(t)$ for $t_1 \le t \le t_2$ as learning curves, and measure the error in simulation as the RMSE between the learning curves from the simulation vs the ground-truth over all steps in this range (alternatively, one may use the mean/max absolute error). For $t_1 \le t \le t_2$, sim1 is a more "accurate" simulation than sim2, even though sim2 eventually converged to the correct policy after $t_2$ whereas sim1 is not able to run until convergence. Comparisons of fidelity are only meaningful for the same interval $[t_1, t_2]$.
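As a concrete instance of this protocol, the following sketch (our own helper, assuming the proxy statistic $g$ has been recorded at every step of every run) computes the averaged learning curves and reports the RMSE over a chosen step range, along with the median simulation length used as the efficiency measure.

import numpy as np

def fidelity_rmse(sim_curves, real_curves, t1, t2):
    """RMSE between averaged learning curves g(t) over steps t1..t2 (inclusive).
    sim_curves / real_curves: arrays of shape (n_runs, n_steps) holding the proxy
    statistic g (e.g., estimated policy value) recorded at every training step."""
    g_sim = np.asarray(sim_curves).mean(axis=0)    # average over simulation runs
    g_real = np.asarray(real_curves).mean(axis=0)  # average over online runs
    window = slice(t1, t2 + 1)
    return float(np.sqrt(np.mean((g_sim[window] - g_real[window]) ** 2)))

def efficiency(sim_lengths):
    """Median number of steps the simulations ran before terminating."""
    return float(np.median(sim_lengths))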

Both efficiency and fidelity are important for offline simulation, yet it is not straightforward to define a single metric that captures both. Since there is usually a trade-off between the two (similar to the bias-variance trade-off), in our experiments below, we consider these two aspects separately.

3 Non-parametric & Semi-parametric Offline Simulation

mandel2016offline proposed several non-parametric approaches for OLS in RL including the queue-based evaluator (QBE) and per-state rejection sampling (PSRS). We focus on PSRS because it was shown to be more efficient than QBE. In PSRS, each transition in the logged dataset is only considered once, and is either accepted and given to the learner, or rejected and discarded. This ensures the unbiasedness of the overall simulation. In Section A.1, we restate these two algorithms with a few modifications to allow for more general settings such as episodic problems with multiple initial states and non-stationary logging policies.

A key step in PSRS (as the name suggests) is line 3 of Algorithm 3, which groups transitions in the logged dataset into queues based on the from-state; the simulation terminates whenever it hits an empty queue. This works reasonably well for tabular MDPs. For block MDPs, a naive approach is to treat the observations as the keys to group transitions by. Since we often have an infinite observation space, every observation is typically seen in the logged dataset only once, making this approach very inefficient and impractical because the queues would be empty before we are able to simulate long enough. Fortunately, recent advances in latent state discovery (du2019provably; misra2020kinematic; acstate2022) allow one to recover the unobserved latent states from potentially high-dimensional rich observations. For simulating such block MDPs, we propose to first learn a latent state encoder, preprocess the logged dataset into latent states, and perform PSRS while grouping transitions using the latent states (high-level pseudocode is shown in Algorithm 1; for more details, see Section A.2). Importantly, the simulation interfaces with learners using raw observations and is thus compatible with the learners that would be used with the original block MDP.

1: Input: Logged dataset $D$ recorded by behavior policy $\pi_b$, initial observations
2: Input: Learner $\mathcal{L}$ that maps from history $h$ to policy $\pi$
3: Input: Encoder $\phi$ that maps from observation $x$ to latent state $s$
4: Preprocess: Calculate $s = \phi(x)$ for all transitions in $D$
5: Initialize queues[$s$] for all $s$: group transitions from $D$ by latent state $s = \phi(x)$, into randomized queues
6: // Start simulation
7: Initialize $h$ = [ ]
8: $x \leftarrow$ initial observation
9: for step $t = 1$ to $T$ do
10:     $\pi \leftarrow \mathcal{L}(h)$   // update learner with the history
11:     $s \leftarrow \phi(x)$   // encode observation
12:     while no transition has been accepted for this step do
13:         if queues[$s$] is empty then terminate
14:         Sample a transition $(\tilde{x}, a, r, x')$ from queues[$s$]
15:         Perform rejection sampling: accept or reject the transition tuple based on similarity between the action distributions of the current policy $\pi$ and behavior policy $\pi_b$, given observation $x$
16:     $h$.append($(x, a, r, x')$)   // update history with new transition
17:     if episode ends then
18:         $x \leftarrow$ start new episode
19:     else
20:         $x \leftarrow x'$
21: Output: History $h$
Algorithm 1 OLS using Latent Per-State Rejection Sampling: high-level pseudo code
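For concreteness, here is a compact Python sketch of this loop (ours, not the released implementation), assuming each logged transition stores the behavior policy's full action distribution and the learner exposes policy/update methods as in the earlier sketch; the rejection test is the standard per-state rejection sampling step.

import random
from collections import defaultdict

def latent_psrs(dataset, init_obs, learner, encoder, max_steps):
    """dataset: list of (x, a, p_b, r, x_next, done), where p_b is the behavior
    policy's action distribution at x (a list over actions)."""
    queues = defaultdict(list)
    for trans in dataset:
        queues[encoder(trans[0])].append(trans)      # group by latent state of x
    for q in queues.values():
        random.shuffle(q)                            # randomized queue order
    init_queue = list(init_obs)
    random.shuffle(init_queue)

    history, x = [], init_queue.pop()
    for _ in range(max_steps):
        pi = learner.policy(x)                       # current policy at the simulated observation
        s = encoder(x)
        while True:
            if not queues[s]:
                return history                       # empty queue: terminate
            x_b, a, p_b, r, x_next, done = queues[s].pop()
            # Rejection sampling: accept action a with prob pi(a|x) / (M * p_b(a)),
            # where M bounds the density ratio so the acceptance probability is <= 1.
            M = max(pi[b] / p_b[b] for b in range(len(pi)) if p_b[b] > 0)
            if M == 0:
                continue                             # no overlap with behavior policy; reject
            if random.random() <= pi[a] / (M * p_b[a]):
                break                                # accepted
        history.append((x, a, r, x_next))
        learner.update((x, a, r, x_next))
        if done:
            if not init_queue:
                return history
            x = init_queue.pop()
        else:
            x = x_next
    return history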

4 Proof of Concept Experiments

In this section, we conduct empirical evaluations of offline simulation for simple block MDPs. First, in a problem with known latent states, we show that using the latent states for offline simulation is more efficient than using the observations, without sacrificing simulation fidelity. Building on this result, we then consider a more challenging scenario where the latent states are not known and must be discovered from data, and demonstrate the effectiveness of our semi-parametric simulation approach even when the learned latent state encoder is not perfect. As the building block for our experiments, we introduce a grid-world navigation task (Figure 3-left) modified from zintgraf2019varibad, further described in Appendix B.

Figure 3: Grid-world with discrete observations. Left: latent state space and action space. Observations consist of the latent state concatenated with 4 random bits. Middle: estimated value of starting state for real Q-learning and two simulations (using observations and using latent states directly). Right: fidelity and efficiency of the simulations. Both simulations have perfect fidelity, but using latent states allows for simulating longer.

Grid-World with Discrete Observations. In this setting, we use a discrete observation space induced by an emission function where an observation $x = (s, b_1, \dots, b_m)$ is made up of the underlying state $s$ and $m$ random bits that are stochastically sampled at every time step. We first collected 1,000 episodes following a uniformly random policy, and then used PSRS on the logged dataset to simulate tabular Q-learning. We compared PSRS keyed on the observations $x$ vs PSRS keyed on the underlying states $s$, and in each case, we repeated the simulation procedure for 100 runs with different random seeds. We show results for $m = 4$, where the observation space is $2^4 = 16$ times larger than the state space. Figure 3-middle shows the learning curves for all runs, where we track the estimated value of the initial state as the $g$ function, since we are in the tabular setting and have transparency over the learner's internal parameters. In Figure 3-right we show the aggregate results for fidelity and efficiency. Both are accurate simulations, as their learning curves overlap exactly with real Q-learning, but using latent states is more efficient than using raw observations and led to simulations about twice as long for this problem. Additional variations of this experiment are explored further in Section B.2.
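To illustrate the two keying strategies, a tiny sketch (ours) of the emission process and of the PSRS grouping keys:

import random

def emit_discrete(s, m=4):
    """Observation = latent state plus m freshly sampled random bits."""
    bits = tuple(random.randint(0, 1) for _ in range(m))
    return (s,) + bits

def psrs_key_obs(x):
    return x        # key by full observation: queues are 2^m times more fragmented

def psrs_key_latent(x):
    return x[0]     # key by latent state: all observations of s share one queue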

Simulation       Efficiency (epochs)   Fidelity (RMSE)
Real             --                    0
PSRS-oracle      17                    0.173
PSRS-encoder     16                    0.203
PSRS-obs-only    50                    1.148
PSRS-act-only    50                    1.221
PSRS-random      50                    1.514

Figure 4: Grid world with continuous observations: experiment results. Left: state visitation distribution and learned latent states. Right: quantitative results comparing efficiency (median simulation length in epochs, higher is better) and fidelity (RMSE of validation performance curves compared to PPO in a real environment, lower is better).

Figure 5: Learning curves of a PPO agent in grid world with continuous observations: "Real" and various PSRS simulations. Left: learning performance, i.e., average episode return within each training epoch. Right: validation performance, i.e., average episode return as measured in a real validation environment, after each training epoch. All results are averaged over 10 runs. OLS using latent PSRS, denoted here as "PSRS-encoder", faithfully reconstructs the learning curves of the real, online PPO agent.

Grid-World with Continuous Observations. Next, we consider a more complex observation space, where the observation is a randomly sampled 2D coordinate within the grid cell of state $s$ (normalized to be between 0 and 1). We similarly collected the logged dataset using a uniformly random policy, leading to the state visitation distribution shown in Figure 4a. Note that since the observation space is now continuous (essentially no two $x$'s are the same), PSRS on the observation space is no longer practical. Therefore, we trained a neural network encoder for discrete latent states for the kinematic inseparability abstraction, using a contrastive estimation objective similar to misra2020kinematic. While we used a latent dimension of 50, the encoder ended up learning only 20 discrete latent states, as shown in Figure 4b. We subsequently used the learned state encoder in offline simulation of a PPO agent using PSRS and compared to several baselines. We visualized the learning curves (Figure 5) by tracking policy performance within each training epoch (Figure 5-left) and in a validation environment at the end of each epoch, obtained via averaging the returns from 10 Monte-Carlo rollouts in the true environment (Figure 5-right). Summarized numerical results on the efficiency and fidelity of the validation performance are shown in Figure 4-right. Despite the errors in the latent states from the learned encoder, it performs close to the oracle encoder, and both are close to online PPO in the real environment. All other baselines, despite being efficient, are far from accurate simulations and have non-negligible error. Overall, this experiment shows promise that for block MDPs, we can learn to encode the observations into latent states and then do offline simulations in an efficient and accurate manner, completely using offline data. Further experimental details, including an explanation of the baselines used, are in Section B.3.

5 Conclusion

In this work, we studied offline learner simulation (OLS), which allows one to evaluate an agent’s learning process using offline datasets. We formally described the evaluation protocol for OLS in terms of efficiency and fidelity, and proposed semi-parametric simulation approaches to handle block MDPs with rich observations by extending existing non-parametric approaches and leveraging recent advances in latent state discovery. Through preliminary experiments, we show the advantage of this approach even when the learned latent states are not perfectly correct. Code to reproduce experimental results will be released publicly upon publication of this paper.

Besides applications in recommender systems and robotics, OLS may be especially useful in multi-task and meta-learning settings, where simulation on a subset of tasks may inform us about future adaptive performance on other tasks. It may prove to be a crucial component for success in offline meta-RL (OfflineMetaRL; mitchell2021offline; offlineMetaRLWithOnlineSelfSupervision). Future work should also consider removing the assumption of a discrete latent topology and accounting for exogenous processes (further discussed in Section B.4) to handle a wider class of problems.

References

Appendix A Algorithms

A.1 Non-parametric Simulation for Tabular MDPs

The QBE (Algorithm 2) and PSRS (Algorithm 3) algorithms are based on mandel2016offline. We made the following two modifications to the versions originally proposed to allow for more general settings such as episodic problems with multiple initial states and non-stationary logging policies.

  • L4 and L6 in both algorithms, L13-15 in QBE and L16-18 in PSRS: to support episodic settings, we initialize a set of possible starting states in the init_queue, and during simulation we obtain a new starting state from this set whenever the episode ends.

  • L11-12 in PSRS: to support non-stationary logging policies, assuming the logged dataset has stored the action distribution of the behavior policy from which each observed action was drawn, we recalculate the rejection-sampling quantities (the acceptance ratio and its normalizing constant $M$) for each candidate transition in the while loop.

1: Input: Logged dataset $D = \{(s_i, a_i, r_i, s_i')\}$
2: Input: Learner $\mathcal{L}$ that maps from history $h$ to policy $\pi$
3: Initialize queues[$s, a$] = Queue(RandomOrder$\{(s_i, a_i, r_i, s_i') \in D \text{ s.t. } s_i = s \text{ and } a_i = a\}$), for all $(s, a)$
4: Initialize init_queue = Queue(RandomOrder$\{$initial states$\}$)
5: // Start simulation
6: Initialize $h$ = [ ]
7: $s \leftarrow$ init_queue.pop()
8: for step $t = 1$ to $T$ do
9:     $\pi \leftarrow \mathcal{L}(h)$   // update learner with the history
10:     $a \sim \pi(\cdot \mid s)$   // take an action according to the current policy
11:     if queues[$s, a$] is empty then terminate
12:     Sample a transition $(s, a, r, s')$ = queues[$s, a$].pop()
13:     $h$.append($(s, a, r, s')$);   $s \leftarrow s'$   // update history with new transition
14:     if episode ends then
15:         if init_queue is empty then terminate
16:         $s \leftarrow$ init_queue.pop()
17: Output: History $h$
Algorithm 2 Tabular Queue-Based Evaluator (modified from Alg 1 in mandel2016offline)
1: Input: Logged dataset $D = \{(s_i, a_i, p_i, r_i, s_i')\}$ where $p_i(\cdot) = \pi_b(\cdot \mid s_i)$ is the behavior action distribution
2: Input: Learner $\mathcal{L}$ that maps from history $h$ to policy $\pi$
3: Initialize queues[$s$] = Queue(RandomOrder$\{(s_i, a_i, p_i, r_i, s_i') \in D \text{ s.t. } s_i = s\}$), for all $s$
4: Initialize init_queue = Queue(RandomOrder$\{$initial states$\}$)
5: // Start simulation
6: Initialize $h$ = [ ]
7: $s \leftarrow$ init_queue.pop()
8: for step $t = 1$ to $T$ do
9:     $\pi \leftarrow \mathcal{L}(h)$   // update learner with the history
10:     while no transition has been accepted for this step do
11:         if queues[$s$] is empty then terminate
12:         Sample a transition $(s, a, p, r, s')$ = queues[$s$].pop()
13:         Calculate $M = \max_{a'} \pi(a' \mid s) / p(a')$
14:         Sample $U \sim \mathrm{Uniform}(0, 1)$
15:         if $U > \pi(a \mid s) / (M \cdot p(a))$ then reject the sampled transition
16:     $h$.append($(s, a, r, s')$);   $s \leftarrow s'$   // update history with new transition
17:     if episode ends then
18:         if init_queue is empty then terminate
19:         $s \leftarrow$ init_queue.pop()
20: Output: History $h$
Algorithm 3 Tabular Per-State Rejection Sampling (modified from Alg 2 in mandel2016offline)

A.2 Semi-parametric Simulation for Block MDPs

In this appendix section we present a more detailed version of Algorithm 1 from the main paper.

We extend Algorithm 3 to support block MDPs by assuming access to an encoder function $\phi$ that maps an observation $x$ to its corresponding latent state $s$ (recall that under the definition of block MDPs, there is a unique latent state that can emit a given observation $x$). Note that the same extension can be applied to Algorithm 2 but is omitted here. The main parts of Algorithm 4 follow the vanilla PSRS algorithm, with the following key differences (highlighted in magenta in the original listing):

  • L3: the algorithm requires the latent state encoder as an additional input.

  • L4: we preprocess the observations into corresponding latent states for all transitions in the logged dataset.

  • L5: when grouping transitions into queues, we use the pre-computed latent states as the key instead of the observations.

  • L6: the set of starting observations is taken as is, without being converted to latent states.

  • L12-L14: when retrieving from queues, we first compute the latent state $s = \phi(x)$ of the current observation $x$, and then query the corresponding queue for $s$.

  • L15-L18: the policies still use observations as input, maintaining the agent-environment interface of the original block-MDP that this algorithm is simulating.

1: Input: Logged dataset $D = \{(x_i, a_i, p_i, r_i, x_i')\}$ where $p_i(\cdot) = \pi_b(\cdot \mid x_i)$ is the behavior action distribution
2: Input: Learner $\mathcal{L}$ that maps from history $h$ to policy $\pi$
3: Input: Encoder $\phi$ that maps from observation $x$ to latent state $s$
4: Preprocess: Calculate $s_i = \phi(x_i)$ for all transitions in $D$
5: Initialize queues[$s$] = Queue(RandomOrder$\{(x_i, a_i, p_i, r_i, x_i') \in D \text{ s.t. } \phi(x_i) = s\}$), for all $s$
6: Initialize init_queue = Queue(RandomOrder$\{$initial observations$\}$)
7: // Start simulation
8: Initialize $h$ = [ ]
9: $x \leftarrow$ init_queue.pop()
10: for step $t = 1$ to $T$ do
11:     $\pi \leftarrow \mathcal{L}(h)$   // update learner with the history
12:     $s \leftarrow \phi(x)$
13:     while no transition has been accepted for this step do
14:         if queues[$s$] is empty then terminate
15:         Sample a transition $(\tilde{x}, a, p, r, x')$ = queues[$s$].pop()
16:         Calculate $M = \max_{a'} \pi(a' \mid x) / p(a')$
17:         Sample $U \sim \mathrm{Uniform}(0, 1)$
18:         if $U > \pi(a \mid x) / (M \cdot p(a))$ then reject the sampled transition
19:     $h$.append($(x, a, r, x')$);   $x \leftarrow x'$   // update history with new transition
20:     if episode ends then
21:         if init_queue is empty then terminate
22:         $x \leftarrow$ init_queue.pop()
23: Output: History $h$
Algorithm 4 Latent Per-State Rejection Sampling

A.2.1 Latent State Encoder

Recent work has proposed various ways of learning the latent state encoder for block MDPs, including du2019provably, HOMER (misra2020kinematic), PPE (efroni2021provably), AC-State (acstate2022), and AIS (subramanian2022approximate). In this work, we adapt the approach used in HOMER, which can learn a latent state abstraction for kinematic inseparability (misra2020kinematic), where two observations are combined into a single latent state if and only if (i) they have the same distribution over next observations, and (ii) they have the same (joint) distribution over previous observations and previous actions. Instead of the online interactive version of the HOMER algorithm, we take the state abstraction learning component and train it on offline data. As proven by misra2020kinematic, the kinematic inseparability abstraction can be learned via a contrastive estimation objective in a supervised classification problem:

$\min_{\phi, w} \; \mathbb{E}_{(x, a, x', y) \sim D_{\mathrm{NCE}}} \Big[ -y \log w\big(\phi(x), a, \phi(x')\big) - (1 - y) \log \big(1 - w(\phi(x), a, \phi(x'))\big) \Big],$

where $\phi$ is the latent state encoder that takes a single observation as input, $w$ is a binary classifier for a transition $(\phi(x), a, \phi(x'))$ in the latent space, and the overall objective is minimizing the negative log-likelihood of a binary classification problem. The noise contrastive distribution $D_{\mathrm{NCE}}$ is constructed as follows: first randomly draw a transition $(x, a, x')$ from the logged dataset, then either $(x, a, x', y{=}1)$ or $(x, a, \tilde{x}', y{=}0)$ is produced with probability $1/2$, where $\tilde{x}'$ comes from an independent draw from the logged dataset $D$, and $y$ indicates whether the transition is real (1) or an imposter (0). Instead of preconstructing this distribution into a full dataset, for efficient implementation during mini-batch learning, we independently sample twice from the logged dataset and include both $(x, a, x', 1)$ and $(x, a, \tilde{x}', 0)$ in a batch. For more details on the theoretical guarantees, see misra2020kinematic.
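A sketch of one mini-batch step of this objective is given below (assuming PyTorch modules phi (encoder) and w (classifier); the batch layout and names are ours, not the paper's released code).

import torch
import torch.nn.functional as F

def contrastive_step(phi, w, batch_a, batch_b, optimizer):
    """One contrastive estimation step on offline data.

    batch_a, batch_b: two independently shuffled mini-batches of transitions,
    each a tuple (x, a, x_next) of tensors with matching batch size.
    Real transitions (x, a, x_next) from batch_a get label 1; imposters pair
    batch_a's (x, a) with batch_b's x_next and get label 0.
    """
    x, a, x_next = batch_a
    _, _, x_imposter = batch_b

    real_logits = w(phi(x), a, phi(x_next))        # classifier on latent transition
    fake_logits = w(phi(x), a, phi(x_imposter))    # imposter next-observation

    logits = torch.cat([real_logits, fake_logits])
    labels = torch.cat([torch.ones_like(real_logits), torch.zeros_like(fake_logits)])
    loss = F.binary_cross_entropy_with_logits(logits, labels)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()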

Appendix B Experimental Details on Grid-World

Environment description

We consider a navigation task in a grid world (Figure 6), based on zintgraf2019varibad. The discrete state space $\mathcal{S}$ is represented by the index of each cell. The agent starts from the bottom-left cell, and the goal is the top-right cell. There are 5 actions: stay, up, right, down, left, each of which deterministically moves the agent to the adjacent cell in that direction (if the action would move the agent out of bounds, the agent stays at the current cell). A fixed reward is given for reaching the goal, and a constant per-step reward otherwise. There is no cap on the maximum episode length.

Figure 6: Grid-world with discrete observations, the latent state space and the action space.

B.1 Illustrative Example of Evaluation Protocol (Figure 2)

We consider an observation space induced by an emission function where an observation $x = (s, c)$ is the latent state $s$ augmented by a modulo counter $c \in \{0, \dots, k-1\}$. At every transition, the counter value is incremented by 1 and then taken modulo $k$, i.e., $c_{t+1} = (c_t + 1) \bmod k$. The initial observation samples the counter uniformly. This setup is motivated by a common way to construct observations in real-life applications where some cyclical/periodic element is included, such as time of day.

We collected 1,000 episodes using a uniformly random policy in this environment to create the offline dataset. For simulating learners, we used a model-based simulation strategy, which first builds an environment model via maximum likelihood estimation of the transition dynamics (assuming the reward function is given), and then uses the estimated environment model as the simulator that interacts with the agent. We considered two versions of model-based simulation: sim1 builds the environment model in the observation space $\mathcal{X}$, and sim2 does so in the latent state space $\mathcal{S}$ (assuming $s$ is recorded in the data) and uses a uniform emission function for the counter $c$.

For Figure 2, we used the model-based simulations described above to simulate Q-learning with $\epsilon$-greedy exploration, using a constant exploration rate, a fixed learning rate, and a fixed discount factor. The model-based simulation is repeated for 100 runs using different random seeds, and the results are averaged. This is compared to online Q-learning using the same hyperparameters (also averaged over 100 repeated runs). To illustrate the trade-off between fidelity and efficiency, we truncated the simulation with the observation-space model at 50,000 training steps, and the simulation with the latent state model at 100,000 steps. We see from Figure 2 that using the observation-space model is not a high-fidelity simulation, since its learning curve starts to deviate from online Q-learning after around 20,000 training steps.
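A minimal sketch of the count-based maximum-likelihood model used by these model-based simulations (our own illustration; it can be fit over either the observation space, as in sim1, or the latent state space, as in sim2):

from collections import defaultdict
import random

def fit_mle_model(dataset):
    """Estimate P(x' | x, a) by empirical counts over the logged transitions."""
    counts = defaultdict(lambda: defaultdict(int))
    for x, a, r, x_next in dataset:
        counts[(x, a)][x_next] += 1
    model = {}
    for key, nexts in counts.items():
        total = sum(nexts.values())
        model[key] = [(x_next, n / total) for x_next, n in nexts.items()]
    return model

def sample_next(model, x, a):
    """Draw a next state/observation from the estimated dynamics (None if (x, a) was never seen)."""
    if (x, a) not in model:
        return None
    nexts, probs = zip(*model[(x, a)])
    return random.choices(nexts, weights=probs)[0]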

B.2 Grid-World with Discrete Observations

For this experiment, we use a discrete observation space induced by an emission function where an observation $x = (s, b_1, \dots, b_m)$ is made up of the underlying state $s$ and $m$ noisy bits that are stochastically sampled at every time step. For the numerical study we considered several values of $m$, corresponding to observation spaces that are $2^m$ times larger than the state space. For illustration, we show $m = 4$, which has the clearest trend.

We first collected 1,000 episodes following a uniformly random policy, and then used PSRS on the logged dataset to simulate different "learners". Depending on the type of learner, we used a different $g$ function to measure the simulation fidelity. We compared PSRS with the observations and with the underlying states, and in each case, we repeated the simulation procedure for 100 runs with different random seeds. Both simulations are compared to the learner interacting with the real environment, also for 100 repeated runs.

Table 1: Grid-world, observation = latent state + 4 random bits (results for Settings 1-3 below).
Setting 1: Agent performs Monte-Carlo evaluation of a uniformly random policy. Here, instead of tracking the progress over many steps, we use the final MC policy value estimate as the $g$ function. Because the evaluated policy is identical to the logging policy, the sampled transition in PSRS is never rejected, and PSRS with latent states can make use of all the samples. Compared to using the observations, using latent states led to simulations that are roughly two times longer and produce a tighter confidence interval in the resulting value estimate. Both are unbiased estimates of the ground-truth policy value.

Setting 2: Agent performs Monte-Carlo evaluation of a near-optimal policy. We first learned the optimal Q-function on the true environment, and then constructed an $\epsilon$-soft policy from it. We used the same $g$ function as in Setting 1. Because the evaluated learner's policy is different from the logging policy, some of the sampled transitions in PSRS are rejected, and the simulations are much shorter (a fraction of the length in Setting 1). The general trend from Setting 1 still holds.

Setting 3: Agent performs Q-learning with $\epsilon$-greedy exploration. This setting is discussed in Figure 3. Here, the learner is a tabular Q-learning agent with zero initialization, and uses $\epsilon$-greedy exploration with a constant exploration rate.

Discussion

For the setting with a discrete observation space, we presented results for the simplest case where the observation is the latent state augmented with independently sampled noise bits. Future work should consider additional variations, such as observations containing a counter that increments at each step (modulo a maximum value), or observations made up of the history of previous latent states and previous actions. For these settings, we hypothesize that further modifications to PSRS are necessary to enable accurate and efficient offline simulations.

B.3 Grid-World with Continuous Observations

We considered a continuous observation space induced by an emission function where the noise values are randomly drawn at each time step: each observation is a 2D Cartesian coordinate, sampled uniformly within the grid cell of the current state, inside the unit box in which the entire grid-world lies. We collected a logged dataset containing 1,000,000 transitions using a uniformly random policy, for which all recorded observations are visualized in Figure 7-left. Because the behavior policy takes each of the five actions uniformly at random (essentially a random walk), the state visitation in the logged dataset is more concentrated towards the lower-left region around the starting state, and less frequent towards the upper-right region around the goal cell.
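A sketch of this emission function (ours; the grid dimensions are passed in rather than assumed, and row-major cell indexing is an assumption):

import random

def emit_continuous(s, n_cols, n_rows):
    """Map latent cell index s to a uniformly random 2D point inside that cell,
    normalized to the unit box."""
    row, col = divmod(s, n_cols)                 # assumes row-major cell indexing
    u, v = random.random(), random.random()      # fresh noise at every time step
    return ((col + u) / n_cols, (row + v) / n_rows)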

Figure 7: Comparison of the real latent states (left) and learned latent states (right) in the grid world environment. The left figure also shows the state visitation distribution in an offline dataset collected using the random policy.

Since the observation space is now continuous (essentially no two $x$'s are the same), vanilla PSRS on the observation space is no longer practical. Therefore, we need to apply Algorithm 4 instead, which requires a mapping from observations to latent states. For learning the latent state encoder from offline data, we performed a 50-50 split of the logged dataset to create training and validation sets of individual transitions. In order to learn discrete latent states for the kinematic inseparability abstraction, we used an architecture and a contrastive estimation objective similar to misra2020kinematic. The encoder network is a one-hidden-layer neural network with an adjustable hidden layer size and leaky ReLU activation. It is applied to both the current observation $x$ and the next observation $x'$, mapping each observation to a vector of logits over the latent dimension. We incorporated both the backward and forward discrete bottlenecks by applying Gumbel-softmax layers to the encodings of both the current and next observations. The contrastive learning objective is a classification task that tries to distinguish between observed and imposter transitions; in our implementation, we used another one-hidden-layer network with the same hidden layer size and activation as the encoder. To implement this training objective, in the training loop we iterate through two batches of data (shuffled differently): the observed transitions are given classification label 1 and the imposter transitions classification label 0. We performed a hyperparameter search over the latent dimension, network capacity, learning rate, and different initializations, and selected the final model based on the best validation performance (measured by the classification loss). The final model has a latent dimension of 50. The learned latent states did not use the full capacity given and are not a perfect match with the ground-truth latent states, as shown in Figure 7; some learned states combine multiple true latent states (e.g., learned state 13 combines true states 4 and 9, learned state 8 combines true states 0 and 3, and the states being combined are not necessarily contiguous), while others sub-divide one true latent state into multiple (e.g., true state 12 is divided into learned states 10 and 47).
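A sketch of such an encoder with a discrete (Gumbel-softmax) bottleneck is shown below, under the assumptions stated above (one hidden layer, leaky ReLU, logits over the latent dimension); the layer sizes and class name here are illustrative, not the exact released architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DiscreteStateEncoder(nn.Module):
    """Maps a 2D observation to logits over `latent_dim` candidate latent states."""
    def __init__(self, obs_dim=2, hidden=64, latent_dim=50):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.LeakyReLU(),
            nn.Linear(hidden, latent_dim),
        )

    def forward(self, x, tau=1.0, hard=True):
        logits = self.net(x)
        # Discrete bottleneck: straight-through Gumbel-softmax sample over latent states.
        return F.gumbel_softmax(logits, tau=tau, hard=hard)

    @torch.no_grad()
    def encode(self, x):
        """Deterministic latent state id, used when grouping transitions for PSRS."""
        return self.net(x).argmax(dim=-1)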

To verify whether the learned latent states are sensible, we conducted a small qualitative experiment comparing rollouts in the PSRS simulation vs in the real environment when the same sequence of actions is taken by the agent (this is achieved by first performing the PSRS rollouts, and then replaying the sequence of actions generated by PSRS in the real environment). We used a near-optimal $\epsilon$-soft policy derived from the optimal Q-function in the real environment. As shown in Figure 8, the PSRS simulation led to a different distribution of trajectories compared to rollouts in the real environment. In particular, because the learned latent states combine certain non-adjacent states, the simulated trajectories are likely to jump from the starting state to the lower-right region, which does not happen in the real environment. For example, in Figure 9-left, the simulated trajectory deviates substantially from the real trajectory, whereas in other cases, such as Figure 9 middle and right, the simulated trajectory matches well with what would be observed in the real environment.

Figure 8: Left: real environment rollouts. Right: PSRS simulation rollouts.
Figure 9: Three comparisons of rollouts in the real environment and from PSRS simulation.

We subsequently used the learned state encoder in offline simulation of a PPO agent. We based our PPO implementation on a public repository (https://github.com/openai/spinningup/tree/master/spinup/algos/pytorch/ppo) and used the following hyperparameters: a fixed discount factor; an actor-critic network with two hidden layers, each containing 32 units with ReLU activation; and training for up to 50 epochs with 5,000 steps per epoch. We track two quantities as the $g$ function:

  • the learning performance, i.e., average episode returns of the learning agent within each training epoch (shown in Figure 5-left)

  • the validation performance at the end of every epoch as the $g$ function, estimated by the average returns of the policy at that time from 10 full episodes (Figure 5-right, Figure 10, Figure 12-left), as sketched below. We also report average episode lengths, which are inversely correlated with episode returns (Figure 11, Figure 12-right).
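A minimal sketch of this validation estimate (ours; the environment API matches the earlier sketches and is an assumption):

import random

def validation_performance(env, policy, n_episodes=10, max_steps=1000):
    """Average Monte-Carlo return of the current policy over full episodes in a
    real validation environment (assumed env API: reset() and step(a) -> (x, r, done))."""
    total = 0.0
    for _ in range(n_episodes):
        x, ep_return, done, t = env.reset(), 0.0, False, 0
        while not done and t < max_steps:
            probs = policy(x)
            a = random.choices(range(len(probs)), weights=probs)[0]
            x, r, done = env.step(a)
            ep_return += r
            t += 1
        total += ep_return
    return total / n_episodes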

In addition, we compared to the following baselines:

  • PSRS-oracle: in place of the learned encoder, an oracle encoder is used that maps the observations back to their ground-truth latent states.

  • PSRS-obs-only: within PSRS, ignores the action probabilities (i.e., ignores the rejection sampling ratio) and simply accepts a transition from the queue of the corresponding state.

  • PSRS-act-only: within PSRS, ignores the per-state queues and randomly draws a transition, but respects the action probabilities and the rejection sampling ratio.

  • PSRS-random: randomly draws a next observation from all observed transitions, ignoring both the state and the action.

Each simulation setting is repeated for 10 runs, and the results are averaged. We also performed 10 runs of the PPO agent in the real environment. We visualize the learning curves of validation episode returns and validation episode lengths for each simulation (and the real run) in Figures 10 and 11, and summarize the average results in Figure 12. Despite the errors in the latent states from the learned encoder, it performs close to the oracle encoder, and both are close to online PPO in the real environment. All other baselines, despite being efficient, are far from accurate simulations and have non-negligible error. Overall, this experiment shows promise that for block MDPs, we can learn to encode the observations into latent states and then do offline simulation in an efficient and accurate manner, completely using offline data.

Figure 10: Individual learning curves from each simulation and each of the 10 runs, measured as validation policy performance.
Figure 11: Individual learning curves from each simulation and each of the 10 runs, measured as validation episode lengths.
Figure 12: Average learning curves in terms of validation policy performance and episode lengths.

B.4 Simulating Exo-MDPs

Here, we revisit the setting considered in Section B.1 and apply our proposed PSRS-based simulation. Recall that in this problem, we have an observation space induced by an emission function where an observation $x = (s, c)$ is the latent state $s$ augmented by a modulo counter $c \in \{0, \dots, k-1\}$. At every transition, the counter value is incremented by 1 and then taken modulo $k$, i.e., $c_{t+1} = (c_t + 1) \bmod k$. The initial observation samples the counter uniformly. This setup is motivated by a common way to construct observations in real-life applications where some cyclical/periodic element is included, such as time of day.

We collected 1,000 episodes using a uniformly random policy in this environment to create the offline dataset. We then applied PSRS on the logged dataset to simulate tabular Q-learning. We compared PSRS with the observations $x$ vs PSRS with the underlying states $s$, and in each case, we repeated the simulation procedure for 100 runs with different random seeds. As expected, PSRS with the latent states (green) is more efficient than using the raw observations (red) and led to longer simulations (Figure 13). However, perhaps surprisingly, PSRS using the latent states is not an accurate simulation. What makes this problem different from the one considered in Section B.2? In both problems (grid-world with noise vs grid-world with modulo counter), the underlying grid-world MDP is identical; so are the (presumed) latent state spaces and the controllable latent dynamics.

Figure 13: Grid-world with modulo-counter observations. Left: estimated value of the starting state for real Q-learning and two simulations (using observations and using latent states directly). Right: fidelity and efficiency of the simulations.

The main difference comes from the transition dynamics in the observation space. For the grid-world with noise, since we draw the noise bits randomly at each step, the exogenous noise is time-independent and we can write the observation as $x_t = (s_t, \xi_t)$ with $\xi_t$ sampled independently at each step. However, for the grid-world with modulo counter, the "noise" is time-dependent and forms an exogenous process, where $c_{t+1} = f(c_t)$ for some fixed function $f$ (here, $f(c) = (c + 1) \bmod k$). When we apply Algorithm 4 using the latent states $s$, we are destroying the exogenous process in the counter $c$, resulting in the simulation producing impossible transitions, e.g., transitions in which the counter does not advance by exactly 1. Since the learner we are simulating operates in the raw observation space, such implausible transitions will lead to inaccurate simulations.

Here, we provide one solution for simulating these types of exogenous MDPs. We assume access to the following oracles: a split oracle that "splits" an observation $x$ into its endogenous part $z$ and exogenous part $c$; a merge oracle that does the inverse and "merges" the endogenous and exogenous parts into an observation; and a reward oracle that gives the expected reward of a transition. We modify the latent PSRS Algorithm 4 to account for exogenous processes as shown in Algorithm 5, where we separately maintain queues over the exogenous states and use them to preserve the dynamics of the exogenous process in the simulation. We highlight the main algorithmic modifications in magenta. Applying Algorithm 5 to simulate Q-learning on this problem (the blue approach in Figure 13) leads to both more efficient and more accurate simulations, compared to the two approaches considered earlier in this section.

1: Input: Logged dataset $D = \{(x_i, a_i, p_i, r_i, x_i')\}$ where $p_i(\cdot) = \pi_b(\cdot \mid x_i)$ is the behavior action distribution
2: Input: Learner $\mathcal{L}$ that maps from history $h$ to policy $\pi$
3: Input: Oracles split($x$) $\to (z, c)$ and merge($z, c$) $\to x$
4: Input: $\bar{R}(x, a, x')$, the expected reward for a transition reaching $x'$
5: Preprocess: for each $(x_i, a_i, p_i, r_i, x_i') \in D$, calculate $(z_i, c_i) = $ split($x_i$), $(z_i', c_i') = $ split($x_i'$)
6: Initialize endogenous queues zqueues[$z$] = Queue(RandomOrder$\{(z_i, a_i, p_i, z_i') \text{ s.t. } z_i = z\}$), for all $z$
7: Initialize exogenous queues cqueues[$c$] = Queue(RandomOrder$\{(c_i, c_i') \text{ s.t. } c_i = c\}$), for all $c$
8: Initialize init_queue = Queue(RandomOrder$\{$initial observations$\}$)
9: // Start simulation
10: Initialize $h$ = [ ]
11: $x \leftarrow$ init_queue.pop()
12: for step $t = 1$ to $T$ do
13:     $\pi \leftarrow \mathcal{L}(h)$   // update learner with the history
14:     Calculate $(z, c) = $ split($x$)
15:     while no transition has been accepted for this step do
16:         if zqueues[$z$] is empty or cqueues[$c$] is empty then terminate
17:         Sample an endogenous transition $(z, a, p, z')$ = zqueues[$z$].pop()
18:         Calculate $M = \max_{a'} \pi(a' \mid x) / p(a')$
19:         Sample $U \sim \mathrm{Uniform}(0, 1)$
20:         if $U > \pi(a \mid x) / (M \cdot p(a))$ then reject the sampled transition
21:     Sample an exogenous transition $(c, c')$ = cqueues[$c$].pop()
22:     Calculate $x' = $ merge($z', c'$), $r = \bar{R}(x, a, x')$
23:     $h$.append($(x, a, r, x')$);   $x \leftarrow x'$   // update history with new transition
24:     if episode ends then
25:         if init_queue is empty then terminate
26:         $x \leftarrow$ init_queue.pop()
27: Output: History $h$
Algorithm 5 Latent PSRS simulation for Exo-MDP
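The essential difference from latent PSRS can be sketched as the single-step helper below (ours; the action rejection sampling is omitted for brevity, and the split/merge/reward oracles are assumed inputs).

def exo_psrs_step(zqueues, cqueues, x, split, merge, reward_oracle):
    """One simplified Exo-MDP simulation step: the endogenous part comes from the
    per-latent-state queue, the exogenous part from its own queue, and the two are
    merged back into an observation. Rejection sampling on the action is omitted
    here; it is identical to latent PSRS."""
    z, c = split(x)
    if not zqueues[z] or not cqueues[c]:
        return None                                  # terminate the simulation
    a, z_next = zqueues[z].pop()                     # endogenous transition (action kept)
    c_next = cqueues[c].pop()                        # exogenous transition, e.g. the counter
    x_next = merge(z_next, c_next)
    r = reward_oracle(x, a, x_next)
    return (x, a, r, x_next)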

Note that the experiments above only show what might be required for efficient and accurate simulations for Exo-MDPs and we imposed strong assumptions on the oracles. In practice, such oracles are rarely given directly and must be learned from data, and we believe this is an important area of future research.

Moreover, these experiments highlight an important consideration when choosing the type of latent state discovery algorithm. For example, the AC-State formulation proposed in acstate2022 explicitly extracts the agent-controllable latent states while removing any exogenous noise or processes (in their example, where the observation is an image of a scene containing a robot arm, the video playing on a TV screen in the background is not controllable and is thus exogenous). Applying latent PSRS (Algorithm 4) using the latent states discovered by AC-State would lead to inaccurate simulations, e.g., the simulated observation sequence would not form a coherent video on the TV screen. We note that this may be acceptable as long as the simulated learner ignores the background. However, if the exogenous states (though uncontrollable) are otherwise important to the learner (e.g., they determine the reward), then such simulations are not reliable.