1 Introduction & Related Work
Exploration is one of the central problems of RL and has been studied primarily assuming access to an online environment (sutton2018). Offline RL, the problem of learning a policy from a previously collected experience history, has gained a lot of attention in recent years (offlineRLtutorial; fu2020d4rl). Yet the combination of the two, i.e., reasoning about exploration and an agent's learning process given just an offline dataset, though a problem of tremendous potential value, has not been covered extensively by the RL research community. Offline policy evaluation (OPE), a subproblem of offline RL, focuses on evaluating the performance of a fixed policy using an offline dataset (Figure 1). In many real-world scenarios, including recommender systems, personalized web services, and robots required to adapt to new tasks, instead of deploying fixed policies, we would like the agent to continue learning after deployment. This requires the agent to explore and react to its experience in the environment by adapting its policy. OPE ignores these factors and is not the right framework to assess such agents. In this work we propose offline learner simulation (OLS) as a way to evaluate non-stationary agents given just an offline dataset.
A natural approach to simulating learners is to leverage model-based RL (also used successfully as an OPE method; see fu2021benchmarks): learn a world model on the offline dataset, and then use that model to generate new rollouts. While simple to use, the learned model incurs bias that may be hard to measure and reason about. On the other side of the spectrum, we have non-parametric approaches that replay data in certain ways to match the true environment's distribution: john2011 discussed this approach in the contextual bandit setting, and mandel2016offline extended the idea to Markov decision processes (MDPs). These methods are provably unbiased and allow for simulating a learning process with provable accuracy, but become inefficient for all but the simplest toy problems. More realistic environments, with rich observations and stochastic transitions, would cause the simulation to terminate after only a few steps, rendering these methods impractical.
In this work, we combine recent advances in latent state discovery (du2019provably; misra2020kinematic; acstate2022), which allow one to recover unobserved latent states from potentially high-dimensional rich observations, with the model-free, data-driven approaches proposed in mandel2016offline, in order to improve their efficiency and practicality. In preliminary experiments, we evaluate the methods by comparing the resulting simulations to the true online learners in terms of fidelity and efficiency. We show that the newly proposed methods are able to simulate a learning process with high fidelity, and are capable of producing longer simulations than fully non-parametric approaches.
2 Problem Setup
We consider reinforcement learning (RL) in block Markov decision processes (block MDPs), defined by a large (possibly infinite) observation space $\mathcal{X}$, a finite unobservable latent state space $\mathcal{S}$, a finite action space $\mathcal{A}$, a transition function $T$, an emission function $q$, a reward function $R$, an initial state distribution $\mu_0$, and a discount factor $\gamma$. A policy $\pi$ specifies a distribution over actions for each observation. In RL, a learner (often realized by executing a learning algorithm), denoted by $L$, defines a mapping from a history of interactions of arbitrary length to a policy $\pi$. Here, a history of interactions of length $t$ is an ordered sequence of transition tuples $h_t = (x_i, a_i, r_i, x_i')_{i=1}^{t}$, and $\mathcal{H}$ is the set of all histories. Consider the interaction cycle between the learner and the environment: at step $t$, the learner has seen its interactions with the environment during steps $1, \dots, t-1$ and makes use of the history so far, $h_{t-1}$, to define its policy for subsequent interaction(s). Overall, the learner follows a non-stationary policy where, importantly, the sequence of policies that constitute this non-stationary policy is not known in advance.
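The interaction cycle above can be sketched in a few lines of Python. This is an illustrative sketch in our own notation, not code from the paper: `run_learner`, `sample`, and the toy environment interface are all our own names, and the learner is modeled as a black-box function from the history so far to a policy.

```python
import random

def sample(probs, rng):
    # Draw an index from a discrete distribution given as a list of probs.
    u, acc = rng.random(), 0.0
    for a, p in enumerate(probs):
        acc += p
        if u < acc:
            return a
    return len(probs) - 1

def run_learner(env_reset, env_step, learner, n_steps, seed=0):
    """Online interaction loop: at step t the learner maps the history of
    the first t-1 transitions to a policy, so the realized policy sequence
    is not known in advance.

    learner:  history -> (observation -> action distribution)
    env_step: (observation, action) -> (reward, next observation)
    """
    rng = random.Random(seed)
    history, x = [], env_reset()
    for _ in range(n_steps):
        pi = learner(history)          # the learner may change its policy
        a = sample(pi(x), rng)         # ... at every step of interaction
        r, x_next = env_step(x, a)
        history.append((x, a, r, x_next))
        x = x_next
    return history
```

The point of the sketch is the signature of `learner`: offline learner simulation must feed it histories whose distribution matches what this online loop would produce.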
In this work, we consider the problem of offline learner simulation (OLS), which is used to gain an understanding of how a learner would "perform" in the real environment; we might be interested in how the learner would gather data and explore the environment, or how quickly the learner would converge to the optimal policy. Given a logged dataset of past interactions, in order to simulate a (black-box) learner up to step $t$, it is necessary to provide the learner with a history drawn from a distribution identical to the one that would be observed if the learner interacted with the real environment. This is in contrast to OPE, where it is usually sufficient to obtain a value function estimate (Figure 1).

2.1 Evaluation Protocol
A “good” offline simulation should run as accurately as possible for as long as possible. Therefore, we propose to quantify the success of offline simulations via two aspects: efficiency and fidelity.
Efficiency can be measured by the length of histories generated by the simulation before it terminates. While some simulation approaches allow the learner to run indefinitely, this often comes at the cost of large biases. Therefore, we also consider simulation approaches that have the option to “terminate”. In Figure 2, sim2 is more efficient than sim1 because it terminates after more simulation steps.
To measure fidelity, in theory, we want to compare the distribution of histories generated by the simulation to the real distribution. Since these distributions may be difficult to represent analytically, in practice we instead use an aggregate scalar statistic $f(h_t)$ as a proxy measure. (We caution that closeness of the proxy statistics is only a necessary condition for closeness of the history distributions, not a sufficient one.) For example, we may take $f(h_t)$ to be the expected policy performance after the learner has received a length-$t$ history $h_t$. Note that the specific choice of $f$ depends on the learner and the environment; some alternative choices include the state visitation distribution in the history $h_t$, or the learner's model parameters. The expectation of $f$ over the distribution of histories can be approximated empirically by averaging results from multiple simulation runs. Suppose we are interested in the fidelity of a simulation from training step $t_1$ to $t_2$. As shown in Figure 2, we can visualize $f$ for $t_1 \le t \le t_2$ as learning curves, and measure the error in simulation as the RMSE between the learning curves from the simulation vs. the ground truth over all steps (alternatively, one may use the mean/max absolute error). Over the interval shown, sim1 is a more “accurate” simulation than sim2, even though sim2 eventually converges to the correct policy whereas sim1 is not able to run until convergence. Comparisons of fidelity are only meaningful over the same interval of steps.
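The RMSE-based fidelity measure described above can be computed as follows. This is a minimal sketch: the function name is ours, and each curve is assumed to already be the statistic $f$ averaged over simulation runs.

```python
import math

def simulation_error(sim_curve, real_curve):
    """RMSE between a simulated and a ground-truth learning curve.

    Each curve is a sequence of the scalar statistic f(h_t), e.g. average
    policy performance at training step t, pre-averaged over runs. The
    comparison only covers the steps both curves share, consistent with
    fidelity being meaningful only over the same interval of steps.
    """
    n = min(len(sim_curve), len(real_curve))
    if n == 0:
        raise ValueError("curves must overlap on at least one step")
    return math.sqrt(sum((s - r) ** 2
                         for s, r in zip(sim_curve[:n], real_curve[:n])) / n)
```

Swapping the squared term for an absolute value gives the mean absolute error variant mentioned above.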
Both efficiency and fidelity are important for offline simulation, yet it is not straightforward to define a single metric that captures both. Since there is usually a trade-off between the two (similar to the bias-variance trade-off), in our experiments below we consider the two aspects separately.
3 Non-parametric & Semi-parametric Offline Simulation
mandel2016offline proposed several non-parametric approaches for OLS in RL, including the queue-based evaluator (QBE) and per-state rejection sampling (PSRS). We focus on PSRS because it was shown to be more efficient than QBE. In PSRS, each transition in the logged dataset is considered only once, and is either accepted and given to the learner, or rejected and discarded. This ensures the unbiasedness of the overall simulation. In Section A.1, we restate these two algorithms with a few modifications that allow for more general settings, such as episodic problems with multiple initial states and non-stationary logging policies.
A key step in PSRS (as the name suggests) is line 3 (Algorithm 3), which groups transitions in the logged dataset into queues based on the source state; the simulation terminates whenever it hits an empty queue. This works reasonably well for tabular MDPs. For block MDPs, a naive approach is to treat the observations as the keys to group transitions by. Since we often have an infinite observation space, every observation is seen in the logged dataset only once, making this approach very inefficient and impractical: the queues would be empty before we are able to simulate for long enough. Fortunately, recent advances in latent state discovery (du2019provably; misra2020kinematic; acstate2022) allow one to recover the unobserved latent states from potentially high-dimensional rich observations. For simulating such block MDPs, we propose to first learn a latent state encoder, preprocess the logged dataset into latent states, and then perform PSRS while grouping transitions by latent state (high-level pseudocode is shown in Algorithm 1; for more details, see Section A.2). Importantly, the simulation interfaces with learners using raw observations and is thus compatible with the learners that would be used with the original block MDP.
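A simplified sketch of the latent-state variant of PSRS is given below. This is our own illustrative code, not the paper's Algorithm 1: it omits episodic resets, non-stationary logging policies, and the feedback loop that would hand accepted transitions to a learning agent, and it assumes each logged transition stores the behavior policy's action distribution at that observation.

```python
import random
from collections import defaultdict, deque

def latent_psrs(dataset, encoder, policy, start_obs, max_steps, seed=0):
    """Simplified per-state rejection sampling with a latent-state encoder.

    dataset: list of (x, a, mu, r, x_next), where mu is the behavior
             policy's action distribution at x (dict: action -> prob).
    encoder: maps a raw observation to a discrete latent state.
    policy:  maps a raw observation to the simulated action distribution.
    """
    rng = random.Random(seed)
    queues = defaultdict(deque)
    for trans in dataset:
        queues[encoder(trans[0])].append(trans)   # group by latent state

    history, x = [], start_obs
    for _ in range(max_steps):
        s, accepted = encoder(x), None
        while queues[s]:
            xq, a, mu, r, x_next = queues[s].popleft()  # each sample used once
            pi = policy(xq)            # policy queried on the raw observation
            m = max(pi[b] / mu[b] for b in pi)  # rejection-sampling constant
            if rng.random() < pi[a] / (m * mu[a]):
                accepted = (xq, a, r, x_next)
                break
        if accepted is None:
            return history             # empty queue: terminate the simulation
        history.append(accepted)
        x = accepted[3]
    return history
```

When the simulated policy equals the behavior policy, the acceptance probability is 1 and every queued sample is used, which is the best case for efficiency.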
4 Proof of Concept Experiments
In this section, we conduct empirical evaluations of offline simulation for simple block MDPs. First, in a problem with known latent states, we show that using the latent states for offline simulation is more efficient than using the observations, without sacrificing simulation fidelity. Building on this result, we then consider a more challenging scenario where the latent states are not known and must be discovered from data, and demonstrate the effectiveness of our semi-parametric simulation approach even when the learned latent state encoder is not perfect. As the building block for our experiments, we introduce a grid world navigation task (Figure 3, left), modified from zintgraf2019varibad and further described in Appendix B.

GridWorld with Discrete Observations. In this setting, we use a discrete observation space induced by an emission function where an observation is made up of the underlying state and a number of noise bits that are stochastically sampled at every time step. We first collected 1,000 episodes following a uniformly random policy, and then used PSRS on the logged dataset to simulate tabular Q-learning. We compared PSRS with the observations vs. PSRS with the underlying states, and in each case we repeated the simulation procedure for 100 runs with different random seeds. We show results for the setting where the observation space is 16 times larger than the state space. Figure 3, middle, shows the learning curves for all runs, where we track the estimated value of the initial state as the statistic $f$, since we are in the tabular setting and have transparency into the learner's internal parameters. In Figure 3, right, we show the aggregate results for fidelity and efficiency. Both are accurate simulations, as their learning curves overlap exactly with those of real Q-learning, but using latent states is more efficient than using raw observations and led to simulations about twice as long for this problem. Additional variations of this experiment are explored in Section B.2.
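The emission process for this setting can be sketched as follows. This is illustrative code in our own names; with four noise bits the observation space is $2^4 = 16$ times larger than the state space, consistent with the factor reported above, but the paper's exact encoding may differ.

```python
import random

def emit(state, n_bits, rng=random):
    """Observation = latent grid state plus n_bits i.i.d. noise bits.

    Each noise bit is re-sampled at every time step, so the noise is
    time-independent (this matters later, in the ExoMDP discussion).
    """
    bits = tuple(rng.randint(0, 1) for _ in range(n_bits))
    return (state,) + bits

def decode(obs):
    # The oracle latent-state encoder simply drops the noise bits.
    return obs[0]
```

Grouping PSRS queues by `decode(obs)` instead of `obs` pools all observations that share a latent state, which is exactly why the latent-state simulation runs roughly twice as long here.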
Figure 4: quantitative results comparing efficiency (median simulation length in epochs, higher is better) and fidelity (RMSE of validation performance curves compared to PPO in a real environment, lower is better); panels (a) and (b) show the state visitation distribution in the logged data and the learned latent states.
GridWorld with Continuous Observations. Next, we consider a more complex observation space, where the observation is a randomly sampled 2D coordinate within the grid cell of the current state (normalized to be between 0 and 1). We similarly collected the logged dataset using a uniformly random policy, leading to the state visitation distribution shown in Figure 4a. Note that since the observation space is now continuous – essentially no two observations are the same – PSRS on the observation space is no longer practical. Therefore, we trained a neural network encoder for discrete latent states for a kinematic inseparability abstraction, using a contrastive estimation objective similar to misra2020kinematic. While we used a latent dimension of 50, the encoder ended up learning only 20 discrete latent states, as shown in Figure 4b. We subsequently used the learned state encoder in offline simulation of a PPO agent using PSRS, and compared it to several baselines. We visualize the learning curves (Figure 5) by tracking policy performance within each training epoch (Figure 5, left) and in a validation environment at the end of each epoch, obtained by averaging the returns of 10 Monte-Carlo rollouts in the true environment (Figure 5, right). Summarized numerical results on efficiency and fidelity of the validation performance are shown in Figure 4, right. Despite the errors in the latent states from the learned encoder, it performs close to the oracle encoder, and both are close to online PPO in the real environment. All other baselines, despite being efficient, are far from accurate simulations and have non-negligible error. Overall, this experiment shows promise that for block MDPs, we can learn to encode observations into latent states and then perform offline simulation in an efficient and accurate manner, completely from offline data. Further experimental details, including an explanation of the baselines used, are in Section B.3.

5 Conclusion
In this work, we studied offline learner simulation (OLS), which allows one to evaluate an agent's learning process using offline datasets. We formally described the evaluation protocol for OLS in terms of efficiency and fidelity, and proposed semi-parametric simulation approaches that handle block MDPs with rich observations by extending existing non-parametric approaches and leveraging recent advances in latent state discovery. Through preliminary experiments, we showed the advantages of this approach even when the learned latent states are not perfectly correct. Code to reproduce the experimental results will be released publicly upon publication of this paper.
Besides applications in recommender systems and robotics, OLS may be especially useful in multi-task and meta-learning settings, where simulation on a subset of tasks may inform us about future adaptive performance on other tasks. It may prove to be a crucial component for success in offline meta-RL (OfflineMetaRL; mitchell2021offline; offlineMetaRLWithOnlineSelfSupervision). Future work should also consider removing the assumption of a discrete latent topology and accounting for exogenous processes (further discussed in Section B.4) to handle a wider class of problems.
References
Appendix A Algorithms
A.1 Non-parametric Simulation for Tabular MDPs
The QBE (Algorithm 2) and PSRS (Algorithm 3) algorithms are based on mandel2016offline. We made the following two modifications to the originally proposed versions to allow for more general settings, such as episodic problems with multiple initial states and non-stationary logging policies.

L4 and L6 in both algorithms, L13-15 in QBE and L16-18 in PSRS: to support episodic settings, we initialize a set of possible starting states in the init_queue, and during simulation we obtain a new starting state from this set whenever an episode ends.

L11-12 in PSRS: to support non-stationary logging policies, assuming the logged dataset has stored the action distribution of the behavior policy from which each observed action was drawn, we recalculate the rejection sampling ratio for each candidate transition in the while loop.
A.2 Semi-parametric Simulation for Block MDPs
In this appendix section we present a more detailed version of Algorithm 1 from the main paper.
We extend Algorithm 3 to support block MDPs by assuming access to an encoder function that maps an observation to its corresponding latent state (recall that under the definition of block MDPs, there is a unique latent state that can emit a given observation). Note that the same extension can be applied to Algorithm 2, but it is omitted here. The main parts of Algorithm 4 follow the vanilla PSRS algorithm, with the following key differences (highlighted in magenta):

L3: the algorithm requires the latent state encoder as an additional input.

L4: we preprocess the observations into corresponding latent states for all transitions in the logged dataset.

L5: when grouping transitions into queues, we use the precomputed latent states as the key instead of the observations.

L6: the set of starting observations is taken as is, without being converted to latent states.

L12-L14: when retrieving from queues, we first compute the latent state of the current observation, and then query the queue corresponding to that latent state.

L15-L18: the policies still use observations as input, maintaining the agent-environment interface of the original block MDP that this algorithm simulates.
A.2.1 Latent State Encoder
Recent work has proposed various ways of learning the latent state encoder for block MDPs, including du2019provably, HOMER (misra2020kinematic), PPE (efroni2021provably), ACState (acstate2022), and AIS (subramanian2022approximate). In this work, we adapt the approach used in HOMER, which learns a latent state abstraction for kinematic inseparability (misra2020kinematic), where two observations are combined into a single latent state if and only if (i) they have the same distribution over next observations, and (ii) they have the same (joint) distribution over previous observations and previous actions. Instead of the online interactive version of the HOMER algorithm, we take its state abstraction learning component and train it on offline data. As proven by misra2020kinematic, the kinematic inseparability abstraction can be learned via a contrastive estimation objective in a supervised classification problem:

$\min_{\phi,\, g}\; \mathbb{E}_{(x, a, x', y)}\left[ -y \log g(\phi(x), a, \phi(x')) - (1 - y) \log\left(1 - g(\phi(x), a, \phi(x'))\right) \right],$

where $\phi$ is the latent state encoder that takes a single observation as input, $g$ is a binary classifier for a transition $(\phi(x), a, \phi(x'))$ in the latent space, and the overall objective minimizes the negative log-likelihood of a binary classification problem. The noise contrastive distribution is constructed as follows: first randomly draw a transition $(x, a, x')$ from the logged dataset; then, with equal probability, either the real tuple $(x, a, x', y{=}1)$ or an imposter tuple $(x, a, \tilde{x}', y{=}0)$ is produced, where $\tilde{x}'$ comes from an independent draw from the logged dataset, and $y$ indicates whether the transition is real (1) or an imposter (0). Instead of pre-constructing this distribution into a full dataset, for an efficient implementation with mini-batch learning, we independently sample twice from the logged dataset and include both the real and the imposter tuples in a batch. For more details on the theoretical guarantees, see misra2020kinematic.

Appendix B Experimental Details on GridWorld
Environment description
We consider a navigation task in a grid world (Figure 6), based on zintgraf2019varibad. The discrete state space is represented by the index of each grid cell. The agent starts from the bottom-left cell, and the goal is the top-right cell. There are five actions: stay, up, right, down, left; each deterministically moves the agent to the adjacent cell in that direction (if the action would move the agent out of bounds, the agent stays in its current cell). A positive reward is given for reaching the goal and a small negative per-step reward otherwise. There is no cap on the maximum episode length.
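A minimal sketch of this environment is given below. The class is our own; the grid size, goal reward, and step penalty are parameters here because their exact values are not restated in this excerpt.

```python
class GridWorld:
    """Illustrative sketch of the navigation task described above."""

    ACTIONS = ("stay", "up", "right", "down", "left")

    def __init__(self, size=5, goal_reward=1.0, step_reward=-0.1):
        # size, goal_reward, and step_reward are assumed values, not
        # the paper's; the structure (start, goal, dynamics) follows the text.
        self.size = size
        self.goal = (size - 1, size - 1)        # top-right cell
        self.goal_reward, self.step_reward = goal_reward, step_reward

    def reset(self):
        return (0, 0)                            # bottom-left cell

    def step(self, state, action):
        x, y = state
        dx, dy = {"stay": (0, 0), "up": (0, 1), "right": (1, 0),
                  "down": (0, -1), "left": (-1, 0)}[self.ACTIONS[action]]
        # Out-of-bounds moves leave the agent in its current cell.
        nx = min(max(x + dx, 0), self.size - 1)
        ny = min(max(y + dy, 0), self.size - 1)
        nstate = (nx, ny)
        r = self.goal_reward if nstate == self.goal else self.step_reward
        return nstate, r
```

The deterministic dynamics are what make the latent-state version of this problem tabular and easy to simulate; all difficulty in the experiments comes from the emission functions layered on top.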
B.1 Illustrative Example of Evaluation Protocol (Figure 2)
We consider an observation space induced by an emission function where an observation $x = (s, c)$ is the latent state $s$ augmented by a modulo-$N$ counter $c$. At every transition, the counter value is incremented by one and then taken modulo $N$, i.e., $c' = (c + 1) \bmod N$. The initial observation samples the counter value uniformly. This setup is motivated by a common way of constructing observations in real-life applications where some cyclical/periodic element is included, such as time of day.
We collected 1,000 episodes using a uniformly random policy in this environment to create the offline dataset. For simulating learners, we used a model-based simulation strategy, which first builds an environment model via maximum likelihood estimation of the transition dynamics (assuming the reward function is given), and then uses the estimated environment model as the simulator that interacts with the agent. We considered two versions of model-based simulation: sim1 builds the environment model in the observation space, and sim2 does so in the latent state space (assuming the latent state is recorded in the data) and uses a uniform emission function over the counter values.
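The maximum-likelihood transition model underlying sim1 and sim2 can be sketched as a simple count-based estimator. This is illustrative code with our own function name; sim1 would call it on (observation, action, next observation) tuples and sim2 on (latent state, action, next latent state) tuples.

```python
from collections import defaultdict

def fit_mle_model(dataset):
    """Maximum-likelihood tabular transition model from logged data.

    dataset: list of (s, a, s_next) tuples. Returns a dict mapping
    (s, a) -> {s_next: estimated probability}. The reward function is
    assumed given, as in the experiment above.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for s, a, s_next in dataset:
        counts[(s, a)][s_next] += 1
    model = {}
    for sa, nexts in counts.items():
        total = sum(nexts.values())
        model[sa] = {sn: c / total for sn, c in nexts.items()}
    return model
```

Because sim1's key space is the full observation space, each (observation, action) pair has very few samples, which is the source of its bias relative to sim2.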
For Figure 2, we used the model-based simulations described above to simulate Q-learning with ε-greedy exploration, using a fixed exploration rate, learning rate, and discount factor. Each model-based simulation is repeated for 100 runs using different random seeds, and the results are averaged. This is compared to online Q-learning using the same hyperparameters (also averaged over 100 repeated runs). To illustrate the trade-off between fidelity and efficiency, we truncated the simulation with the observation-space model to 50,000 training steps, and the simulation with the latent-state model to 100,000 steps. We see from Figure 2 that using the observation-space model does not yield a high-fidelity simulation, since its learning curve starts to deviate from online Q-learning after around 20,000 training steps.

B.2 GridWorld with Discrete Observations
For this experiment, we use a discrete observation space induced by an emission function where an observation is made up of the underlying state and $m$ noisy bits that are stochastically sampled at every time step, so that the observation space is $2^m$ times larger than the state space. For the numerical study we considered several values of $m$; for illustration, we show the value of $m$ with the clearest trend.
We first collected 1,000 episodes following a uniformly random policy, and then used PSRS on the logged dataset to simulate different “learners”. Depending on the type of learner, we used a different statistic $f$ to measure the simulation fidelity. We compared PSRS with the observations and with the underlying states, and in each case we repeated the simulation procedure for 100 runs with different random seeds. Both simulations are compared to the learner interacting with the real environment, also over 100 repeated runs.
Setting 1: Agent performs Monte-Carlo evaluation of a uniformly random policy. Here, instead of tracking progress over many steps, we use the final MC policy value estimate as the statistic $f$. Because the evaluated policy is identical to the logging policy, sampled transitions in PSRS are never rejected, and PSRS with latent states can make use of all the samples. Compared to using the observations, using latent states led to simulations that are roughly two times longer and produce a tighter confidence interval in the resulting value estimate. Both are unbiased estimates of the ground-truth policy value.
Setting 2: Agent performs Monte-Carlo evaluation of a near-optimal policy. We first learned the optimal Q-function in the true environment, and then constructed an ε-soft policy from it. We used the same statistic $f$ as in Setting 1. Because the evaluated policy differs from the logging policy, some of the sampled transitions in PSRS are rejected, and the simulations are much shorter (a fraction of the length in Setting 1). The general trend from Setting 1 still holds.
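The ε-soft construction can be sketched as follows. The helper name is ours, and the exact ε used in the experiment is not restated in this excerpt, so it is left as a parameter.

```python
def soft_policy(q_row, eps):
    """epsilon-soft policy derived from one row of a learned Q-table.

    The greedy action gets probability 1 - eps + eps/|A|, and every other
    action gets eps/|A|, so all actions retain non-zero probability (which
    is what makes rejection sampling against a stochastic logging policy
    possible at all).
    """
    n = len(q_row)
    greedy = max(range(n), key=lambda a: q_row[a])
    probs = [eps / n] * n
    probs[greedy] += 1.0 - eps
    return probs
```

The further this policy is from the uniform logging policy (i.e., the smaller ε), the more PSRS samples are rejected, which is why the Setting 2 simulations are much shorter than Setting 1's.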
Setting 3: Agent performs Q-learning with ε-greedy exploration. This setting is discussed in Figure 3. Here, the learner is a tabular Q-learning agent with zero initialization, using ε-greedy exploration with a constant exploration rate.
Discussion
For the setting with a discrete observation space, we presented results for the simplest case, where the observation is the latent state augmented with independently sampled noise bits. Future work should consider additional variations, such as observations containing a counter that increments at each step (modulo a maximum number), or observations made up of the history of previous latent states and previous actions. For these settings, we hypothesize that further modifications to PSRS are necessary to enable accurate and efficient offline simulations.
B.3 GridWorld with Continuous Observations
We consider an observation space induced by an emission function where noise values are randomly drawn at each time step: each observation is a 2D Cartesian coordinate in the unit box in which the entire grid world lies. We collected a logged dataset containing 1,000,000 transitions using a uniformly random policy; all recorded observations are visualized in Figure 7, left. Because the behavior policy takes each of the five actions with equal probability (essentially a random walk), the state visitation in the logged dataset is more concentrated toward the lower-left region around the starting state, and less frequent toward the upper-right region around the goal cell.
Since the observation space is now continuous – essentially no two observations are the same – vanilla PSRS on the observation space is no longer practical. Therefore, we need to apply Algorithm 4 instead, which requires a mapping from observations to latent states. For learning the latent state encoder from offline data, we performed a 50-50 split of the logged dataset to create training and validation sets of individual transitions. In order to learn discrete latent states for a kinematic inseparability abstraction, we used an architecture and contrastive estimation objective similar to misra2020kinematic. The encoder network is a neural network with one hidden layer of adjustable size and leaky ReLU activations. It is applied to both the current and the next observation, mapping each observation to a vector of logits whose length is the latent dimension. We incorporated both the backward and forward discrete bottlenecks by applying Gumbel-softmax layers to the encodings of both the current and next observations. The contrastive learning objective is a classification task that tries to distinguish between observed and imposter transitions; in our implementation, the classifier is another one-hidden-layer network with the same hidden layer size and activation as the encoder. To implement this training objective, in the training loop we iterate through two batches of data (shuffled differently): the observed transitions get classification label 1 and the imposter transitions get classification label 0. We performed a hyperparameter search over the latent dimension, network capacity, learning rate, and different initializations, and selected the final model based on the best validation performance (measured by the classification loss). The final model has a latent dimension of 50 (with the learning rate and hidden layer size chosen by the search). The learned latent states did not use the full capacity available and are not a perfect match with the ground-truth latent states, as shown in Figure 7; some of them combine multiple true latent states (e.g., learned state 13 combines true states 4 and 9, learned state 8 combines true states 0 and 3, and the combined states are not necessarily contiguous), while others subdivide one true latent state into multiple learned states (e.g., true state 12 is divided into learned states 10 and 47).

To verify whether the learned latent states are sensible, we conducted a small qualitative experiment comparing rollouts in the PSRS simulation vs. in the real environment when the same sequence of actions is taken by the agent (this is achieved by first performing the PSRS rollouts, and then replaying the action sequence generated by PSRS in the real environment). We used a near-optimal ε-soft policy derived from the optimal Q-function in the real environment. As shown in Figure 8, the PSRS simulation led to a different distribution of trajectories compared to rollouts in the real environment.
In particular, because the learned latent states combine certain non-adjacent states, the simulated trajectories are likely to jump from the starting state to the lower-right region, which does not happen in the real environment. For example, in Figure 9, left, the simulated trajectory deviates substantially from the real trajectory, whereas in other cases, such as Figure 9, middle and right, the simulated trajectory matches well with what would be observed in the real environment.
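The noise-contrastive batch construction used to train the encoder can be sketched as follows. This is a simplified, framework-free sketch: `make_contrastive_batch` is our name, and the actual implementation iterates over shuffled mini-batches of a large dataset rather than over the whole dataset at once.

```python
import random

def make_contrastive_batch(dataset, rng=random):
    """Build (transition, label) pairs for the contrastive objective.

    dataset: list of observed transitions (x, a, x_next).
    For each real transition (label 1), an imposter (label 0) is created
    by replacing x_next with the next observation of an independently
    drawn transition, following the noise-contrastive construction used
    for the kinematic inseparability objective.
    """
    batch = []
    for x, a, x_next in dataset:
        batch.append(((x, a, x_next), 1))        # real transition
        _, _, fake_next = rng.choice(dataset)    # independent draw
        batch.append(((x, a, fake_next), 0))     # imposter transition
    return batch
```

A binary classifier trained on such batches, with both observations passed through the (discretized) encoder, can only solve the task using information that survives the latent bottleneck, which is what drives the encoder toward a kinematic inseparability abstraction.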
We subsequently used the learned state encoder in offline simulation of a PPO agent. We based our PPO implementation on a public repository (https://github.com/openai/spinningup/tree/master/spinup/algos/pytorch/ppo) and used the following hyperparameters: a fixed discount factor; an actor-critic network with two hidden layers of 32 units each and ReLU activations; and training for up to 50 epochs with 5,000 steps per epoch. We track two quantities as the statistic $f$:

the learning performance, i.e., the average episode return of the learning agent within each training epoch (shown in Figure 5, left);

the validation performance at the end of every epoch, estimated by the average return of the current policy over 10 full episodes (Figure 5, right; Figure 10; Figure 12, left). We also report average episode lengths, which are inversely correlated with episode returns (Figure 11; Figure 12, right).
In addition, we compared to the following baselines:

PSRS oracle: in place of the learned encoder, an oracle encoder is used that maps observations back to their ground-truth latent states;

PSRS obs only: within PSRS, ignores the action probabilities (i.e., the rejection sampling ratio) and simply accepts the next transition from the queue of the corresponding state;

PSRS act only: within PSRS, ignores the per-state queues and randomly draws a transition, but respects the action probabilities and the rejection sampling ratio;

PSRS random: randomly draws a next observation from all observed transitions, ignoring both the state and the action.
Each simulation setting is repeated for 10 runs, and the results are averaged. We also performed 10 runs of the PPO agent in the real environment. We visualize the learning curves of validation episode returns and validation episode lengths for each simulation (and the real runs) in Figures 10 and 11, and summarize the average results in Figure 12. Despite the errors in the latent states from the learned encoder, it performs close to the oracle encoder, and both are close to online PPO in the real environment. All other baselines, despite being efficient, are far from accurate simulations and have non-negligible error. Overall, this experiment shows promise that for block MDPs, we can learn to encode observations into latent states and then perform offline simulation in an efficient and accurate manner, completely from offline data.
B.4 Simulating ExoMDPs
Here, we revisit the setting considered in Section B.1 and apply our proposed PSRS-based simulation. Recall that in this problem, we have an observation space induced by an emission function where an observation $x = (s, c)$ is the latent state $s$ augmented by a modulo-$N$ counter $c$. At every transition, the counter value is incremented by one and then taken modulo $N$, i.e., $c' = (c + 1) \bmod N$. The initial observation samples the counter value uniformly. This setup is motivated by a common way of constructing observations in real-life applications where some cyclical/periodic element is included, such as time of day.
We collected 1,000 episodes using a uniformly random policy in this environment to create the offline dataset. We then applied PSRS on the logged dataset to simulate tabular Q-learning, comparing PSRS with the observations vs. PSRS with the underlying latent states; in each case, we repeated the simulation procedure for 100 runs with different random seeds. As expected, PSRS with the latent states (green in Figure 13) is more efficient than PSRS with the raw observations (red) and led to longer simulations. However, perhaps surprisingly, PSRS using the latent states is not an accurate simulation. What makes this problem different from the one considered in Section B.2? In both problems (grid world with noise bits vs. grid world with a modulo counter), the underlying grid world MDP is identical; so are the (presumed) latent state spaces and the controllable latent dynamics.
The main difference comes from the transition dynamics in the observation space. For the grid world with noise bits, since we draw the bits randomly at each step, the exogenous noise is time-independent. For the grid world with the modulo counter, however, the “noise” is time-dependent and forms an exogenous process: the counter evolves deterministically as $c' = (c + 1) \bmod N$, independently of the agent's actions. When we apply Algorithm 4 using only the latent states, we destroy the exogenous process in the counter, and the simulation produces impossible transitions, e.g., counter transitions that skip values or move backwards. Since the learner we are simulating operates on the raw observation space, such implausible transitions lead to inaccurate simulations.
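The failure mode above can be checked mechanically: replaying transitions grouped only by the latent grid state produces counter sequences that violate the deterministic update $c' = (c + 1) \bmod N$. A small helper (our own, for illustration) that counts such violations:

```python
def counter_violations(replayed_obs, n):
    """Count impossible counter transitions in a replayed observation
    sequence, where each observation is (latent_state, counter) and the
    true exogenous dynamics are c' = (c + 1) % n.
    """
    return sum(1 for (_, c), (_, c2) in zip(replayed_obs, replayed_obs[1:])
               if c2 != (c + 1) % n)
```

A rollout from the real environment always scores zero here, while a latent-only PSRS rollout generally does not, which is precisely the inaccuracy observed in Figure 13.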
Here, we provide one solution for simulating these types of exogenous MDPs. We assume access to the following oracles: a split oracle that splits an observation into its endogenous part and its exogenous part, a merge oracle that performs the inverse and merges the endogenous and exogenous parts back into an observation, and a reward oracle. We modify the latent PSRS Algorithm 4 to account for exogenous processes as shown in Algorithm 5, where we separately maintain a queue of exogenous states and use it to preserve the dynamics of the exogenous process during the simulation. We highlight the main algorithmic modifications in magenta. Applying Algorithm 5 to simulate Q-learning on this problem (blue in Figure 13) leads to both more efficient and more accurate simulations, compared to the two approaches considered earlier in this section.
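One step of the exogenous-aware simulation can be sketched as follows. This is a heavily simplified sketch of the idea behind Algorithm 5, not the algorithm itself: the oracle names `split`, `merge`, and `reward_fn` follow the text above, the exogenous process is replayed from a queue in its logged order, and rejection sampling against the learner's policy is omitted.

```python
from collections import defaultdict, deque

def exo_psrs_step(queues, exo_queue, obs, split, merge, reward_fn):
    """One step of an exogenous-aware PSRS-style simulation (sketch).

    split(obs) -> (endogenous_state, exogenous_state); merge inverts it;
    reward_fn(obs, action) gives the reward. queues maps endogenous states
    to deques of (s, a, s_next) endogenous transitions; exo_queue replays
    the exogenous process in its original order, keeping its dynamics
    intact instead of shuffling them away.
    Returns (action, reward, next_obs), or None to terminate when a
    queue is exhausted.
    """
    s, _ = split(obs)
    if not queues[s] or not exo_queue:
        return None                      # terminate the simulation
    _, a, s_next = queues[s].popleft()   # endogenous part from PSRS queues
    e_next = exo_queue.popleft()         # exogenous part from its own queue
    return a, reward_fn(obs, a), merge(s_next, e_next)
```

Because the exogenous part is advanced by its own recorded dynamics rather than re-drawn per endogenous state, replayed observations such as the modulo counter stay coherent.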
Note that the experiments above only show what might be required for efficient and accurate simulation of ExoMDPs, and we imposed strong assumptions via the oracles. In practice, such oracles are rarely given directly and must be learned from data; we believe this is an important direction for future research.
Moreover, these experiments highlight an important consideration when choosing the type of latent state discovery algorithm. For example, the ACState formulation proposed in acstate2022 explicitly extracts the agent-controllable latent states while removing any exogenous noise or processes (in their example, where the observation is an image of a scene containing a robot arm, the video playing on a TV screen in the background is not controllable and is thus exogenous). Applying latent PSRS (Algorithm 4) using the latent states discovered by ACState would lead to inaccurate simulations; e.g., the simulated observation sequence would not form a coherent video on the TV screen. We note that this may be acceptable as long as the simulated learner ignores the background. However, if the exogenous states (though uncontrollable) are otherwise important to the learner (e.g., they determine the reward), then such simulations are not reliable.