Optimizing for the Future in Non-Stationary MDPs

by Yash Chandak, et al.

Most reinforcement learning methods are based upon the key assumption that the transition dynamics and reward functions are fixed, that is, the underlying Markov decision process (MDP) is stationary. However, in many practical real-world applications, this assumption is often violated. We discuss how current methods can have inherent limitations for non-stationary MDPs, and therefore searching for a policy that is good for the future, unknown MDP, requires rethinking the optimization paradigm. To address this problem, we develop a method that builds upon ideas from both counter-factual reasoning and curve-fitting to proactively search for a good future policy, without ever modeling the underlying non-stationarity. Interestingly, we observe that minimizing performance over some of the data from past episodes might be beneficial when searching for a policy that maximizes future performance. The effectiveness of the proposed method is demonstrated on problems motivated by real-world applications.



1 Introduction

The flexibility offered by Markov decision processes (MDPs) to model a large class of problems, combined with modern reinforcement learning (RL) algorithms, has provided several success stories in various domains (Mnih et al., 2013; Silver et al., 2017; Sutton and Barto, 2018). However, real-world applications of reinforcement learning are still lacking (Dulac-Arnold et al., 2019).

One reason for the lack of real-world applications is that most existing methods implicitly assume that the environment remains stationary over time. This assumption is often violated in practical problems of interest. For example, consider an assistive driving system. Over time, tires suffer from wear and tear, leading to increased friction and thus a change in the system dynamics. Similarly, in almost all human-computer interaction applications, e.g., automated medical care, dialogue systems, and marketing, human behavior changes over time. In such scenarios, if the automated system is not adapted to take such changes into account, or if it is adapted only after observing such changes, then the system might quickly become sub-optimal, incurring severe loss (Moore et al., 2014). This raises our main question: how do we build systems that proactively search for a policy that will be good for the future MDP?

In this paper we present a policy gradient based approach to search for a policy that maximizes the forecasted future performance when the environment is non-stationary. To capture the impact of changes in the environment on a policy’s performance, first, the performance of each policy of interest over the past episodes is estimated using counter-factual reasoning. Subsequently, a regression curve is fit over these estimates to model the performance trend of each policy over time, thereby enabling the prediction of future performance. This performance forecast is then used to develop an efficient gradient-based optimization procedure that can proactively search for a policy that will perform well in the future. The proposed method has several key advantages:

  • It does not require modeling the transition or reward function in a non-stationary environment, and thus scales gracefully with respect to the number of states and actions in the environment.

  • It is data-efficient in that it leverages all available data.

  • For ease of prediction, it concisely models the effect of changes in the environment on a policy’s performance using a univariate time-series.

  • It mitigates performance lag by proactively optimizing performance for episodes in both the immediate and near future.

  • It degenerates to an estimator of the ordinary policy gradient if the system is stationary, meaning that there is little reason not to use our approach if there is a possibility that the system might be non-stationary.

As a passing remark, we note that even when a reinforcement learning agent is being trained on a stationary environment, the observed transition tuples come from a ‘non-stationary’ distribution. This is due to the changing state distribution induced by updates in the policy parameters over the course of the training. While such non-stationarity exists in our setup as well, it is not the focus of this work. Here, ‘non-stationarity’ refers to the transition dynamics and reward function of an environment changing across episodes as described further in Section 3.

2 Notation

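An MDP $\mathcal{M}$ is a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma, d^0)$, where $\mathcal{S}$ is the set of possible states, $\mathcal{A}$ is the set of actions, $\mathcal{P}$ is the transition function, $\mathcal{R}$ is the reward function, $\gamma$ is the discount factor, and $d^0$ is the start state distribution. Let $R(s, a)$ denote the expected reward of taking action $a$ in state $s$. For any given set $\mathcal{X}$, we use $\Delta(\mathcal{X})$ to denote the set of distributions over $\mathcal{X}$. A policy $\pi : \mathcal{S} \to \Delta(\mathcal{A})$ is a distribution over the actions conditioned on the state. When $\pi$ is parameterized using $\theta$, we denote it as $\pi^\theta$. In a non-stationary setting, as the MDP changes over time, we use $\mathcal{M}_k$ to denote the MDP during episode $k$. In general, we will use sub-script $k$ to denote the episode number and a super-script $t$ to denote the time-step within an episode. $S_k^t$, $A_k^t$, and $R_k^t$ are the random variables corresponding to the state, action, and reward at time step $t$, in episode $k$. Let $H_k$ denote a trajectory in episode $k$: $(s_k^0, a_k^0, r_k^0, s_k^1, a_k^1, \ldots, s_k^T)$, where $T$ is the finite horizon. The value function evaluated at state $s$, during episode $k$, under a policy $\pi$ is $v_k^\pi(s) = \mathbb{E}\left[\sum_{j=0}^{T-t} \gamma^j R_k^{t+j} \mid S_k^t = s, \pi\right]$, where conditioning on $\pi$ denotes that the trajectory in episode $k$ was sampled using $\pi$. The start state objective for a policy $\pi$, in episode $k$, is defined as $J_k(\pi) := \sum_{s} d^0(s) v_k^\pi(s)$. Let $J_k^* = \max_\pi J_k(\pi)$ be the performance of an optimal policy for $\mathcal{M}_k$. Often we write $J_k(\theta)$ in place of $J_k(\pi^\theta)$ when the dependence on $\theta$ is important.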

3 Problem Statement

To model non-stationarity, we let an exogenous process change the MDP from $\mathcal{M}_k$ to $\mathcal{M}_{k+1}$, i.e., between episodes. Let $(\mathcal{M}_k)_{k=1}^{\infty}$ represent a sequence of MDPs, where each MDP $\mathcal{M}_k$ is denoted by the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}_k, \mathcal{R}_k, \gamma, d^0)$.

In many problems, like adapting to friction in robotics, human-machine interaction, etc., the transition dynamics and reward function change, but every other aspect of the MDP remains the same throughout. Therefore, we make the following assumption.

Assumption 1 (Similar MDPs).

For any two MDPs, $\mathcal{M}_k$ and $\mathcal{M}_{k+1}$, the state set $\mathcal{S}$, the action set $\mathcal{A}$, the starting distribution $d^0$, and the discount factor $\gamma$ are the same.

If the exogenous process changing the MDPs is arbitrary and changes it in unreasonable ways, then there is little hope of finding a good policy for the future MDP as $\mathcal{M}_{k+1}$ can be wildly different from everything the agent has observed by interacting with the past MDPs, $\mathcal{M}_1, \ldots, \mathcal{M}_k$. However, in many practical problems of interest, such changes are smooth and have an underlying (unknown) structure. To make the problem tractable, we therefore assume that both the transition dynamics $(\mathcal{P}_1, \mathcal{P}_2, \ldots)$, and the reward functions $(\mathcal{R}_1, \mathcal{R}_2, \ldots)$ vary smoothly over time.

Assumption 2 (Smooth Changes).

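There exist (unknown and small) constants $\epsilon_P$ and $\epsilon_R$ such that between all successive MDPs, $\mathcal{M}_k$ and $\mathcal{M}_{k+1}$, $\forall k \in \mathbb{N}_{>0}$, and for all $s \in \mathcal{S}$, $a \in \mathcal{A}$,

$$\left\| \mathcal{P}_k(\cdot \mid s, a) - \mathcal{P}_{k+1}(\cdot \mid s, a) \right\|_1 \le \epsilon_P, \tag{1}$$

$$\left| R_k(s, a) - R_{k+1}(s, a) \right| \le \epsilon_R. \tag{2}$$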


Problem Statement. Under Assumptions 1 and 2, we seek to find a sequence of policies $(\pi_k)_{k=1}^{\infty}$ that minimizes lifelong regret:

$$\sum_{k=1}^{\infty} \left( J_k^* - J_k(\pi_k) \right).$$

4 Related Work

The problem of non-stationarity has a long history and no effort is enough to thoroughly review it. Here, we briefly touch upon the most relevant work and defer a more detailed literature review to the appendix.

Perhaps the work most closely related to ours is that of Al-Shedivat et al. (2017). They consider a setting where an agent is required to solve test tasks that have different transition dynamics than the training tasks. Using meta-learning, they aim to use training tasks to find an initialization vector for the policy parameters that can be quickly fine-tuned when facing tasks in the test set. In many real-world problems, however, access to such independent training tasks may not be available a priori. In this work, we are interested in the continually changing setting where there is no boundary between training and testing tasks. As such, we show how their proposed online adaptation technique that fine-tunes parameters, by discarding past data and only using samples observed online, can create performance lag and can therefore be data-inefficient. In settings where training and testing tasks do exist, our method can be leveraged to better adapt during test time, starting from any desired parameter vector.

Recent work by Finn et al. (2019) aims at bridging both the continuously changing setting and the train-test setting for supervised-learning problems. They propose continuously improving an underlying parameter initialization vector and running a Follow-The-Leader (FTL) algorithm (Shalev-Shwartz et al., 2012) every time new data is observed. A naive adaptation of this for RL would require access to all the underlying MDPs in the past for continuously updating the initialization vector, which would be impractical. Doing this efficiently remains an open question and our method is complementary to choosing the initialization vector. Additionally, FTL based adaptation always lags in tracking optimal performance as it uniformly maximizes performance over all the past samples that might not be directly related to the future. Further, we show that by explicitly capturing the trend in the non-stationarity, we can mitigate this performance lag resulting from the use of an FTL algorithm during the adaptation process.

The problem of adapting to non-stationarity is also related to continual learning (Ring, 1994), lifelong-learning (Thrun, 1998), and meta-learning (Schmidhuber, 1999). Several meta-learning based approaches for fine-tuning a (mixture of) trained model(s) using samples observed during a similar task at test time have been proposed (Nagabandi et al., 2018a, b). Other works have also shown how models of the environment can be used for continual learning (Lu et al., 2019) or combined with model predictive control (Wagener et al., 2019). We focus on the model-free paradigm and our approach is complementary to these model-based methods.

More importantly, in many real-world applications, it can be infeasible to update the system frequently if it involves high computational or monetary expense. In such a case, even optimizing for the immediate future might be greedy and sub-optimal. The system should optimize for a longer term in the future, to compensate for the time until the next update is performed. None of the prior approaches can efficiently tackle this problem.

5 Optimizing for the Future

The problem of minimizing lifelong regret is straightforward if the agent has access to sufficient samples, in advance, from the future environment, $\mathcal{M}_{k+1}$, that it is going to face (where $k$ denotes the current episode number). That is, if we could estimate the start-state objective, $J_{k+1}(\pi)$, for the future MDP $\mathcal{M}_{k+1}$, then we could search for a policy $\pi$ whose performance is close to $J_{k+1}^*$. However, obtaining even a single sample from the future is impossible, let alone getting a sufficient number of samples. This necessitates rethinking the optimization paradigm for searching for a policy that performs well when faced with the future unknown MDP. There are two immediate challenges here:

  1. How can we estimate $J_{k+1}(\pi)$ without any samples from $\mathcal{M}_{k+1}$?

  2. How can gradients, $dJ_{k+1}(\pi^\theta)/d\theta$, of this future performance be estimated?

In this section we address both of these issues using the following idea. When the transition dynamics $(\mathcal{P}_1, \mathcal{P}_2, \ldots)$, and the reward functions $(\mathcal{R}_1, \mathcal{R}_2, \ldots)$ are changing smoothly (Assumption 2), the performances $(J_1(\pi), J_2(\pi), \ldots)$ of any policy $\pi$ will also vary smoothly over time. The impact of smooth changes in the environment thus manifests as smooth changes in the performance of any policy, $\pi$. In cases where there is an underlying, unknown, structure in the changes of the environment, one can now ask: if the performances $J_{1:k}(\pi)$ of $\pi$ over the course of past episodes were known, can we analyze the trend in its past performances to find a policy that maximizes future performance $J_{k+1}(\pi)$?

5.1 Forecasting Future Performance

In this section we address the first challenge of estimating future performance and pose it as a time series forecasting problem.

Broadly, this requires two components: (a) A procedure to compute past performances, $J_{1:k}(\pi)$, of $\pi$. (b) A procedure to create an estimate, $\hat{J}_{k+1}(\pi)$, of $\pi$’s future performance, $J_{k+1}(\pi)$, using these estimated values from the first component. An illustration of this idea is provided in Figure 1.

Figure 1: An illustration, where the blue and red filled circles represent counter-factually reasoned performance estimates of policies $\pi_1$ and $\pi_2$, respectively, using data collected from a given policy $\beta$. The open circles represent the forecasted performance of $\pi_1$ and $\pi_2$ estimated by fitting a curve on the counter-factual estimates represented by filled circles.

Component (a). As we do not have access to the past MDPs for computing the true values of past performances, $J_{1:k}(\pi)$, we propose computing estimates, $\hat{J}_{1:k}(\pi)$, of them from the observed data. That is, in a non-stationary MDP, starting with the fixed transition matrix $\mathcal{P}_1$ and the reward function $\mathcal{R}_1$, we want to estimate the performance $J_k(\pi)$ of a given policy in episode $k$. Leveraging the fact that the changes to the underlying MDP are due to an exogenous process, we can estimate $J_k(\pi)$ as,

$$J_k(\pi) = \sum_{t=0}^{T} \gamma^t \, \mathbb{E}\left[ R_k^t \mid \pi, \mathcal{P}_k, \mathcal{R}_k \right], \tag{3}$$

where $\mathcal{P}_k$ and $\mathcal{R}_k$ are also random variables. Next we describe how an estimate of $J_k(\pi)$ can be obtained from (3) using information only from the $k$-th episode.

To get an unbiased estimate, $\hat{J}_k(\pi)$, of $\pi$’s performance during episode $k$, consider the past trajectory $H_k$ of the $k$-th episode that was observed when executing a policy $\beta_k$. By using counter-factual reasoning (Rosenbaum and Rubin, 1983) and leveraging the per-decision importance sampling (PDIS) estimator (Precup, 2000), an unbiased estimate of $J_k(\pi)$ is thus given by:

$$\hat{J}_k(\pi) := \sum_{t=0}^{T} \left( \prod_{l=0}^{t} \frac{\pi(A_k^l \mid S_k^l)}{\beta_k(A_k^l \mid S_k^l)} \right) \gamma^t R_k^t. \tag{4}$$

(Footnote 1: We assume that the distribution of $H_k$ has full support over the set of all possible trajectories of the MDP $\mathcal{M}_k$.)

It is worth noting that computing (4) does not require storing all the past policies $\beta_k$; one needs to only store the actions and the probabilities with which these actions were chosen.
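As an illustration of the per-decision importance sampling estimate described above, the following is a minimal sketch for a single logged episode. The function name `pdis_estimate` and its argument layout are assumptions made for this example; the only inputs are the logged rewards, the probabilities with which the behavior policy chose the logged actions, and the probabilities the evaluated policy assigns to those same actions.

```python
import numpy as np

def pdis_estimate(pi_probs, beta_probs, rewards, gamma):
    """Per-decision importance sampling estimate for one logged episode.

    pi_probs[t]   : probability the evaluated policy gives to the logged action at step t
    beta_probs[t] : probability with which the behavior policy chose that action
    rewards[t]    : reward observed at step t
    gamma         : discount factor
    """
    pi_probs = np.asarray(pi_probs, dtype=float)
    beta_probs = np.asarray(beta_probs, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    # rho[t] is the product of the per-step probability ratios up to and including t.
    rho = np.cumprod(pi_probs / beta_probs)
    discounts = gamma ** np.arange(len(rewards))
    # Each reward is re-weighted only by the ratios of the decisions that preceded it.
    return float(np.sum(rho * discounts * rewards))
```

When the evaluated policy coincides with the behavior policy, every ratio is one and the estimate reduces to the ordinary discounted return of the episode.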

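Component (b). To obtain the second component, which captures the structure in $\hat{J}_{1:k}(\pi)$ and predicts future performances, we make use of a forecasting function $\Psi$ that estimates future performance $\hat{J}_{k+1}(\pi)$ conditioned on the past performances:

$$\hat{J}_{k+1}(\pi) := \Psi\left(\hat{J}_1(\pi), \hat{J}_2(\pi), \ldots, \hat{J}_k(\pi)\right). \tag{5}$$

While $\Psi$ can be any forecasting function, we consider $\Psi$ to be an ordinary least squares (OLS) regression model with parameters $w \in \mathbb{R}^{d \times 1}$, and the following input and output variables,

$$X := [1, 2, \ldots, k]^\top \in \mathbb{R}^{k \times 1}, \qquad Y := [\hat{J}_1(\pi), \hat{J}_2(\pi), \ldots, \hat{J}_k(\pi)]^\top \in \mathbb{R}^{k \times 1}.$$

For any $x \in X$, let $\phi(x) \in \mathbb{R}^{1 \times d}$ denote a $d$-dimensional basis function for encoding the time index. For example, an identity basis $\phi(x) := \{x, 1\}$, or a Fourier basis, where $\phi(x)$ collects a constant term together with sine and cosine functions of the time index.

Let $\Phi \in \mathbb{R}^{k \times d}$ be the corresponding basis matrix. The solution to the above least squares problem is $w = (\Phi^\top \Phi)^{-1} \Phi^\top Y$ (Bishop, 2006) and the forecast of the future performance can be obtained using,

$$\hat{J}_{k+1}(\pi) = \phi(k+1)\, w = \phi(k+1) (\Phi^\top \Phi)^{-1} \Phi^\top Y. \tag{8}$$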
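The least-squares forecast described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper names, the `order` and `period` parameters of the Fourier features, and the choice of `numpy.linalg.lstsq` instead of an explicit matrix inverse are all assumptions made for this example.

```python
import numpy as np

def identity_basis(x):
    """Encode time index x as [x, 1] (linear trend plus intercept)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    return np.column_stack([x, np.ones_like(x)])

def fourier_basis(x, order, period):
    """Encode x as [1, cos(2*pi*n*x/period)..., sin(2*pi*n*x/period)...] for n = 1..order.

    `order` and `period` are hypothetical knobs chosen for this sketch.
    """
    x = np.atleast_1d(np.asarray(x, dtype=float))
    n = np.arange(1, order + 1)
    angles = 2.0 * np.pi * np.outer(x, n) / period
    return np.hstack([np.ones((len(x), 1)), np.cos(angles), np.sin(angles)])

def ols_forecast(past_estimates, basis, steps_ahead=1):
    """Fit least squares on episodes 1..k and evaluate the fit at k + steps_ahead."""
    y = np.asarray(past_estimates, dtype=float)
    k = len(y)
    Phi = basis(np.arange(1, k + 1))            # k x d basis matrix
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None) # least-squares coefficients
    return float((basis(k + steps_ahead) @ w)[0])
```

For instance, `ols_forecast(estimates, identity_basis)` extrapolates a straight line through the past estimates, while `ols_forecast(estimates, lambda x: fourier_basis(x, order=2, period=20.0))` allows a smooth non-linear trend.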


This procedure enjoys an important advantage: by just using a univariate time-series to estimate future performance, it bypasses the need for modeling the environment, which can be prohibitively hard or even impossible. Further, note that $\Phi^\top \Phi \in \mathbb{R}^{d \times d}$, where $d \ll k$ typically, and thus the cost of computing the matrix inverse is negligible. These advantages allow this procedure to gracefully scale to more challenging problems, while being robust to the size, $|\mathcal{S}|$ or $|\mathcal{A}|$, of the state set or the action set.

5.2 Differentiating Forecasted Future Performance

Figure 2: The proposed method from the lens of differentiable programming. At any time $k$, we aim to optimize the policy’s parameters, $\theta$, to maximize its performance in the future, $J_{k+1}(\theta)$. However, conventional methods (dotted arrows) cannot be used to directly optimize for this. In this work, we achieve this as a composition of two programs: one which connects the policy’s parameters to its past performances, and the other which forecasts future performance as a function of these past performances. The optimization procedure then corresponds to taking derivatives through this composition of programs to update policy parameters in a direction that maximizes future performance. Arrows (a) and (b) correspond to the respective terms marked in (10).

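In the previous section, we addressed the first challenge and showed how to proactively estimate future performance, $\hat{J}_{k+1}(\pi)$, of a policy $\pi$ by explicitly modeling the trend in its past performances $\hat{J}_{1:k}(\pi)$. In this section, we address the second challenge to facilitate a complete optimization procedure. A pictorial illustration of the idea is provided in Figure 2.

Gradients for $\hat{J}_{k+1}(\theta)$ with respect to $\theta$ can be obtained as follows,

$$\frac{d \hat{J}_{k+1}(\theta)}{d\theta} = \frac{d\, \Psi\left(\hat{J}_1(\theta), \ldots, \hat{J}_k(\theta)\right)}{d\theta} \tag{9}$$

$$= \sum_{i=1}^{k} \underbrace{\frac{\partial\, \Psi\left(\hat{J}_1(\theta), \ldots, \hat{J}_k(\theta)\right)}{\partial \hat{J}_i(\theta)}}_{(a)} \cdot \underbrace{\frac{d \hat{J}_i(\theta)}{d\theta}}_{(b)}. \tag{10}$$

The decomposition in (10) has an elegant intuitive interpretation. The terms assigned to (a) in (10) correspond to how the future prediction would change as a function of past outcomes, and the terms in (b) indicate how the past outcomes would change due to changes in the parameters of the policy $\pi^\theta$. In the next paragraphs, we discuss how to obtain the terms (a) and (b).

To obtain term (a), note that in (8), $\hat{J}_i(\theta)$ corresponds to the $i$-th element of $Y$, and so using (5) the gradients of the terms (a) in (10) are,

$$\frac{\partial \hat{J}_{k+1}(\theta)}{\partial \hat{J}_i(\theta)} = \frac{\partial\, \phi(k+1) (\Phi^\top \Phi)^{-1} \Phi^\top Y}{\partial Y_i} \tag{11}$$

$$= \left[ \phi(k+1) (\Phi^\top \Phi)^{-1} \Phi^\top \right]_i, \tag{12}$$

where $[Z]_i$ represents the $i$-th element of a vector $Z$. Therefore, (12) is the gradient of predicted future performance with respect to an estimated past performance.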

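The term (b) in (10) corresponds to the gradient of the PDIS estimate $\hat{J}_i(\theta)$ of the past performance with respect to policy parameters $\theta$. The following Property provides a form for (b) that makes its computation straightforward.

Property 1 (PDIS gradient).

Let $\rho_i(0, l) := \prod_{j=0}^{l} \frac{\pi^\theta(A_i^j \mid S_i^j)}{\beta_i(A_i^j \mid S_i^j)}$. Then,

$$\frac{d \hat{J}_i(\theta)}{d\theta} = \sum_{t=0}^{T} \frac{\partial \log \pi^\theta(A_i^t \mid S_i^t)}{\partial \theta} \left( \sum_{l=t}^{T} \rho_i(0, l)\, \gamma^l R_i^l \right). \tag{13}$$

Proof. See Appendix B. ∎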
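The gradient of a per-decision importance sampling estimate with respect to the policy parameters can be sketched as below. The form used here follows from differentiating the cumulative probability ratios: each step's score vector (the gradient of the log-probability of the logged action) is multiplied by the ratio-weighted discounted rewards from that step onward. The function name and the convention of passing precomputed score vectors are assumptions made for this example.

```python
import numpy as np

def pdis_gradient(grad_log_pi, pi_probs, beta_probs, rewards, gamma):
    """Gradient of the PDIS estimate of one episode with respect to theta.

    grad_log_pi[t] : gradient of log pi_theta(a_t | s_t) w.r.t. theta, shape (T+1, d)
    pi_probs[t]    : pi_theta(a_t | s_t) for the logged action
    beta_probs[t]  : behavior-policy probability of the logged action
    """
    grad_log_pi = np.asarray(grad_log_pi, dtype=float)
    ratios = np.asarray(pi_probs, dtype=float) / np.asarray(beta_probs, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    rho = np.cumprod(ratios)                                   # ratio product up to step l
    weighted = rho * gamma ** np.arange(len(rewards)) * rewards
    # tail[t] = sum over l >= t of rho(0, l) * gamma^l * R^l
    tail = np.cumsum(weighted[::-1])[::-1]
    return grad_log_pi.T @ tail
```

A finite-difference comparison against the PDIS estimate itself is a convenient sanity check when wiring this into a particular policy class.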

5.3 Algorithm

We provide a sketch of our proposed Prognosticator procedure for optimizing the future performance of the policy in Algorithm 1. To make the method more practical, we incorporated two additional modifications to reduce computational cost and variance.

First, it is often desirable to perform an update only after a certain episode interval $\delta$ to reduce computational cost. This raises the question: if a newly found policy will be executed for the next $\delta$ episodes, should we choose this new policy to maximize performance on just the single next episode, or to maximize the average performance over the next $\delta$ episodes? An advantage of our proposed method is that we can easily tune how far in the future we want to optimize for. Thus, to minimize lifelong regret, we propose optimizing for the mean performance over the next $\delta$ episodes. That is, $\arg\max_\theta \frac{1}{\delta} \sum_{\Delta=1}^{\delta} \hat{J}_{k+\Delta}(\theta)$.
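Optimizing for the mean forecast over an update interval can be sketched as follows. This minimal example assumes a linear-plus-intercept time encoding and a hypothetical function name; it fits one regression to the past estimates and averages the fitted curve over the next several episodes instead of reading off only the next one.

```python
import numpy as np

def mean_future_forecast(past_estimates, delta):
    """Average least-squares forecast over the next `delta` episodes.

    Uses the encoding [x, 1] of the episode index x (an assumption for this sketch).
    """
    y = np.asarray(past_estimates, dtype=float)
    k = len(y)

    def basis(x):
        return np.column_stack([x, np.ones_like(x)])

    Phi = basis(np.arange(1, k + 1, dtype=float))
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    # Forecast episodes k+1, ..., k+delta and take their mean.
    future = basis(np.arange(k + 1, k + delta + 1, dtype=float)) @ w
    return float(future.mean())
```

With `delta=1` this coincides with the single-step forecast; larger values trade responsiveness to the immediate next episode for better average performance until the next update.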

Second, notice that if the policy becomes too close to deterministic, there would be two undesired consequences. (a) The policy will not cause exploration, precluding the agent from observing any changes to the environment in states that it no longer revisits, even though such changes might make entering those states worthwhile to the agent. (b) In the future, when estimating $\hat{J}_{k+2}(\pi^\theta)$ using the past performance $\hat{J}_{k+1}(\pi^\theta)$, importance sampling will have high variance if the policy executed during episode $k+1$ is close to deterministic. To mitigate these issues, we add an entropy regularizer during policy optimization. More details are available in Appendix D.


6 Understanding the Behavior of Prognosticator

Notice that as the scalar term (a) is multiplied by the PDIS gradient term (b) in (10), the gradient of future performance can be viewed as a weighted sum of off-policy policy gradients. In Figure 3, we provide visualization of the weights $\partial \hat{J}_{k+1}(\theta) / \partial \hat{J}_i(\theta)$ for PDIS gradients of each episode $i$, when the performance for episode $k+1$ is forecasted using data from the past $k$ episodes. For the specific setting when $\Psi$ is an OLS estimator, these weights are independent of $Y$ in (12) and their pattern remains constant for any given sequence of MDPs.

Importantly, note the occurrence of negative weights in Figure 3 when the identity basis or Fourier basis is used, suggesting that the optimization procedure should move towards a policy that had lower performance in some of the past episodes. While this negative weighting seems unusual at first glance, it has an intriguing interpretation. We discuss this in the following paragraphs.

To better understand these negative weights, consider a qualitative comparison when weights from different methods in Figure 3 are used along with the performance estimates of policies $\pi_1$ and $\pi_2$ in Figure 1. Despite having lower estimates of return everywhere, $\pi_2$’s rising trend suggests that it might have higher performance in the future, that is, $J_{k+1}(\pi_2) > J_{k+1}(\pi_1)$. Existing online learning methods like FTL maximize performance on all the past data uniformly (green curve in Figure 3). Similarly, the exponential weights (red curve in Figure 3) are representative of approaches that only optimize using data from recent episodes and discard previous data. Either of these methods that use only non-negative weights can never capture the trend to forecast $J_{k+1}(\pi_2) > J_{k+1}(\pi_1)$. However, the weights obtained when using the identity basis would facilitate minimization of performances in the distant past and maximization of performance in the recent past. Intuitively, this means that it moves towards a policy whose performance is on a linear rise, as it expects that policy to have better performance in the future.
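The contrast between the weighting schemes discussed above can be reproduced numerically. The sketch below computes the weights that a least-squares forecast with a linear-plus-intercept time encoding places on each past episode, alongside uniform weights (as an FTL-style stand-in) and normalized exponential weights. The function names and the `decay` parameter are assumptions made for this example.

```python
import numpy as np

def identity_basis_weights(k):
    """Weight of each of k past estimates in the one-step-ahead linear forecast."""
    x = np.arange(1, k + 1, dtype=float)
    Phi = np.column_stack([x, np.ones(k)])
    phi_next = np.array([k + 1.0, 1.0])
    # Row vector phi(k+1) (Phi^T Phi)^{-1} Phi^T, computed via a linear solve.
    return Phi @ np.linalg.solve(Phi.T @ Phi, phi_next)

def uniform_weights(k):
    """Equal weight on every past episode."""
    return np.full(k, 1.0 / k)

def exponential_weights(k, decay=0.5):
    """Non-negative weights that favor recent episodes (`decay` is hypothetical)."""
    w = decay ** np.arange(k - 1, -1, -1)
    return w / w.sum()

# Toy estimates: a flat policy that is better everywhere in the past,
# and a rising policy that is worse everywhere in the past.
flat_policy = np.full(5, 5.5)
rising_policy = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
for name, w in [("identity", identity_basis_weights(5)),
                ("uniform", uniform_weights(5)),
                ("exponential", exponential_weights(5))]:
    print(name, np.round(w, 3), float(w @ flat_policy), float(w @ rising_policy))
```

In this toy setting only the identity-basis weights, which are negative for the earliest episodes, score the rising policy above the flat one; both non-negative schemes prefer the flat policy.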

While weights from the identity basis are useful for forecasting whether $J_{k+1}(\pi_2) > J_{k+1}(\pi_1)$, it cannot be expected that the trend will always be linear as in Figure 1. To be more flexible and allow for any smooth trend, we opt to use the Fourier basis in our experiments. Observe the alternating sign of weights in Figure 3 when using the Fourier basis. This suggests that the optimization procedure will take into account the sequential differences in performances over the past, thereby favoring the policy that has shown the most performance increments in the past. This also avoids restricting the performance trend of a policy to be linear.

Figure 3: The value of $\partial \hat{J}_{k+1}(\theta) / \partial \hat{J}_i(\theta)$ for all values of $i$ using different basis functions to encode the time index. Notice that many weights are negative when using the identity or Fourier bases.

7 Mitigating Variance

While model-free algorithms for finding a good policy are scalable to large problems, they tend to suffer from high variance (Greensmith et al., 2004). In particular, the use of importance sampling estimators can increase the variance further (Guo et al., 2017). In our setup, high variance in estimates of past performances $\hat{J}_{1:k}(\pi)$ of $\pi$ can hinder capturing $\pi$’s performance trend, thereby making the forecasts less reliable.

Notice that a major source of variance is the availability of only a single trajectory sample per MDP $\mathcal{M}_i$, for all $i$. If this trajectory $H_i$, generated using $\beta_i$, is likely when using $\beta_i$ but has near-zero probability when using $\pi$, then the estimated $\hat{J}_i(\pi)$ is also nearly zero. While $\hat{J}_i(\pi)$ is an unbiased estimate of $J_i(\pi)$, information provided by this single $H_i$ is of little use to evaluate $J_i(\pi)$. Subsequently, discarding this $\hat{J}_i(\pi)$ from time-series analysis, rather than setting it to be $0$, can make the time series forecast more robust against outliers. In comparison, if trajectory $H_i$ is unlikely when using $\beta_i$ but likely when using $\pi$, then not only is $H_i$ very useful for estimating $J_i(\pi)$ but it also has a lower chance of occurring in the future, so this trajectory must be emphasized when making a forecast. Such a process of (de-)emphasizing estimates of past returns using the collected data itself can introduce bias, but this bias might be beneficial in this few-sample regime.

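To capture this idea formally, we build upon the insights of Hachiya et al. (2012) and Mahmood et al. (2014), who draw an equivalence between weighted least-squares (WLS) estimation and the weighted importance sampling (WIS) (Precup, 2000) estimator. Particularly, let $G_i$ be the discounted return of the $i$-th trajectory observed from a stationary MDP, and $\rho_i$ be the importance ratio of the entire trajectory. The WIS estimator, $\hat{J}^\dagger(\pi)$, of the performance of $\pi$ in a stationary MDP can then be obtained as,

$$\hat{J}^\dagger(\pi) := \operatorname*{arg\,min}_{c \in \mathbb{R}} \frac{1}{n} \sum_{i=1}^{n} \rho_i \left( G_i - c \right)^2 = \frac{\sum_{i=1}^{n} \rho_i G_i}{\sum_{i=1}^{n} \rho_i}. \tag{14}$$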
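The equivalence between weighted importance sampling and weighted least squares with a constant regressor can be checked numerically. The sketch below computes the WIS estimate as a ratio-weighted average of returns, and separately solves the weighted least-squares problem for a single constant by scaling each row with the square root of its weight. Both function names are assumptions made for this example.

```python
import numpy as np

def wis_estimate(returns, ratios):
    """Weighted importance sampling: ratio-weighted average of observed returns."""
    returns = np.asarray(returns, dtype=float)
    ratios = np.asarray(ratios, dtype=float)
    return float(np.sum(ratios * returns) / np.sum(ratios))

def wis_via_weighted_least_squares(returns, ratios):
    """Minimize sum_i rho_i * (G_i - c)^2 over a single constant c."""
    returns = np.asarray(returns, dtype=float)
    root = np.sqrt(np.asarray(ratios, dtype=float))
    # Scaling rows by sqrt(rho_i) turns the weighted problem into an ordinary one.
    design = root[:, None]
    c, *_ = np.linalg.lstsq(design, root * returns, rcond=None)
    return float(c[0])
```

The two functions agree up to floating-point error, which is the observation that motivates replacing the constant regressor with a time-dependent basis in the non-stationary setting.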


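To mitigate variance in our setup, we propose extending WIS. In the non-stationary setting, to perform WIS while capturing the trend in performance over time, we use a modified forecasting function $\Psi^\dagger$, which is a weighted least-squares regression model with a $d$-dimensional basis function $\phi$, and parameters $w^\dagger \in \mathbb{R}^{d \times 1}$,

$$w^\dagger := \operatorname*{arg\,min}_{c \in \mathbb{R}^{d \times 1}} \frac{1}{k} \sum_{i=1}^{k} \rho_i \left( G_i - \phi(i)\, c \right)^2. \tag{15}$$

Let $\Lambda \in \mathbb{R}^{k \times k}$ be a diagonal weight matrix such that $\Lambda_{ii} = \rho_i$, let $\Phi \in \mathbb{R}^{k \times d}$ be the basis matrix, and let the following be input and output variables,

$$X := [1, 2, \ldots, k]^\top \in \mathbb{R}^{k \times 1}, \qquad Y := [G_1, G_2, \ldots, G_k]^\top \in \mathbb{R}^{k \times 1}.$$

The solution to the weighted least squares problem in (15) is then given by $w^\dagger = (\Phi^\top \Lambda \Phi)^{-1} \Phi^\top \Lambda Y$ and the forecast of the future performance can be obtained using,

$$\hat{J}^\dagger_{k+1}(\pi) = \phi(k+1)\, w^\dagger = \phi(k+1) (\Phi^\top \Lambda \Phi)^{-1} \Phi^\top \Lambda Y.$$

$\hat{J}^\dagger_{k+1}(\pi)$ has several desired properties. It incorporates a notion of how relevant each observed trajectory is towards forecasting, while also capturing the desired trend in performance. The forecasts are less sensitive to the importance sampling variances and the entire process is still fully differentiable.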
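A weighted least-squares forecast of this kind can be sketched as follows. The example assumes a linear-plus-intercept encoding of the episode index and a hypothetical function name; the regression targets are the observed discounted returns and each episode is weighted by the importance ratio of its whole trajectory.

```python
import numpy as np

def wls_forecast(returns, ratios, steps_ahead=1):
    """Weighted least-squares forecast of performance at episode k + steps_ahead.

    returns[i] : discounted return observed in episode i+1 under the behavior policy
    ratios[i]  : importance ratio of the whole trajectory of episode i+1
    """
    G = np.asarray(returns, dtype=float)
    rho = np.asarray(ratios, dtype=float)
    k = len(G)
    x = np.arange(1, k + 1, dtype=float)
    Phi = np.column_stack([x, np.ones(k)])   # encoding [x, 1], an assumption here
    A = Phi.T @ (rho[:, None] * Phi)         # Phi^T Lambda Phi
    b = Phi.T @ (rho * G)                    # Phi^T Lambda Y
    w = np.linalg.solve(A, b)
    return float(np.array([k + steps_ahead, 1.0]) @ w)
```

A trajectory with a near-zero ratio contributes almost nothing to the fit, so an outlying return from an episode that the evaluated policy would rarely produce does not distort the trend.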

8 Strictly Generalizing the Stationary Setting

As the agent is unaware of how the environment is changing, a natural question to ask is: what if the agent wrongly assumed a stationary environment was non-stationary? What is the quality of the agent’s performance forecasts? What is the impact of the negative weights on past evaluations of a policy’s performance? Here we answer these questions.

Before stating the formal results, we introduce some necessary notation and two assumptions. Let $J(\pi)$ be the performance of policy $\pi$ for a stationary MDP. Let $\hat{J}_{k+\delta}(\pi)$ and $\hat{J}^\dagger_{k+\delta}(\pi)$ be the non-stationary importance sampling (NIS) and non-stationary weighted importance sampling (NWIS) estimators of performance $\delta$ episodes in future. Further, let the basis function $\phi$ used for encoding the time index in both NIS and NWIS be such that it satisfies the following conditions: (a) $\phi(\cdot)$ always contains $1$ to incorporate a bias/intercept coefficient in least-squares regression (for example, $\phi(\cdot) = [\phi_1(\cdot), \ldots, \phi_{d-1}(\cdot), 1]$, where $\phi_i(\cdot)$ are arbitrary functions). (b) $\Phi$ has full column rank such that $(\Phi^\top \Phi)^{-1}$ exists. Both these properties are trivially satisfied by most basis functions.

With this notation and these assumptions, we then have the following results indicating that NIS is unbiased like ordinary importance sampling and NWIS is biased like weighted importance sampling.

Theorem 1 (Unbiased NIS).

For all $\delta \ge 1$, $\hat{J}_{k+\delta}(\pi)$ is an unbiased estimator of $J(\pi)$, that is, $\mathbb{E}[\hat{J}_{k+\delta}(\pi)] = J(\pi)$.

Theorem 2 (Biased NWIS).

For all $\delta \ge 1$, $\hat{J}^\dagger_{k+\delta}(\pi)$ may be a biased estimator of $J(\pi)$, that is, $\mathbb{E}[\hat{J}^\dagger_{k+\delta}(\pi)] = J(\pi)$ does not always hold.

Theorem 3 (Consistent NIS).

For all $\delta \ge 1$, $\hat{J}_{k+\delta}(\pi)$ is a consistent estimator of $J(\pi)$, that is, $\hat{J}_{k+\delta}(\pi) \to J(\pi)$ as $k \to \infty$.

Theorem 4 (Consistent NWIS).

For all $\delta \ge 1$, $\hat{J}^\dagger_{k+\delta}(\pi)$ is a consistent estimator of $J(\pi)$, that is, $\hat{J}^\dagger_{k+\delta}(\pi) \to J(\pi)$ as $k \to \infty$.

Proof. See Appendix A for all of these proofs. ∎
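The unbiasedness claim for the stationary case can be probed with a small simulation. The sketch below is an illustration under stated assumptions, not a proof: it replaces the importance-sampled per-episode estimates with a constant true value plus zero-mean Gaussian noise, fits a least-squares line with an intercept to each simulated history, and averages the resulting one-step forecasts over many trials. The function names, noise model, and parameter values are all hypothetical.

```python
import numpy as np

def linear_forecast(y, steps_ahead=1):
    """Least-squares forecast using an intercept and a linear time index."""
    k = len(y)
    x = np.arange(1, k + 1, dtype=float)
    Phi = np.column_stack([np.ones(k), x])
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w[0] + w[1] * (k + steps_ahead)

def average_forecast(true_value, k, trials, seed):
    """Mean forecast over many simulated histories from a stationary process.

    Each history holds k unbiased but noisy estimates of the same true value;
    Gaussian noise is a stand-in for importance sampling noise.
    """
    rng = np.random.default_rng(seed)
    forecasts = [linear_forecast(true_value + rng.normal(size=k))
                 for _ in range(trials)]
    return float(np.mean(forecasts))
```

Even though individual forecasts use negative weights on early episodes, their average stays close to the true value, because the weights of a least-squares forecast with an intercept term sum to one.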

Since NWIS is biased and consistent like the WIS estimator, we expect it to have similar variance reduction properties that can potentially make the optimization process more efficient in a non-stationary MDP.

9 Empirical Analysis

This section presents empirical evaluations using several environments inspired by real-world applications that exhibit non-stationarity. In the following paragraphs, we briefly discuss each environment; a more detailed description is available in Appendix D.

Figure 4: Blood-glucose level of an in-silico patient for 24 hours (one episode). Humps in the graph occur at times when a meal is consumed by the patient.

Figure 5: Best performances of all the algorithms obtained by conducting a hyper-parameter sweep over multiple hyper-parameter combinations per algorithm, per environment. For each hyper-parameter setting, repeated trials were executed for the recommender system and the goal reacher environments, and a separate number of trials for the diabetes treatment environment. Error bars correspond to the standard error. The x-axis represents how fast the environment is changing and the y-axis represents regret (lower is better). Individual learning curves for each speed, for each domain, are available in the appendix.

Non-stationary Diabetes Treatment: This environment is based on an open-source implementation (Xie, 2019) of the FDA approved Type-1 Diabetes Mellitus simulator (T1DMS) (Man et al., 2014) for treatment of Type-1 Diabetes. Each episode consists of a day in an in-silico patient’s life. Consumption of a meal increases the blood-glucose level in the body. If the blood-glucose level becomes too high, the patient suffers from hyperglycemia, and if the level becomes too low, the patient suffers from hypoglycemia. The goal is to control the blood-glucose level of a patient by regulating the insulin dosage to minimize the risk associated with both hyper and hypoglycemia.

However, the insulin sensitivity of a patient’s internal body organs varies over time, inducing non-stationarity that should be accounted for. In the T1DMS simulator, we induce this non-stationarity by oscillating the body parameters (e.g., insulin sensitivity, rate of glucose absorption, etc.) between two known configurations available in the simulator.

Non-stationary Recommender System: In this environment a recommender engine interacts with a user whose interests in different items fluctuate over time. In particular, the rewards associated with each item vary in seasonal cycles. The goal is to maximize revenue by recommending an item that the user is most interested in at any time.

Non-stationary Goal Reacher: This is a 2D environment with four (left, right, up, and down) actions and a continuous state set representing the Cartesian coordinates. The goal is to make the agent reach a moving goal position.

For all of the above environments, we regulate the speed of non-stationarity to characterize an algorithm’s ability to adapt. Higher speed corresponds to a greater amount of non-stationarity; a speed of zero indicates that the environment is stationary.

We consider the following algorithms for comparison:

Prognosticator: Two variants of our algorithm, Pro-OLS and Pro-WLS, which use OLS and WLS estimators for $\Psi$.

ONPG: Similar to the adaptation technique presented by Al-Shedivat et al. (2017), this baseline performs purely online optimization by fine-tuning the existing policy using only the trajectory being observed online.

FTRL-PG: Similar to the adaptation technique presented by Finn et al. (2019), this baseline performs Follow-the-(regularized)-leader optimization by maximizing performance over both the current and all the past trajectories.

9.1 Results

In the non-stationary recommender system, as the exact value of $J_k^*$ is available from the simulator, we can compute the true value of regret. However, for the non-stationary goal reacher and diabetes treatment environments, as $J_k^*$ is not known for any $k$, we use a surrogate measure for regret. That is, let $\tilde{J}_k^*$ be the maximum return obtained in episode $k$ by any algorithm; we then use the shortfall of a policy $\pi$’s return relative to $\tilde{J}_k^*$, accumulated over episodes, as the surrogate regret for $\pi$.

In the non-stationary recommender system, all the methods perform nearly the same when the environment is stationary. FTRL-PG has a slight edge over ONPG when the environment is stationary as all the past data is directly indicative of the future MDP. It is interesting to note that while FTRL-PG works the best for the stationary setting in the recommender system and the goal reacher task, it is not the best in the diabetes treatment task as it can suffer from high variance. We discuss the impact of variance in later paragraphs.

With the increase in the speed of non-stationarity, the performance of both baselines deteriorates quickly. Of the two, ONPG is better able to mitigate performance lag as it discards all the past data. In contrast, both the proposed methods, Pro-OLS and Pro-WLS, can leverage all the past data to better capture the impact of non-stationarity and thus are consistently more robust to the changes in the environment.

In the non-stationary goal reacher environment, a trend similar to the one above is observed. While considering all the past data equally is useful for FTRL-PG in the stationary setting, it creates drastic performance lag as the speed of the non-stationarity increases. Between Pro-OLS and Pro-WLS, in the stationary setting, once the agent nearly solves the task, all subsequent trajectories come from nearly the same distribution and thus the variance resulting from the importance sampling ratios is not severe. In such a case, where the variance is low, Pro-WLS has less advantage over Pro-OLS and additionally suffers from being biased. However, as the non-stationarity increases, the optimal policy keeps changing and there is a higher discrepancy between the distributions of past and current trajectories. This makes the lower variance property of Pro-WLS particularly useful. Having the ability to better capture the underlying trend, both Pro-OLS and Pro-WLS consistently perform better than the baselines when there is non-stationarity.

The non-stationary diabetes treatment environment is particularly challenging as it has a continuous action set. This makes importance-sampling-based estimators subject to much higher variance. Consequently, Pro-OLS is not able to reliably capture the impact of non-stationarity and performs similarly to ONPG. In comparison, the combination of both high variance and performance lag makes FTRL-PG perform poorly across all the speeds. The most advantageous algorithm in this environment is Pro-WLS. As it is designed to better tackle variance stemming from importance sampling, Pro-WLS is able to efficiently use the past data to capture the underlying trend and performs well across all the speeds of non-stationarity.

10 Conclusion

We presented a policy gradient-based algorithm that combines counter-factual reasoning with curve-fitting to proactively search for a good policy for future MDPs. Irrespective of whether the environment is stationary or non-stationary, the proposed method can leverage all the past data, and in non-stationary settings it can proactively optimize for future performance as well. Therefore, our method provides a single solution for mitigating performance lag and being data-efficient.

While the proposed algorithm has several desirable properties, many open questions remain. In our experiments, we noticed that the proposed algorithm is particularly sensitive to the value of the entropy regularizer. Keeping it too high prevents the policy from adapting quickly. Keeping it too low lets the policy overfit to the forecast and become close to deterministic, thereby increasing the variance of subsequent importance sampling estimates of policy performance. While we resorted to hyper-parameter search, leveraging methods that adapt the regularizer automatically might be fruitful (Haarnoja et al., 2018).

Our framework highlights new research directions for studying bias-variance trade-offs in the non-stationary setting. While tackling the problem from the point of view of a univariate time-series is advantageous as the model-bias of the environment can be reduced, this can result in higher variance in the forecasted performance. Developing lower variance off-policy performance estimators is also an active research direction that directly complements our algorithm. In particular, a partial model of the environment is often available, and using it through doubly-robust estimators (Jiang and Li, 2015; Thomas and Brunskill, 2016) is an interesting future direction.

Further, there are other forecasting functions, such as kernel regression, Gaussian processes, and ARIMA, as well as break-point detection algorithms, that can potentially be used to incorporate more domain knowledge into the forecasting function or to make it robust to jumps in the time series.

11 Acknowledgement

Part of the work was done when the first author was an intern at Adobe Research, San Jose. The research was later supported by generous gifts from Adobe Research. We are thankful to Ian Gemp, Scott M. Jordan, and Chris Nota for insightful discussions and for providing valuable feedback.



Appendix A Properties of NIS and NWIS Estimators

Here we provide proofs for the properties of the NIS and NWIS estimators. While NIS and NWIS are developed for the non-stationary setting, these properties ensure that the estimators strictly generalize to the stationary setting as well. That is, when used in the stationary setting, the NIS estimator is both unbiased and consistent like the PDIS estimator, and the NWIS estimator is biased and consistent like the WIS estimator. When used in the non-stationary setting, NIS and NWIS can provide more accurate estimates of future performance, as they explicitly model the trend of a policy's performance over time.

Our proof technique draws inspiration from the results presented by Mahmood et al. (2014). The primary difference is that their results are for estimators that are developed only for the stationary setting and cannot be readily used for the non-stationary setting, unlike ours. The key modification that we make to leverage their proof technique is that, instead of using the features of the state as the input and the observed return from that state as the output of the regression function, we use the features of the time index of an episode as the input and the observed return for that episode as the output. In their setup, states are drawn stochastically from a distribution, so their analysis is not directly applicable to our setting, where the inputs (time indices) form a deterministic sequence. For the analysis of our estimators, we leverage techniques discussed by Greene (2003) for analyzing properties of the ordinary least squares estimator.

Before proceeding, we impose the following constraints on the set of policies and on the basis function $\phi$ used for encoding the time index in both NIS and NWIS.

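  • (a) $\phi(\cdot)$ always contains $1$ to incorporate a bias coefficient in least-squares regression (for example, $\phi(\cdot) = [\phi_1(\cdot), \dots, \phi_{d-1}(\cdot), 1]$, where $\phi_i(\cdot)$ are arbitrary functions).

  • (b) There exists a finite constant $C_1$ such that $|\phi_i(\cdot)| < C_1$ for all $i$.

  • (c) $\Phi$ has full column rank, such that $(\Phi^\top \Phi)^{-1}$ exists.

  • (d) We only consider the set of policies $\Pi$ that have non-zero probability of taking any action in any state. That is, there exists $C_2 > 0$ such that $\pi(a|s) > C_2$ for all $\pi \in \Pi$, $s \in \mathcal{S}$, and $a \in \mathcal{A}$.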

Satisfying condition (a) is straightforward as it is already done by all basis functions. Intuitively, this constraint ensures that the regression problem is not ill-defined and that there exists a model in the model class that can capture a fixed constant, which corresponds to the absence of any trend. This is useful for our purpose because, in the stationary setting, there is no trend in the expected performance between episodes for any given policy. The forecaster should be capable of inferring this fact from the data and representing it. That is, the optimal model parameter is [2]

$$w^* := [0, 0, \dots, 0, J(\pi)]^\top,$$

such that for any $k$,

$$\phi(k)\, w^* = [\phi_1(k), \dots, \phi_{d-1}(k), 1]\,[0, \dots, 0, J(\pi)]^\top = J(\pi). \tag{19}$$

[2] If domain knowledge is available to select an appropriate basis function that can represent the performance trend of all the policies for the required non-stationary environment, then all the following finite-sample and large-sample properties can be extended to that environment as well.

Conditions (b) and (c) are also readily satisfied by popular basis functions. For example, features obtained using the Fourier basis are bounded by $[-1, 1]$, and features from the polynomial/identity basis are also bounded when inputs are adequately normalized. Further, when the basis function does not repeat any feature and the number of samples exceeds the number of features, condition (c) is satisfied. This ensures that the least-squares problem is well-defined and has a unique solution.

Condition (d) ensures that the denominator in any importance ratio is always bounded below, such that the importance ratios are bounded above. This condition implies that the importance sampling estimator for any evaluation policy has finite variance. Use of entropy regularization with common policy parameterizations (softmax/Gaussian) can prevent violation of this condition.
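As a rough illustration of conditions (a) to (c), the sketch below builds a feature matrix from a univariate cosine (Fourier) basis of the normalized episode index and checks the three properties numerically. The helper names `fourier_features` and `column_rank`, the basis order, and the number of episodes are assumptions made for this example, not values from the paper.

```python
import math


def fourier_features(t, horizon, order=3):
    """Cosine features of the normalized time index t / horizon.

    The constant feature 1 is placed last so it can act as the bias term.
    """
    x = t / horizon
    return [math.cos(math.pi * n * x) for n in range(1, order + 1)] + [1.0]


def column_rank(rows, tol=1e-9):
    """Numerical rank via Gaussian elimination with partial pivoting."""
    m = [list(r) for r in rows]
    n_rows, n_cols = len(m), len(m[0])
    rank = 0
    for c in range(n_cols):
        pivot = max(range(rank, n_rows), key=lambda r: abs(m[r][c]))
        if abs(m[pivot][c]) < tol:
            continue  # no usable pivot: column depends on earlier ones
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, n_rows):
            f = m[r][c] / m[rank][c]
            m[r] = [a - f * b for a, b in zip(m[r], m[rank])]
        rank += 1
        if rank == n_rows:
            break
    return rank


k = 10  # hypothetical number of past episodes
Phi = [fourier_features(t, horizon=k) for t in range(1, k + 1)]

has_bias = all(row[-1] == 1.0 for row in Phi)                 # condition (a)
bounded = all(abs(v) <= 1.0 for row in Phi for v in row)      # condition (b)
full_rank = column_rank(Phi) == len(Phi[0])                   # condition (c)
print(has_bias, bounded, full_rank)

# Repeating a feature breaks condition (c): the rank no longer matches
# the number of columns.
Phi_dup = [row + [row[0]] for row in Phi]
print(column_rank(Phi_dup), len(Phi_dup[0]))
```

The duplicated-feature case shows why the text requires that the basis function not repeat any feature: the Gram matrix $\Phi^\top \Phi$ is then singular and the least-squares solution is no longer unique.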

a.1 Finite Sample Properties

In this subsection, finite sample properties of NIS and NWIS are presented. Specifically, it is established that NIS is an unbiased estimator, whereas NWIS is a biased estimator, of $J(\pi)$, the performance of a policy $\pi$ in a stationary MDP.

Theorem 1 (Unbiased NIS).

For all $\pi \in \Pi$, the NIS estimator $\hat{J}_{k+\delta}(\pi)$ is an unbiased estimator of $J(\pi)$, that is, $\mathbb{E}[\hat{J}_{k+\delta}(\pi)] = J(\pi)$.


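Recall from (8) that

$$\hat{J}_{k+\delta}(\pi) = \phi(k+\delta)(\Phi^\top \Phi)^{-1}\Phi^\top Y.$$

Therefore, the expected value of $\hat{J}_{k+\delta}(\pi)$ is

$$\mathbb{E}\big[\hat{J}_{k+\delta}(\pi)\big] = \mathbb{E}\big[\phi(k+\delta)(\Phi^\top \Phi)^{-1}\Phi^\top Y\big] = \phi(k+\delta)(\Phi^\top \Phi)^{-1}\Phi^\top \mathbb{E}[Y].$$

As the PDIS estimator is unbiased and the MDP is stationary, the expected value of each element of $Y$ is $J(\pi)$. Let $\mathbf{J}$ denote the vector of the same size as $Y$ in which all elements are $J(\pi)$; then

$$\mathbb{E}\big[\hat{J}_{k+\delta}(\pi)\big] = \phi(k+\delta)(\Phi^\top \Phi)^{-1}\Phi^\top \mathbf{J}. \tag{23}$$

Now using (19), which gives $\mathbf{J} = \Phi w^*$ for the optimal model parameter $w^*$, in (23),

$$\mathbb{E}\big[\hat{J}_{k+\delta}(\pi)\big] = \phi(k+\delta)(\Phi^\top \Phi)^{-1}\Phi^\top \Phi\, w^* = \phi(k+\delta)\, w^* = J(\pi). \qquad ∎$$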


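(Alternate) Here we present an alternate proof of Theorem 1 that does not require invoking $w^*$.

$$\mathbb{E}\big[\hat{J}_{k+\delta}(\pi)\big] = \mathbb{E}\big[\phi(k+\delta)(\Phi^\top \Phi)^{-1}\Phi^\top Y\big] \overset{(a)}{=} \mathbb{E}\Big[\sum_{i=1}^{k} \big[\phi(k+\delta)(\Phi^\top \Phi)^{-1}\Phi^\top\big]_i Y_i\Big] \overset{(b)}{=} \sum_{i=1}^{k} \big[\phi(k+\delta)(\Phi^\top \Phi)^{-1}\Phi^\top\big]_i\, \mathbb{E}[Y_i],$$

where (a) is the dot product written as a summation, and (b) holds because the multiplicative constants are fixed values, as given in (12). Since the environment is stationary, $\mathbb{E}[Y_i] = J(\pi)$ for all $i$; therefore,

$$\mathbb{E}\big[\hat{J}_{k+\delta}(\pi)\big] = J(\pi) \sum_{i=1}^{k} \big[\phi(k+\delta)(\Phi^\top \Phi)^{-1}\Phi^\top\big]_i. \tag{31}$$

In the following we focus on the terms inside the summation in (31). Without loss of generality, assume that for a given matrix of features $\Phi$, the feature corresponding to the value $1$ is in the last column of $\Phi$. Let $B := (\Phi^\top \Phi)^{-1}\Phi^\top \in \mathbb{R}^{d \times k}$, and let $A \in \mathbb{R}^{k \times (d-1)}$ be the submatrix of $\Phi$ that has all features of $\Phi$ except the ones column, $\mathbf{1} \in \mathbb{R}^{k \times 1}$. Let $I$ be the identity matrix in $\mathbb{R}^{(d-1) \times (d-1)}$; then it can be seen that $(\Phi^\top \Phi)^{-1}\Phi^\top \Phi$ can be expressed as

$$B\Phi = [BA,\; B\mathbf{1}] = \begin{bmatrix} I & 0 \\ 0^\top & 1 \end{bmatrix}. \tag{32}$$

In (32), as the $j^{\text{th}}$ row in the last column of $B\Phi$ corresponds to the dot product of the $j^{\text{th}}$ row of $B$, $B_j$, with $\mathbf{1}$,

$$B_j \mathbf{1} = \sum_{i=1}^{k} B_{ji}. \tag{33}$$

Equations (32) and (33) ensure that each row of $B$, except the last, sums to $0$, and that the last row sums to $1$. Now, let $\phi(k+\delta) := [\phi_1(k+\delta), \dots, \phi_{d-1}(k+\delta), 1]$. Therefore,

$$\sum_{i=1}^{k} \big[\phi(k+\delta) B\big]_i = \sum_{i=1}^{k} \sum_{j=1}^{d} \phi_j(k+\delta) B_{ji} = \sum_{j=1}^{d} \phi_j(k+\delta) \sum_{i=1}^{k} B_{ji} = \phi_d(k+\delta) = 1. \tag{41}$$

Therefore, combining (41) with (31),

$$\mathbb{E}\big[\hat{J}_{k+\delta}(\pi)\big] = J(\pi). \qquad ∎$$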
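The key step of the alternate argument, that the forecasting weights $\phi(k+\delta)(\Phi^\top \Phi)^{-1}\Phi^\top$ sum to one whenever the basis contains a constant feature, can be checked with exact arithmetic. The following is a minimal sketch under assumptions: the linear basis $[t, 1]$, the values $k = 6$ and $\delta = 2$, and all helper names are illustrative and not taken from the paper.

```python
from fractions import Fraction


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def inverse(m):
    """Invert a square matrix by Gauss-Jordan elimination (exact with Fractions)."""
    n = len(m)
    aug = [list(row) + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(m)]
    for c in range(n):
        pivot = next(r for r in range(c, n) if aug[r][c] != 0)
        aug[c], aug[pivot] = aug[pivot], aug[c]
        p = aug[c][c]
        aug[c] = [v / p for v in aug[c]]
        for r in range(n):
            if r != c and aug[r][c] != 0:
                f = aug[r][c]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]


def phi(t):
    """Hypothetical basis of the time index: [t, 1], bias feature last."""
    return [Fraction(t), Fraction(1)]


def forecast_weights(k, delta):
    """Row vector phi(k + delta) (Phi^T Phi)^{-1} Phi^T, one weight per past episode."""
    Phi = [phi(i) for i in range(1, k + 1)]
    Phi_t = transpose(Phi)
    B = matmul(inverse(matmul(Phi_t, Phi)), Phi_t)  # shape d x k
    return matmul([phi(k + delta)], B)[0]


weights = forecast_weights(k=6, delta=2)
print(sum(weights))   # the weights sum to exactly 1
print(min(weights))   # the weight on the oldest episode is negative
```

Because the forecast extrapolates beyond the observed time indices, some of the weights are necessarily negative even though they sum to one, so the forecast is not a convex combination of past estimates.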

Theorem 2 (Biased NWIS).

For all $\pi \in \Pi$, the NWIS estimator may be a biased estimator of $J(\pi)$; that is, its expected value does not always equal $J(\pi)$.


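We prove this result using a simple counter-example. Consider the basis function $\phi(\cdot) = [1]$. With this single constant feature, the weighted least-squares fit in NWIS reduces to the average of the observed returns weighted by the importance ratios and normalized by their sum, which is the WIS estimator. Therefore, as WIS is a biased estimator, NWIS is also a biased estimator of $J(\pi)$. ∎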
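The counter-example can be made concrete by exact enumeration on a tiny problem. The sketch below assumes a hypothetical one-step, two-action problem with deterministic rewards and fixed behavior and evaluation policies. It also assumes that NWIS is a weighted least-squares fit whose weights are the importance ratios, so that with the constant basis the normal equation is scalar. None of these numbers or names come from the paper.

```python
from fractions import Fraction
from itertools import product

REWARD = [Fraction(0), Fraction(1)]          # deterministic return per action
BEHAVIOR = [Fraction(1, 2), Fraction(1, 2)]  # policy that collected the data
TARGET = [Fraction(1, 4), Fraction(3, 4)]    # policy being evaluated


def nwis_constant_basis(actions):
    """Weighted least squares with Phi = column of ones and importance weights.

    Solves (Phi^T Lambda Phi) w = Phi^T Lambda G, which is scalar here, and
    forecasts with phi(k + delta) = [1], so the forecast equals w.
    """
    rho = [TARGET[a] / BEHAVIOR[a] for a in actions]
    gram = sum(r * 1 * 1 for r in rho)
    moment = sum(r * 1 * REWARD[a] for r, a in zip(rho, actions))
    return moment / gram  # identical to the WIS estimate


def expected_estimate(k):
    """Exact expectation of the estimate over all k-episode datasets."""
    total = Fraction(0)
    for actions in product(range(2), repeat=k):
        prob = Fraction(1)
        for a in actions:
            prob *= BEHAVIOR[a]
        total += prob * nwis_constant_basis(actions)
    return total


true_value = sum(p * r for p, r in zip(TARGET, REWARD))
for k in (1, 2, 3):
    print(k, expected_estimate(k), true_value)
```

In this example the expected estimate differs from the true value for small $k$, which is the bias the theorem refers to, and it moves toward the true value as $k$ grows, in line with the consistency result of the next subsection.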

a.2 Large Sample Properties

In this subsection, large sample properties of NIS and NWIS are presented. Specifically, it is established that both NIS and NWIS are consistent estimators of $J(\pi)$, the performance of a policy $\pi$ in a stationary MDP.

Theorem 3 (Consistent NIS).

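For all $\pi \in \Pi$, the NIS estimator $\hat{J}_{k+\delta}(\pi)$ is a consistent estimator of $J(\pi)$, that is, $\hat{J}_{k+\delta}(\pi) \to J(\pi)$ as $k \to \infty$.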


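Using (8),

$$\hat{J}_{k+\delta}(\pi) = \phi(k+\delta)(\Phi^\top \Phi)^{-1}\Phi^\top Y.$$

As the PDIS estimator is unbiased and the MDP is stationary, each element of $Y$ is an unbiased estimate of $J(\pi)$. In other words, $Y_i = J(\pi) + \epsilon_i$, where $\epsilon_i$ is a mean-zero error. Let $\epsilon$ be the vector containing all the error terms $\epsilon_i$. Now using (19), which gives $Y = \Phi w^* + \epsilon$ for the optimal model parameter $w^*$,

$$\hat{J}_{k+\delta}(\pi) = \phi(k+\delta)(\Phi^\top \Phi)^{-1}\Phi^\top (\Phi w^* + \epsilon) = \phi(k+\delta)\, w^* + \phi(k+\delta)(\Phi^\top \Phi)^{-1}\Phi^\top \epsilon = J(\pi) + \phi(k+\delta)\left(\frac{\Phi^\top \Phi}{k}\right)^{-1}\left(\frac{\Phi^\top \epsilon}{k}\right).$$

Using Slutsky's Theorem,

$$\operatorname*{plim}_{k \to \infty} \hat{J}_{k+\delta}(\pi) = J(\pi) + \phi(k+\delta)\, Q^{-1} \left(\operatorname*{plim}_{k \to \infty} \frac{\Phi^\top \epsilon}{k}\right), \tag{54}$$

where $Q := \lim_{k \to \infty} \frac{\Phi^\top \Phi}{k}$, and it holds from Grenander's conditions that $Q^{-1}$ exists. Informally, Grenander's conditions require that no feature degenerates to a sequence of zeros, no feature of a single observation dominates the sum of squares of its series, and the matrix $\Phi^\top \Phi$ always has full rank. These conditions are easily satisfied for most popular basis functions used to create input features. For formal definitions of these conditions, we refer the reader to Chapter 5 of Greene (2003).

In the following, we restrict our focus to the term inside the brackets in the second term of (54). The mean of that term is

$$\mathbb{E}\left[\frac{\Phi^\top \epsilon}{k}\right] = \frac{1}{k}\Phi^\top \mathbb{E}[\epsilon] = 0.$$

Since the mean is $0$, the variance of that term is given by

$$\mathbb{V}\left[\frac{\Phi^\top \epsilon}{k}\right] = \mathbb{E}\left[\left(\frac{\Phi^\top \epsilon}{k}\right)\left(\frac{\Phi^\top \epsilon}{k}\right)^\top\right] = \frac{1}{k}\left(\frac{\Phi^\top\, \mathbb{E}[\epsilon \epsilon^\top]\, \Phi}{k}\right).$$

As each policy has a non-zero probability of taking any action in any state, the variance of the PDIS (or the standard IS) estimator is bounded and thus each element of $\mathbb{E}[\epsilon \epsilon^\top]$ is bounded. Further, as $\phi$ is bounded, each element of $\frac{1}{k}\Phi^\top\, \mathbb{E}[\epsilon \epsilon^\top]\, \Phi$ is also bounded. Therefore, as $k \to \infty$,

$$\mathbb{V}\left[\frac{\Phi^\top \epsilon}{k}\right] \to 0.$$

As the mean is $0$ and the variance asymptotes to $0$, $\frac{\Phi^\top \epsilon}{k}$ converges in probability to $0$ as $k \to \infty$. Combining this with (54),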