1 Introduction
The flexibility offered by Markov decision processes (MDPs) to model a large class of problems, combined with modern reinforcement learning (RL) algorithms, has provided several success stories in various domains (Mnih et al., 2013; Silver et al., 2017; Sutton and Barto, 2018). However, real-world applications of reinforcement learning are still lacking (Dulac-Arnold et al., 2019).
One reason for the lack of real-world applications is that most existing methods implicitly assume that the environment remains stationary over time. This assumption is often violated in practical problems of interest. For example, consider an assistive driving system. Over time, tires suffer from wear and tear, changing the friction and thus the system dynamics. Similarly, in almost all human-computer interaction applications, e.g., automated medical care, dialogue systems, and marketing, human behavior changes over time. In such scenarios, if the automated system is not adapted to take such changes into account, or if it is adapted only after observing such changes, then the system might quickly become suboptimal, incurring severe loss (Moore et al., 2014). This raises our main question: how do we build systems that proactively search for a policy that will be good for the future MDP?
In this paper we present a policy gradient based approach to search for a policy that maximizes forecasted future performance when the environment is non-stationary. To capture the impact of changes in the environment on a policy's performance, first, the performance of each policy of interest over the past episodes is estimated using counterfactual reasoning. Subsequently, a regression curve is fit over these estimates to model the performance trend of each policy over time, thereby enabling the prediction of future performance. This performance forecast is then used to develop an efficient gradient-based optimization procedure that can proactively search for a policy that will perform well in the future. The proposed method has several key advantages:

It does not require modeling the transition or reward function in a non-stationary environment, and thus scales gracefully with respect to the number of states and actions in the environment.

It is data-efficient in that it leverages all available data.

For ease of prediction, it concisely models the effect of changes in the environment on a policy's performance using a univariate time-series.

It mitigates performance lag by proactively optimizing performance for episodes in both the immediate and near future.

It degenerates to an estimator of the ordinary policy gradient if the system is stationary, meaning that there is little reason not to use our approach if there is a possibility that the system might be non-stationary.
As a passing remark, we note that even when a reinforcement learning agent is being trained on a stationary environment, the observed transition tuples come from a 'non-stationary' distribution. This is due to the changing state distribution induced by updates to the policy parameters over the course of training. While such non-stationarity exists in our setup as well, it is not the focus of this work. Here, 'non-stationarity' refers to the transition dynamics and reward function of an environment changing across episodes, as described further in Section 3.
2 Notation
An MDP is a tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma, d_0)$, where $\mathcal{S}$ is the set of possible states, $\mathcal{A}$ is the set of actions, $P$ is the transition function, $R$ is the reward function, $\gamma \in [0, 1)$ is the discount factor, and $d_0$ is the start-state distribution. Let $R(s, a)$ denote the expected reward of taking action $a$ in state $s$. For any given set $\mathcal{X}$, we use $\Delta(\mathcal{X})$ to denote the set of distributions over $\mathcal{X}$. A policy $\pi$ is a distribution over the actions conditioned on the state. When $\pi$ is parameterized using $\theta$, we denote it as $\pi_\theta$. In a non-stationary setting, as the MDP changes over time, we use $M_k$ to denote the MDP during episode $k$. In general, we will use a subscript $k$ to denote the episode number and a superscript $t$ to denote the time step within an episode. $S_k^t$, $A_k^t$, and $R_k^t$ are the random variables corresponding to the state, action, and reward at time step $t$ in episode $k$. Let $H_k$ denote a trajectory in episode $k$: $H_k := (S_k^0, A_k^0, R_k^0, S_k^1, \dots, S_k^T)$, where $T$ is the finite horizon. The value function evaluated at state $s$, during episode $k$, under a policy $\pi$ is $v_k^{\pi}(s) := \mathbb{E}[\sum_{t=0}^{T} \gamma^t R_k^t \mid S_k^0 = s, \pi]$, where conditioning on $\pi$ denotes that the trajectory in episode $k$ was sampled using $\pi$. The start-state objective for a policy $\pi$, in episode $k$, is defined as $J_k(\pi) := \mathbb{E}[\sum_{t=0}^{T} \gamma^t R_k^t \mid \pi]$. Let $J_k^*$ be the performance of an optimal policy for $M_k$. Often we write $J_k(\theta)$ in place of $J_k(\pi_\theta)$ when the dependence on $\theta$ is important.
3 Problem Statement
To model non-stationarity, we let an exogenous process change the MDP from $M_k$ to $M_{k+1}$, i.e., between episodes. Let $\{M_k\}_{k=1}^{\infty}$ represent a sequence of MDPs, where each MDP $M_k$ is denoted by the tuple $(\mathcal{S}, \mathcal{A}, P_k, R_k, \gamma, d_0)$.
In many problems, like adapting to friction in robotics, human-machine interaction, etc., the transition dynamics and reward function change, but every other aspect of the MDP remains the same throughout. Therefore, we make the following assumption.
Assumption 1 (Similar MDPs).
For any two MDPs $M_k$ and $M_{k+1}$ in the sequence, the state set $\mathcal{S}$, the action set $\mathcal{A}$, the start-state distribution $d_0$, and the discount factor $\gamma$ are the same.
If the exogenous process changing the MDPs is arbitrary and changes them in unreasonable ways, then there is little hope of finding a good policy for the future MDP, as $M_{k+1}$ can be wildly different from everything the agent has observed by interacting with the past MDPs $M_1, \dots, M_k$. However, in many practical problems of interest, such changes are smooth and have an underlying (unknown) structure. To make the problem tractable, we therefore assume that both the transition dynamics $P_k$ and the reward functions $R_k$ vary smoothly over time.
Assumption 2 (Smooth Changes).
There exist (unknown and small) constants $\epsilon_P$ and $\epsilon_R$ such that between all successive MDPs $M_k$ and $M_{k+1}$, for all $s \in \mathcal{S}$ and $a \in \mathcal{A}$,
(1) $\sum_{s' \in \mathcal{S}} \left| P_{k+1}(s' \mid s, a) - P_k(s' \mid s, a) \right| \leq \epsilon_P,$
(2) $\left| R_{k+1}(s, a) - R_k(s, a) \right| \leq \epsilon_R.$
Problem Statement. Under Assumptions 1 and 2, we seek to find a sequence of policies $\{\pi_k\}$ that minimizes lifelong regret: $\sum_{k=1}^{K} \left( J_k^* - J_k(\pi_k) \right)$.
4 Related Work
The problem of non-stationarity has a long history that we cannot thoroughly review here. We briefly touch upon the most relevant work and defer a more detailed literature review to the appendix.
Perhaps the work most closely related to ours is that of Al-Shedivat et al. (2017). They consider a setting where an agent is required to solve test tasks that have different transition dynamics than the training tasks. Using meta-learning, they aim to use the training tasks to find an initialization vector for the policy parameters that can be quickly fine-tuned when facing tasks in the test set. In many real-world problems, however, access to such independent training tasks may not be available a priori. In this work, we are interested in the continually changing setting where there is no boundary between training and testing tasks. As such, we show how their proposed online adaptation technique, which fine-tunes parameters by discarding past data and only using samples observed online, can create performance lag and can therefore be data-inefficient. In settings where training and testing tasks do exist, our method can be leveraged to better adapt during test time, starting from any desired parameter vector.

Recent work by Finn et al. (2019) aims at bridging both the continuously changing setting and the train-test setting for supervised-learning problems. They propose continuously improving an underlying parameter initialization vector and running a Follow-The-Leader (FTL) algorithm (Shalev-Shwartz et al., 2012) every time new data is observed. A naive adaptation of this for RL would require access to all the underlying past MDPs for continuously updating the initialization vector, which would be impractical. Doing this efficiently remains an open question, and our method is complementary to the choice of initialization vector. Additionally, FTL-based adaptation always lags in tracking optimal performance, as it uniformly maximizes performance over all the past samples, which might not be directly related to the future. Further, we show that by explicitly capturing the trend in the non-stationarity, we can mitigate the performance lag resulting from the use of an FTL algorithm during adaptation.

The problem of adapting to non-stationarity is also related to continual learning (Ring, 1994), lifelong learning (Thrun, 1998), and meta-learning (Schmidhuber, 1999). Several meta-learning based approaches for fine-tuning a (mixture of) trained model(s), using samples observed during a similar task at test time, have been proposed (Nagabandi et al., 2018a, b). Other works have shown how models of the environment can be used for continual learning (Lu et al., 2019) or along with model predictive control (Wagener et al., 2019). We focus on the model-free paradigm, and our approach is complementary to these model-based methods.
More importantly, in many real-world applications, it can be infeasible to update the system frequently if doing so involves high computational or monetary expense. In such cases, even optimizing for the immediate future might be greedy and suboptimal. The system should optimize for a longer term in the future, to compensate for the time until the next update is performed. None of the prior approaches can efficiently tackle this problem.
5 Optimizing for the Future
The problem of minimizing lifelong regret would be straightforward if the agent had access to sufficient samples, in advance, from the future environment $M_{k+1}$ that it is going to face (where $k$ denotes the current episode number). That is, if we could estimate the start-state objective $J_{k+1}(\pi)$ for the future MDP $M_{k+1}$, then we could search for a policy whose performance is close to $J_{k+1}^*$. However, obtaining even a single sample from the future is impossible, let alone a sufficient number of samples. This necessitates rethinking the optimization paradigm for searching for a policy that performs well when faced with the future, unknown MDP. There are two immediate challenges here:

How can we estimate $J_{k+1}(\pi)$ without any samples from $M_{k+1}$?

How can gradients, $\mathrm{d} J_{k+1}(\pi)/\mathrm{d}\theta$, of this future performance be estimated?
In this section we address both of these issues using the following idea. When the transition dynamics $P_k$ and the reward functions $R_k$ change smoothly (Assumption 2), the performance $J_k(\pi)$ of any policy $\pi$ also varies smoothly over time. The impact of smooth changes in the environment thus manifests as smooth changes in the performance of any policy. In cases where there is an underlying, unknown structure in the changes of the environment, one can now ask: if the performances of $\pi$ over the course of past episodes were known, can we analyze the trend in those past performances to find a policy that maximizes the future performance $J_{k+1}(\pi)$?
5.1 Forecasting Future Performance
In this section we address the first challenge of estimating future performance and pose it as a time series forecasting problem.
Broadly, this requires two components: (a) a procedure to compute estimates of the past performances $J_1(\pi), \dots, J_k(\pi)$ of $\pi$, and (b) a procedure to create an estimate, $\hat{J}_{k+1}(\pi)$, of $\pi$'s future performance $J_{k+1}(\pi)$ using the estimates from the first component. An illustration of this idea is provided in Figure 1.
Component (a). As we do not have access to the past MDPs for computing the true values of the past performances $J_1(\pi), \dots, J_k(\pi)$, we propose computing estimates $\hat{J}_1(\pi), \dots, \hat{J}_k(\pi)$ of them from the observed data. That is, in a non-stationary MDP, we want to estimate the performance of a given policy $\pi$ in each past episode $k$. Leveraging the fact that the changes to the underlying MDP are due to an exogenous process, we can express $J_k(\pi)$ as,
(3) $J_k(\pi) = \mathbb{E}\left[ \sum_{t=0}^{T} \gamma^t R_k^t \,\middle|\, \pi \right],$
where $P_k$ and $R_k$ are also random variables. Next we describe how an estimate of $J_k(\pi)$ can be obtained from (3) using information only from the $k^{\text{th}}$ episode.
To get an unbiased estimate $\hat{J}_k(\pi)$ of $\pi$'s performance during episode $k$, consider the past trajectory $H_k$ of the $k^{\text{th}}$ episode, which was observed when executing a policy $\beta_k$. By using counterfactual reasoning (Rosenbaum and Rubin, 1983) and leveraging the per-decision importance sampling (PDIS) estimator (Precup, 2000), an unbiased estimate of $J_k(\pi)$ is given by:
(4) $\hat{J}_k(\pi) := \sum_{t=0}^{T} \left( \prod_{l=0}^{t} \frac{\pi(A_k^l \mid S_k^l)}{\beta_k(A_k^l \mid S_k^l)} \right) \gamma^t R_k^t.$
(We assume that the distribution induced by $\beta_k$ has full support over the set of all possible trajectories of the MDP $M_k$.)
It is worth noting that computing (4) does not require storing all the past policies $\beta_1, \dots, \beta_k$; one needs to only store the actions and the probabilities with which those actions were chosen.
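As a concrete sketch, the PDIS estimate in (4) can be computed from a single logged trajectory using only the stored action probabilities. The function below is a minimal NumPy illustration with hypothetical names, not the paper's implementation:

```python
import numpy as np

def pdis_estimate(probs_target, probs_behavior, rewards, gamma=0.99):
    """Per-decision importance sampling (PDIS) estimate of a policy's
    performance on one logged trajectory.

    probs_target:   pi(A_t | S_t) under the policy being evaluated
    probs_behavior: beta(A_t | S_t) under the policy that generated the data
    rewards:        observed rewards R_t
    """
    pt = np.asarray(probs_target, dtype=float)
    pb = np.asarray(probs_behavior, dtype=float)
    r = np.asarray(rewards, dtype=float)
    # Cumulative importance ratio up to step t: prod_{l <= t} pi / beta.
    rho = np.cumprod(pt / pb)
    discounts = gamma ** np.arange(len(r))
    return float(np.sum(rho * discounts * r))
```

When the evaluated policy equals the behavior policy, every cumulative ratio is 1 and the estimate reduces to the ordinary discounted return.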
Component (b). To obtain the second component, which captures the structure in $\hat{J}_1(\pi), \dots, \hat{J}_k(\pi)$ and predicts future performances, we make use of a forecasting function $\Psi$ that estimates future performance conditioned on the past performances:
(5) $\hat{J}_{k+1}(\pi) := \Psi\left( \hat{J}_1(\pi), \dots, \hat{J}_k(\pi) \right).$
While $\Psi$ can be any forecasting function, we consider $\Psi$ to be an ordinary least-squares (OLS) regression model with parameters $w \in \mathbb{R}^{d}$, and the following input and output variables,
(6) $X := [1, 2, \dots, k]^\top,$
(7) $Y := [\hat{J}_1(\pi), \hat{J}_2(\pi), \dots, \hat{J}_k(\pi)]^\top.$
For any $x \in X$, let $\phi(x) \in \mathbb{R}^{1 \times d}$ denote a $d$-dimensional basis function for encoding the time index, for example, an identity basis $\phi(x) := [1, x]$, or a Fourier basis consisting of sine and cosine features of the time index.
Let $\Phi \in \mathbb{R}^{k \times d}$ be the corresponding basis matrix. The solution to the above least-squares problem is $w = (\Phi^\top \Phi)^{-1} \Phi^\top Y$ (Bishop, 2006), and the forecast of the future performance can be obtained using,
(8) $\hat{J}_{k+1}(\pi) = \phi(k+1)\, w.$
This procedure enjoys an important advantage: by just using a univariate time-series to estimate future performance, it bypasses the need for modeling the environment, which can be prohibitively hard or even impossible. Further, note that $\Phi^\top \Phi \in \mathbb{R}^{d \times d}$, where typically $d \ll k$, and thus the cost of computing the matrix inverse is negligible. These advantages allow the procedure to scale gracefully to more challenging problems, while being robust to the sizes of the state set $\mathcal{S}$ and the action set $\mathcal{A}$.
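Components (a) and (b) combine in a few lines. The sketch below fits OLS over the past performance estimates using an identity (linear-trend) basis and extrapolates delta episodes ahead; the function names and basis choice are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def identity_basis(x):
    # phi(x) = [1, x]: encodes a linear trend in the time index.
    return np.array([1.0, float(x)])

def forecast_performance(past_estimates, basis=identity_basis, delta=1):
    """Fit OLS over the estimated past performances J_hat_1..J_hat_k and
    forecast the performance `delta` episodes into the future."""
    k = len(past_estimates)
    Phi = np.stack([basis(i) for i in range(1, k + 1)])  # k x d basis matrix
    Y = np.asarray(past_estimates, dtype=float)
    # w = (Phi^T Phi)^{-1} Phi^T Y, solved without forming the inverse.
    w = np.linalg.solve(Phi.T @ Phi, Phi.T @ Y)
    return float(basis(k + delta) @ w)
```

On a linearly rising series such as [1, 2, 3, 4, 5], the one-step forecast extrapolates the trend to 6 rather than averaging the history, which is exactly the proactive behavior described above.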
5.2 Differentiating Forecasted Future Performance
In the previous section, we addressed the first challenge and showed how to proactively estimate the future performance $\hat{J}_{k+1}(\pi)$ of a policy by explicitly modeling the trend in its past performances $\hat{J}_1(\pi), \dots, \hat{J}_k(\pi)$. In this section, we address the second challenge to facilitate a complete optimization procedure. A pictorial illustration of the idea is provided in Figure 2.
Gradients of $\hat{J}_{k+1}(\pi)$ with respect to the policy parameters $\theta$ can be obtained as follows,
(9) $\frac{\mathrm{d} \hat{J}_{k+1}(\pi)}{\mathrm{d} \theta} = \frac{\mathrm{d}}{\mathrm{d} \theta} \Psi\left( \hat{J}_1(\pi), \dots, \hat{J}_k(\pi) \right)$
(10) $= \sum_{i=1}^{k} \underbrace{\frac{\partial \hat{J}_{k+1}(\pi)}{\partial \hat{J}_i(\pi)}}_{\text{(a)}} \; \underbrace{\frac{\mathrm{d} \hat{J}_i(\pi)}{\mathrm{d} \theta}}_{\text{(b)}}.$
The decomposition in (10) has an elegant, intuitive interpretation. The terms labeled (a) in (10) correspond to how the future prediction would change as a function of past outcomes, and the terms labeled (b) indicate how the past outcomes would change due to changes in the parameters of the policy $\pi$. In the next paragraphs, we discuss how to obtain the terms (a) and (b).
To obtain term (a), note that in (8), $\hat{J}_i(\pi)$ corresponds to the $i^{\text{th}}$ element of $Y$, and so using (5) the gradients of the terms (a) in (10) are,
(11) $\frac{\partial \hat{J}_{k+1}(\pi)}{\partial \hat{J}_i(\pi)} = \frac{\partial}{\partial \hat{J}_i(\pi)} \phi(k+1) (\Phi^\top \Phi)^{-1} \Phi^\top Y$
(12) $= \left[ \phi(k+1) (\Phi^\top \Phi)^{-1} \Phi^\top \right]_i,$
where $[V]_i$ represents the $i^{\text{th}}$ element of a vector $V$. Therefore, (12) is the gradient of the predicted future performance with respect to an estimated past performance.
The terms labeled (b) in (10) correspond to the gradients of the PDIS estimates of past performance with respect to the policy parameters $\theta$. The following property provides a form for (b) that makes its computation straightforward.
Property 1 (PDIS gradient).
Let $\rho_k^{0:t} := \prod_{l=0}^{t} \frac{\pi(A_k^l \mid S_k^l)}{\beta_k(A_k^l \mid S_k^l)}$. Then,
(13) $\frac{\mathrm{d} \hat{J}_k(\pi)}{\mathrm{d} \theta} = \sum_{t=0}^{T} \rho_k^{0:t} \, \gamma^t R_k^t \left( \sum_{l=0}^{t} \frac{\partial \log \pi(A_k^l \mid S_k^l)}{\partial \theta} \right).$
Proof.
See Appendix B. ∎
5.3 Algorithm
We provide a sketch of our proposed Prognosticator procedure for optimizing the future performance of the policy in Algorithm 1. To make the method more practical, we incorporate two additional modifications to reduce computational cost and variance.
First, it is often desirable to perform an update only after every $K$ episodes, to reduce computational cost. This raises the question: if a newly found policy will be executed for the next $K$ episodes, should we choose this new policy to maximize performance on just the single next episode, or to maximize the average performance over the next $K$ episodes? An advantage of our proposed method is that we can easily tune how far in the future we want to optimize for. Thus, to minimize lifelong regret, we propose optimizing for the mean performance over the next $K$ episodes, that is, $\operatorname*{argmax}_{\theta} \frac{1}{K} \sum_{\delta=1}^{K} \hat{J}_{k+\delta}(\pi_\theta)$.
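Under a least-squares forecaster, optimizing for the mean forecast over the next K episodes only changes the extrapolation points. A sketch with an identity (linear-trend) basis, using a hypothetical helper name:

```python
import numpy as np

def mean_future_forecast(past_estimates, K):
    """Average forecasted performance over the next K episodes,
    (1/K) * sum_{delta=1..K} J_hat_{k+delta}, with a linear-trend basis."""
    k = len(past_estimates)
    Phi = np.stack([[1.0, float(i)] for i in range(1, k + 1)])
    Y = np.asarray(past_estimates, dtype=float)
    w = np.linalg.solve(Phi.T @ Phi, Phi.T @ Y)
    # Evaluate the fitted trend at k+1, ..., k+K and average.
    forecasts = [np.array([1.0, float(k + d)]) @ w for d in range(1, K + 1)]
    return float(np.mean(forecasts))
```

Because the forecaster is linear in the basis, averaging the forecasts is the same as forecasting at the average extrapolation point, so the computational overhead of looking K steps ahead is negligible.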
Second, notice that if the policy becomes too close to deterministic, there are two undesired consequences. (a) The policy will not explore, precluding the agent from observing any changes to the environment in states that it no longer visits, even though such changes might make entering those states worthwhile. (b) In future episodes, estimating $\hat{J}_k(\pi)$ for a new policy $\pi$ via importance sampling will have high variance if the policy $\beta_k$ executed during episode $k$ was close to deterministic. To mitigate these issues, we add an entropy regularizer during policy optimization. More details are available in Appendix D.
6 Understanding the Behavior of Prognosticator
Notice that as the scalar term (a) is multiplied by the PDIS gradient term (b) in (10), the gradient of future performance can be viewed as a weighted sum of off-policy policy gradients. In Figure 3, we visualize the weights on the PDIS gradients of each episode $i$ when the performance for episode $k+1$ is forecasted using data from the past $k$ episodes. For the specific setting where $\Psi$ is an OLS estimator, these weights are independent of $Y$ in (12), and their pattern remains constant for any given sequence of MDPs.
Importantly, note the occurrence of negative weights in Figure 3 when the identity basis or Fourier basis is used, suggesting that the optimization procedure should move towards a policy that had lower performance in some of the past episodes. While this negative weighting seems unusual at first glance, it has an intriguing interpretation. We discuss this in the following paragraphs.
To better understand these negative weights, consider a qualitative comparison when the weights from the different methods in Figure 3 are used along with the performance estimates of the two policies in Figure 1. Despite having lower estimated returns everywhere, the rising trend of one policy suggests that it might have higher performance in the future. Existing online learning methods like FTL maximize performance on all the past data uniformly (green curve in Figure 3). Similarly, the exponential weights (red curve in Figure 3) are representative of approaches that only optimize using data from recent episodes and discard older data. Neither of these methods, which use only non-negative weights, can capture such a trend in its forecast. However, the weights obtained when using the identity basis facilitate minimizing performance in the distant past and maximizing performance in the recent past. Intuitively, this means the procedure moves towards a policy whose performance is on a linear rise, as it expects that policy to have better performance in the future.
While weights from the identity basis are useful when the trend is linear, it cannot be expected that the trend will always be linear as in Figure 1. To be more flexible and allow for any smooth trend, we opt to use the Fourier basis in our experiments. Observe the alternating sign of the weights in Figure 3 when using the Fourier basis. This suggests that the optimization procedure takes into account the sequential differences in performances over the past, thereby favoring the policy that has shown the most performance increments in the past, without restricting its performance trend to be linear.
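These episode weights are exactly the row vector $\phi(k+1)(\Phi^\top \Phi)^{-1}\Phi^\top$ from term (a), and for an OLS forecaster they can be computed once, independently of the observed returns. A small sketch (illustrative names) reproduces the pattern described above for the identity basis:

```python
import numpy as np

def forecast_weights(k, basis, delta=1):
    """Row vector phi(k+delta) (Phi^T Phi)^{-1} Phi^T: the weight placed on
    each of the k past episodes' PDIS gradients when forecasting delta steps
    ahead. For an OLS forecaster these weights do not depend on the returns."""
    Phi = np.stack([basis(i) for i in range(1, k + 1)])
    return basis(k + delta) @ np.linalg.inv(Phi.T @ Phi) @ Phi.T

identity = lambda x: np.array([1.0, float(x)])
w = forecast_weights(10, identity)
```

For k = 10 past episodes with the identity basis, the resulting weights are negative on the distant past, positive on the recent past, and sum to one, matching the intuition that the forecaster extrapolates a rising trend rather than averaging history.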
7 Mitigating Variance
While model-free algorithms for finding a good policy are scalable to large problems, they tend to suffer from high variance (Greensmith et al., 2004). In particular, the use of importance sampling estimators can increase the variance further (Guo et al., 2017). In our setup, high variance in the estimates of past performances of $\pi$ can hinder capturing $\pi$'s performance trend, thereby making the forecasts less reliable.
Notice that a major source of variance is the availability of only a single trajectory sample per MDP $M_k$, for all $k$. If the trajectory $H_k$ generated using $\beta_k$ is likely under $\beta_k$ but has near-zero probability under $\pi$, then the estimate $\hat{J}_k(\pi)$ is also nearly zero. While $\hat{J}_k(\pi)$ is an unbiased estimate of $J_k(\pi)$, the information provided by this single $H_k$ is of little use for evaluating $\pi$. Subsequently, discarding this $\hat{J}_k(\pi)$ from the time-series analysis, rather than setting it to be nearly zero, can make the forecast more robust to outliers. In comparison, if the trajectory $H_k$ is unlikely under $\beta_k$ but likely under $\pi$, then not only is $H_k$ very useful for estimating $J_k(\pi)$, but it also has a lower chance of occurring in the future, so this trajectory must be emphasized when making a forecast. Such a process of (de-)emphasizing estimates of past returns using the collected data itself can introduce bias, but this bias might be beneficial in this few-sample regime.

To capture this idea formally, we build upon the insights of Hachiya et al. (2012) and Mahmood et al. (2014), who draw an equivalence between weighted least-squares (WLS) estimation and the weighted importance sampling (WIS) estimator (Precup, 2000). In particular, let $G_i$ be the discounted return of the $i^{\text{th}}$ trajectory observed from a stationary MDP, and let $\rho_i$ be the importance ratio of the entire trajectory. The WIS estimator of the performance of $\pi$ in a stationary MDP can then be obtained as,
(14) $J_{\text{WIS}} := \frac{\sum_{i=1}^{k} \rho_i G_i}{\sum_{i=1}^{k} \rho_i}.$
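For reference, the WIS estimator in (14) normalizes the full-trajectory ratios before averaging; a minimal sketch under the stated notation:

```python
import numpy as np

def wis_estimate(rhos, returns):
    """Weighted importance sampling: normalize the full-trajectory
    importance ratios so they sum to one, then average the returns."""
    rhos = np.asarray(rhos, dtype=float)
    G = np.asarray(returns, dtype=float)
    return float(np.sum(rhos * G) / np.sum(rhos))
```

Unlike ordinary importance sampling, the WIS estimate always lies within the range of the observed returns, trading a small bias for substantially lower variance.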
To mitigate variance in our setup, we propose extending WIS. In the non-stationary setting, to perform WIS while capturing the trend in performance over time, we use a modified forecasting function $\Psi$, which is a weighted least-squares regression model with a $d$-dimensional basis function $\phi$ and parameters $w$,
(15) $w := \operatorname*{argmin}_{w'} \sum_{i=1}^{k} \rho_i \left( \phi(i)\, w' - G_i \right)^2.$
Let $\Lambda \in \mathbb{R}^{k \times k}$ be a diagonal weight matrix such that $\Lambda_{ii} := \rho_i$, let $\Phi$ be the basis matrix, and let the following be the input and output variables,
(16) $X := [1, 2, \dots, k]^\top,$
(17) $Y := [G_1, G_2, \dots, G_k]^\top.$
The solution to the weighted least-squares problem in (15) is then given by $w = (\Phi^\top \Lambda \Phi)^{-1} \Phi^\top \Lambda Y$, and the forecast of the future performance can be obtained using $\hat{J}_{k+1}(\pi) = \phi(k+1)\, w$.
This modified $\Psi$ has several desired properties. It incorporates a notion of how relevant each observed trajectory is for forecasting, while also capturing the desired trend in performance. The forecasts are less sensitive to the variance of importance sampling, and the entire process is still fully differentiable.
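The weighted forecaster differs from the OLS one only through the diagonal weight matrix $\Lambda$. A sketch under the definitions above (illustrative names; the identity basis used in the test is an assumption):

```python
import numpy as np

def wls_forecast(returns, rhos, basis, delta=1):
    """Weighted least-squares forecast of future performance: weight each
    past observation by its full-trajectory importance ratio rho_i
    (the diagonal of Lambda), fit the trend, then extrapolate."""
    k = len(returns)
    Phi = np.stack([basis(i) for i in range(1, k + 1)])
    Lam = np.diag(np.asarray(rhos, dtype=float))
    Y = np.asarray(returns, dtype=float)
    # w = (Phi^T Lam Phi)^{-1} Phi^T Lam Y
    w = np.linalg.solve(Phi.T @ Lam @ Phi, Phi.T @ Lam @ Y)
    return float(basis(k + delta) @ w)

identity = lambda x: np.array([1.0, float(x)])
```

With all $\rho_i$ equal, the forecast reduces to the OLS forecast, and with a constant basis $\phi(x) = [1]$ it reduces to the WIS estimate in (14). Trajectories with near-zero $\rho_i$ contribute almost nothing to the fit, implementing the de-emphasis of uninformative samples discussed above.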
8 Strictly Generalizing the Stationary Setting
As the agent is unaware of how the environment is changing, a natural question to ask is: what if the agent wrongly assumed that a stationary environment was non-stationary? What is the quality of the agent's performance forecasts? What is the impact of the negative weights on past evaluations of a policy's performance? Here we answer these questions.
Before stating the formal results, we introduce some necessary notation and two assumptions. Let $J(\pi)$ be the performance of policy $\pi$ for a stationary MDP. Let NIS and NWIS denote the non-stationary importance sampling and non-stationary weighted importance sampling estimators of performance $\delta$ episodes in the future, i.e., $\hat{J}_{k+\delta}(\pi)$ computed with the OLS and WLS forecasters, respectively. Further, let the basis function $\phi$ used for encoding the time index in both estimators satisfy the following conditions: (a) $\phi$ always contains $1$ to incorporate a bias/intercept coefficient in least-squares regression (for example, $\phi(x) = [1, f_1(x), \dots, f_{d-1}(x)]$, where each $f_i$ is an arbitrary function), and (b) $\Phi$ has full column rank, so that $(\Phi^\top \Phi)^{-1}$ exists. Both of these properties are trivially satisfied by most basis functions.
With this notation and these assumptions, we then have the following results indicating that NIS is unbiased like ordinary importance sampling and NWIS is biased like weighted importance sampling.
Theorem 1 (Unbiased NIS).
For all $\delta \geq 1$, NIS is an unbiased estimator of $J(\pi)$; that is, $\mathbb{E}[\hat{J}_{k+\delta}(\pi)] = J(\pi)$.
Theorem 2 (Biased NWIS).
For all $\delta \geq 1$, NWIS may be a biased estimator of $J(\pi)$; that is, $\mathbb{E}[\hat{J}_{k+\delta}(\pi)]$ need not equal $J(\pi)$.
Theorem 3 (Consistent NIS).
For all $\delta \geq 1$, NIS is a consistent estimator of $J(\pi)$; that is, $\hat{J}_{k+\delta}(\pi)$ converges to $J(\pi)$ as $k \to \infty$.
Theorem 4 (Consistent NWIS).
For all $\delta \geq 1$, NWIS is a consistent estimator of $J(\pi)$; that is, $\hat{J}_{k+\delta}(\pi)$ converges to $J(\pi)$ as $k \to \infty$.
Proof.
See Appendix A for all of these proofs. ∎
Since NWIS is biased and consistent like the WIS estimator, we expect it to have similar variance reduction properties that can potentially make the optimization process more efficient in a nonstationary MDP.
9 Empirical Analysis
This section presents empirical evaluations using several environments inspired by real-world applications that exhibit non-stationarity. In the following paragraphs, we briefly discuss each environment; a more detailed description is available in Appendix D.
Non-stationary Diabetes Treatment: This environment is based on an open-source implementation (Xie, 2019) of the FDA-approved Type-1 Diabetes Mellitus simulator (T1DMS) (Man et al., 2014) for the treatment of Type-1 diabetes. Each episode consists of a day in an in-silico patient's life. Consumption of a meal increases the blood-glucose level in the body; if the blood-glucose level becomes too high, the patient suffers from hyperglycemia, and if it becomes too low, the patient suffers from hypoglycemia. The goal is to control the blood-glucose level of a patient by regulating the insulin dosage, to minimize the risk associated with both hyperglycemia and hypoglycemia. However, the insulin sensitivity of a patient's internal body organs varies over time, inducing non-stationarity that should be accounted for. In the T1DMS simulator, we induce this non-stationarity by oscillating the body parameters (e.g., insulin sensitivity, rate of glucose absorption, etc.) between two known configurations available in the simulator.
[Figure: Results for the diabetes treatment environment. Error bars correspond to the standard error. The x-axis represents how fast the environment is changing and the y-axis represents regret (lower is better). Individual learning curves for each speed, for each domain, are available in Appendix E.]
Non-stationary Recommender System: In this environment, a recommender engine interacts with a user whose interests in different items fluctuate over time. In particular, the rewards associated with each item vary in seasonal cycles. The goal is to maximize revenue by recommending the item that the user is most interested in at any given time.
Non-stationary Goal Reacher: This is a 2D environment with four actions (left, right, up, and down) and a continuous state set representing Cartesian coordinates. The goal is to make the agent reach a moving goal position.
For all of the above environments, we regulate the speed of the non-stationarity to characterize an algorithm's ability to adapt. A higher speed corresponds to a greater amount of non-stationarity; a speed of zero indicates that the environment is stationary.
We consider the following algorithms for comparison:
Prognosticator: Two variants of our algorithm, Pro-OLS and Pro-WLS, which use the OLS and WLS estimators, respectively, for $\Psi$.
ONPG: Similar to the adaptation technique presented by Al-Shedivat et al. (2017), this baseline performs purely online optimization by fine-tuning the existing policy using only the trajectory being observed online.
FTRL-PG: Similar to the adaptation technique presented by Finn et al. (2019), this baseline performs follow-the-regularized-leader optimization by maximizing performance over both the current and all past trajectories.
9.1 Results
In the non-stationary recommender system, the exact value of $J_k^*$ is available from the simulator, so we can compute the true value of regret. However, for the non-stationary goal reacher and diabetes treatment environments, $J_k^*$ is not known for any $k$, so we use a surrogate measure of regret: letting the maximum return obtained in episode $k$ by any algorithm stand in for $J_k^*$, we use the difference between it and a policy's return as the surrogate regret.
In the non-stationary recommender system, all the methods perform nearly the same when the environment is stationary. FTRL-PG has a slight edge over ONPG when the environment is stationary, as all the past data is directly indicative of the future MDP. It is interesting to note that while FTRL-PG works best in the stationary settings of the recommender system and the goal reacher task, it is not the best in the diabetes treatment task, as it can suffer from high variance. We discuss the impact of variance in later paragraphs.
As the speed of non-stationarity increases, the performance of both baselines deteriorates quickly. Of the two, ONPG is better able to mitigate performance lag, as it discards all past data. In contrast, both of the proposed methods, Pro-OLS and Pro-WLS, can leverage all the past data to better capture the impact of non-stationarity and are thus consistently more robust to changes in the environment.
In the non-stationary goal reacher environment, a similar trend is observed. While weighting all past data equally is useful for FTRL-PG in the stationary setting, it creates a drastic performance lag as the speed of non-stationarity increases. Between Pro-OLS and Pro-WLS, in the stationary setting, once the agent nearly solves the task, all subsequent trajectories come from nearly the same distribution, so the variance from the importance sampling ratios is not severe. In such cases, where the variance is low, Pro-WLS has less of an advantage over Pro-OLS and additionally suffers from being biased. However, as the non-stationarity increases, the optimal policy keeps changing and there is a higher discrepancy between the distributions of past and current trajectories. This makes the lower-variance property of Pro-WLS particularly useful. Having the ability to better capture the underlying trend, both Pro-OLS and Pro-WLS consistently perform better than the baselines when there is non-stationarity.
The non-stationary diabetes treatment environment is particularly challenging, as it has a continuous action set. This makes importance sampling based estimators subject to much higher variance. Consequently, Pro-OLS is not able to reliably capture the impact of non-stationarity and performs similarly to ONPG. In comparison, the combination of high variance and performance lag makes FTRL-PG perform poorly across all speeds. The most advantageous algorithm in this environment is Pro-WLS. As it is designed to better handle the variance stemming from importance sampling, Pro-WLS is able to efficiently use the past data to capture the underlying trend, and it performs well across all speeds of non-stationarity.
10 Conclusion
We presented a policy gradient based algorithm that combines counterfactual reasoning with curve fitting to proactively search for a good policy for future MDPs. Irrespective of whether the environment is stationary or non-stationary, the proposed method can leverage all the past data, and in non-stationary settings it can proactively optimize for future performance as well. Therefore, our method provides a single solution for mitigating performance lag while remaining data-efficient.
While the proposed algorithm has several desirable properties, many open questions remain. In our experiments, we noticed that the algorithm is particularly sensitive to the value of the entropy regularizer. Setting it too high prevents the policy from adapting quickly; setting it too low lets the policy overfit to the forecast and become close to deterministic, thereby increasing the variance of subsequent importance sampling estimates of policy performance. While we resorted to hyperparameter search, leveraging methods that adapt the entropy regularization automatically might be fruitful (Haarnoja et al., 2018).
Our framework also highlights new research directions for studying bias-variance trade-offs in the non-stationary setting. While tackling the problem from the point of view of a univariate time-series is advantageous, as the model-bias of the environment can be reduced, it can result in higher variance in the forecasted performance. Developing lower-variance off-policy estimators of past performance is also an active research direction that directly complements our algorithm. In particular, a partial model of the environment is often available, and using it through doubly robust estimators (Jiang and Li, 2015; Thomas and Brunskill, 2016) is an interesting future direction.
Further, there are other forecasting functions, like kernel regression, Gaussian processes, and ARIMA, as well as breakpoint detection algorithms, that could be used to incorporate more domain knowledge in the forecasting function $\Psi$, or to make the forecast robust to jumps in the time series.
11 Acknowledgement
Part of this work was done while the first author was an intern at Adobe Research, San Jose. The research was later supported by generous gifts from Adobe Research. We thank Ian Gemp, Scott M. Jordan, and Chris Nota for insightful discussions and for providing valuable feedback.
References

 Abbasi et al. (2013) Y. Abbasi, P. L. Bartlett, V. Kanade, Y. Seldin, and C. Szepesvári. Online learning in Markov decision processes with adversarially chosen transition probability distributions. In Advances in Neural Information Processing Systems, pages 2508–2516, 2013.
 Abdallah and Kaisers (2016) S. Abdallah and M. Kaisers. Addressing environment non-stationarity by repeating Q-learning updates. The Journal of Machine Learning Research, 2016.
 Al-Shedivat et al. (2017) M. Al-Shedivat, T. Bansal, Y. Burda, I. Sutskever, I. Mordatch, and P. Abbeel. Continuous adaptation via meta-learning in nonstationary and competitive environments. arXiv preprint arXiv:1710.03641, 2017.
 Basso and Engel (2009) E. W. Basso and P. M. Engel. Reinforcement learning in nonstationary continuous time and space scenarios. In Artificial Intelligence National Meeting, volume 7, pages 1–8. Citeseer, 2009.
 Bastani (2014) M. Bastani. Model-free intelligent diabetes management using machine learning. M.S. Thesis, University of Alberta, 2014.
 Besbes et al. (2014) O. Besbes, Y. Gur, and A. Zeevi. Stochastic multi-armed-bandit problem with non-stationary rewards. In Advances in Neural Information Processing Systems, pages 199–207, 2014.
 Bishop (2006) C. M. Bishop. Pattern recognition and machine learning. Springer, 2006.
 Bowling (2005) M. Bowling. Convergence and no-regret in multiagent learning. In Advances in Neural Information Processing Systems, pages 209–216, 2005.
 Cheevaprawatdomrong et al. (2007) T. Cheevaprawatdomrong, I. E. Schochetman, R. L. Smith, and A. Garcia. Solution and forecast horizons for infinitehorizon nonhomogeneous Markov decision processes. Mathematics of Operations Research, 32(1):51–72, 2007.
 Cheung et al. (2019) W. C. Cheung, D. SimchiLevi, and R. Zhu. Reinforcement learning under drift. arXiv preprint arXiv:1906.02922, 2019.
 Choi et al. (2000) S. P. Choi, D.Y. Yeung, and N. L. Zhang. An environment model for nonstationary reinforcement learning. In Advances in Neural Information Processing Systems, pages 987–993, 2000.
 Conitzer and Sandholm (2007) V. Conitzer and T. Sandholm. AWESOME: A general multiagent learning algorithm that converges in self-play and learns a best response against stationary opponents. Machine Learning, 67(1-2):23–43, 2007.
 Cuzick (1995) J. Cuzick. A strong law for weighted sums of i.i.d. random variables. Journal of Theoretical Probability, 8(3):625–641, 1995.
 Dulac-Arnold et al. (2019) G. Dulac-Arnold, D. Mankowitz, and T. Hester. Challenges of real-world reinforcement learning. arXiv preprint arXiv:1904.12901, 2019.
 Even-Dar et al. (2005) E. Even-Dar, S. M. Kakade, and Y. Mansour. Experts in a Markov decision process. In Advances in Neural Information Processing Systems, pages 401–408, 2005.
 Finn et al. (2019) C. Finn, A. Rajeswaran, S. Kakade, and S. Levine. Online meta-learning. arXiv preprint arXiv:1902.08438, 2019.
 Foerster et al. (2018) J. Foerster, R. Y. Chen, M. Al-Shedivat, S. Whiteson, P. Abbeel, and I. Mordatch. Learning with opponent-learning awareness. In Proceedings of the 17th International Conference on Autonomous Agents and Multi-Agent Systems, pages 122–130. International Foundation for Autonomous Agents and Multiagent Systems, 2018.
 Foster et al. (2016) D. J. Foster, Z. Li, T. Lykouris, K. Sridharan, and E. Tardos. Learning in games: Robustness of fast convergence. In Advances in Neural Information Processing Systems, pages 4734–4742, 2016.
 Gajane et al. (2018) P. Gajane, R. Ortner, and P. Auer. A sliding-window algorithm for Markov decision processes with arbitrarily changing rewards and transitions. arXiv preprint arXiv:1805.10066, 2018.
 Garcia and Smith (2000) A. Garcia and R. L. Smith. Solving nonstationary infinite horizon dynamic optimization problems. Journal of Mathematical Analysis and Applications, 244(2):304–317, 2000.

 Ghate and Smith (2013) A. Ghate and R. L. Smith. A linear programming approach to nonstationary infinite-horizon Markov decision processes. Operations Research, 61(2):413–425, 2013.
 Greene (2003) W. H. Greene. Econometric analysis. Pearson Education India, 2003.
 Greensmith et al. (2004) E. Greensmith, P. L. Bartlett, and J. Baxter. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(Nov):1471–1530, 2004.
 Guo et al. (2017) Z. Guo, P. S. Thomas, and E. Brunskill. Using options and covariance testing for long horizon off-policy policy evaluation. In Advances in Neural Information Processing Systems, pages 2492–2501, 2017.
 Haarnoja et al. (2018) T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V. Kumar, H. Zhu, A. Gupta, P. Abbeel, et al. Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905, 2018.

 Hachiya et al. (2012) H. Hachiya, M. Sugiyama, and N. Ueda. Importance-weighted least-squares probabilistic classifier for covariate shift adaptation with application to human activity recognition. Neurocomputing, 80:93–101, 2012.
 Hopp et al. (1987) W. J. Hopp, J. C. Bean, and R. L. Smith. A new optimality criterion for nonhomogeneous Markov decision processes. Operations Research, 35(6):875–883, 1987.
 Jacobsen et al. (2019) A. Jacobsen, M. Schlegel, C. Linke, T. Degris, A. White, and M. White. Meta-descent for online, continual prediction. In AAAI Conference on Artificial Intelligence, 2019.
 Jagerman et al. (2019) R. Jagerman, I. Markov, and M. de Rijke. When people change their mind: Off-policy evaluation in non-stationary recommendation environments. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, Melbourne, VIC, Australia, February 11-15, 2019, 2019.
 Jiang and Li (2015) N. Jiang and L. Li. Doubly robust off-policy value evaluation for reinforcement learning. arXiv preprint arXiv:1511.03722, 2015.
 Jong and Stone (2005) N. K. Jong and P. Stone. Bayesian models of nonstationary Markov decision processes. Planning and Learning in A Priori Unknown or Dynamic Domains, page 132, 2005.
 Kearney et al. (2018) A. Kearney, V. Veeriah, J. B. Travnik, R. S. Sutton, and P. M. Pilarski. TIDBD: Adapting temporal-difference step-sizes through stochastic meta-descent. arXiv preprint arXiv:1804.03334, 2018.
 Lecarpentier and Rachelson (2019) E. Lecarpentier and E. Rachelson. Non-stationary Markov decision processes: A worst-case approach using model-based reinforcement learning. arXiv preprint arXiv:1904.10090, 2019.
 Levine et al. (2017) N. Levine, K. Crammer, and S. Mannor. Rotting bandits. In Advances in Neural Information Processing Systems, pages 3074–3083, 2017.
 Li and de Rijke (2019) C. Li and M. de Rijke. Cascading nonstationary bandits: Online learning to rank in the nonstationary cascade model. arXiv preprint arXiv:1905.12370, 2019.
 Li et al. (2019) Y. Li, A. Zhong, G. Qu, and N. Li. Online Markov decision processes with time-varying transition probabilities and rewards. In Real-world Sequential Decision Making Workshop at ICML 2019, 2019.
 Lu et al. (2019) K. Lu, I. Mordatch, and P. Abbeel. Adaptive online planning for continual lifelong learning. arXiv preprint arXiv:1912.01188, 2019.
 Mahmood et al. (2014) A. R. Mahmood, H. P. van Hasselt, and R. S. Sutton. Weighted importance sampling for off-policy learning with linear function approximation. In Advances in Neural Information Processing Systems, pages 3014–3022, 2014.

 Mahmud and Ramamoorthy (2013) M. Mahmud and S. Ramamoorthy. Learning in nonstationary MDPs as transfer learning. In Proceedings of the 2013 International Conference on Autonomous Agents and Multi-Agent Systems, pages 1259–1260. International Foundation for Autonomous Agents and Multiagent Systems, 2013.
 Man et al. (2014) C. D. Man, F. Micheletto, D. Lv, M. Breton, B. Kovatchev, and C. Cobelli. The UVA/PADOVA type 1 diabetes simulator: New features. Journal of Diabetes Science and Technology, 8(1):26–34, 2014.
 Mealing and Shapiro (2013) R. Mealing and J. L. Shapiro. Opponent modelling by sequence prediction and lookahead in two-player games. In International Conference on Artificial Intelligence and Soft Computing, pages 385–396. Springer, 2013.
 Mnih et al. (2013) V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
 Mohri and Yang (2016) M. Mohri and S. Yang. Accelerating online convex optimization via adaptive prediction. In Artificial Intelligence and Statistics, pages 848–856, 2016.
 Moore et al. (2014) B. L. Moore, L. D. Pyeatt, V. Kulkarni, P. Panousis, K. Padrez, and A. G. Doufas. Reinforcement learning for closedloop propofol anesthesia: A study in human volunteers. The Journal of Machine Learning Research, 15(1):655–696, 2014.
 Moulines (2008) E. Moulines. On upperconfidence bound policies for nonstationary bandit problems. arXiv preprint arXiv:0805.3415, 2008.
 Nagabandi et al. (2018a) A. Nagabandi, I. Clavera, S. Liu, R. S. Fearing, P. Abbeel, S. Levine, and C. Finn. Learning to adapt in dynamic, real-world environments through meta-reinforcement learning. arXiv preprint arXiv:1803.11347, 2018a.
 Nagabandi et al. (2018b) A. Nagabandi, C. Finn, and S. Levine. Deep online learning via meta-learning: Continual adaptation for model-based RL. arXiv preprint arXiv:1812.07671, 2018b.
 Ornik and Topcu (2019) M. Ornik and U. Topcu. Learning and planning for time-varying MDPs using maximum likelihood estimation. arXiv preprint arXiv:1911.12976, 2019.
 Padakandla et al. (2019) S. Padakandla, P. K. J., and S. Bhatnagar. Reinforcement learning in nonstationary environments. CoRR, abs/1905.03970, 2019.
 Precup (2000) D. Precup. Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, 2000.
 Rakhlin and Sridharan (2013) A. Rakhlin and K. Sridharan. Online learning with predictable sequences. arXiv preprint arXiv:1208.3728, 2013.
 Ring (1994) M. B. Ring. Continual learning in reinforcement environments. PhD thesis, University of Texas at Austin, Texas 78712, 1994.
 Rosenbaum and Rubin (1983) P. R. Rosenbaum and D. B. Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55, 1983.
 Schmidhuber (1999) J. Schmidhuber. A general method for incremental self-improvement and multi-agent learning. In Evolutionary Computation: Theory and Applications, pages 81–123. World Scientific, 1999.
 Schulman et al. (2017) J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
 Seznec et al. (2018) J. Seznec, A. Locatelli, A. Carpentier, A. Lazaric, and M. Valko. Rotting bandits are no harder than stochastic ones. arXiv preprint arXiv:1811.11043, 2018.
 Shalev-Shwartz et al. (2012) S. Shalev-Shwartz et al. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2012.
 Silver et al. (2017) D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354–359, 2017.
 Singh et al. (2000) S. Singh, M. Kearns, and Y. Mansour. Nash convergence of gradient dynamics in general-sum games. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, pages 541–548. Morgan Kaufmann Publishers Inc., 2000.
 Sinha and Ghate (2016) S. Sinha and A. Ghate. Policy iteration for robust nonstationary Markov decision processes. Optimization Letters, 10(8):1613–1628, 2016.
 Sutton and Barto (2018) R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT press, 2018.
 Thomas and Brunskill (2016) P. Thomas and E. Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, pages 2139–2148, 2016.
 Thomas (2015) P. S. Thomas. Safe reinforcement learning. PhD thesis, University of Massachusetts Libraries, 2015.
 Thomas et al. (2017) P. S. Thomas, G. Theocharous, M. Ghavamzadeh, I. Durugkar, and E. Brunskill. Predictive off-policy policy evaluation for nonstationary decision problems, with applications to digital marketing. In Twenty-Ninth Innovative Applications of Artificial Intelligence Conference, 2017.
 Thrun (1998) S. Thrun. Lifelong learning algorithms. In Learning to learn, pages 181–209. Springer, 1998.
 Wagener et al. (2019) N. Wagener, C.A. Cheng, J. Sacks, and B. Boots. An online learning approach to model predictive control. arXiv preprint arXiv:1902.08967, 2019.
 Wang et al. (2019a) J.K. Wang, X. Li, and P. Li. Optimistic adaptive acceleration for optimization. arXiv preprint arXiv:1903.01435, 2019a.
 Wang et al. (2019b) L. Wang, H. Zhou, B. Li, L. R. Varshney, and Z. Zhao. Be aware of non-stationarity: Nearly optimal algorithms for piecewise-stationary cascading bandits. arXiv preprint arXiv:1909.05886, 2019b.
 Xie (2019) J. Xie. Simglucose v0.2.1 (2018), 2019. URL https://github.com/jxx123/simglucose.
 Yang and Mohri (2016) S. Yang and M. Mohri. Optimistic bandit convex optimization. In Advances in Neural Information Processing Systems, pages 2297–2305, 2016.

 Yu and Mannor (2009) J. Y. Yu and S. Mannor. Online learning in Markov decision processes with arbitrarily changing rewards and transitions. In 2009 International Conference on Game Theory for Networks, pages 314–322. IEEE, 2009.
 Zhang and Lesser (2010) C. Zhang and V. Lesser. Multi-agent learning with policy prediction. In Twenty-Fourth AAAI Conference on Artificial Intelligence, 2010.
Appendix
Appendix A Properties of NIS and NWIS Estimators
Here we provide proofs of the properties of the NIS and NWIS estimators. While NIS and NWIS are developed for the non-stationary setting, these properties ensure that the estimators strictly generalize to the stationary setting as well. That is, when used in a stationary setting, the NIS estimator is both unbiased and consistent like the PDIS estimator, and the NWIS estimator is biased and consistent like the WIS estimator; when NIS and NWIS are used in a non-stationary setting, they can provide more accurate estimates of future performance because they explicitly model the trend of a policy's performance over time.
Our proof technique draws inspiration from the results presented by Mahmood et al. (2014). The primary difference is that their results are for estimators developed only for the stationary setting, which, unlike ours, cannot be readily used in the non-stationary setting. The key modification we make to leverage their proof technique is that, instead of using the features of a state as the input and the observed return from that state as the output of the regression function, we use the features of the time index of an episode as the input and the observed return of that episode as the output. Because states in their setup are drawn stochastically from a distribution, their analysis does not directly apply to our setting, where the inputs (time indices) form a deterministic sequence. To analyze our estimators, we therefore leverage the techniques discussed by Greene (2003) for analyzing the properties of the ordinary least-squares estimator.
Before proceeding, we impose the following constraints on the set of policies and on the basis function used for encoding the time index in both the NIS and NWIS estimators.

(a) The basis function always contains a constant feature, to incorporate a bias coefficient in least-squares regression.

(b) There exists a finite constant that uniformly bounds the norm of the features produced by the basis function.

(c) The feature matrix has full column rank, so that the matrix inverse required by the least-squares solution exists.

(d) We only consider policies that have a non-zero probability of taking every action in every state.
Satisfying condition (a) is straightforward, as it is already satisfied by all common basis functions. Intuitively, this constraint ensures that the regression problem is not ill-defined and that there exists a model in the model class that can capture a fixed constant, which corresponds to the absence of any trend. This is useful for our purpose because, in the stationary setting, there exists no trend in the expected performance across episodes for any given policy. The forecaster should be capable of inferring this fact from the data and representing it. That is, the optimal model parameter is²
(18) 
such that, for any episode,
(19) 
²If domain knowledge is available to select an appropriate basis function that can represent the performance trend of all policies in the required non-stationary environment, then all of the following finite-sample and large-sample properties can be extended to that environment as well.
Conditions (b) and (c) are also readily satisfied by popular basis functions. For example, features obtained using the Fourier basis are bounded, and features from the polynomial/identity basis are also bounded when the inputs are adequately normalized. Further, when the basis function does not repeat any feature and the number of samples exceeds the number of features, condition (c) is satisfied. This ensures that the least-squares problem is well-defined and has a unique solution.
Condition (d) ensures that the denominator in any importance ratio is bounded below, so that the importance ratios are bounded above. This condition implies that the importance sampling estimator for any evaluation policy has finite variance. Using entropy regularization with common policy parameterizations (softmax/Gaussian) can prevent violation of this condition.
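Conditions (a)-(c) are easy to check numerically for a concrete basis. The snippet below (an illustrative sketch, not the paper's code) builds a Fourier basis over normalized time indices and verifies the constant feature, the boundedness of the features, and full column rank:

```python
import numpy as np

def fourier_basis(t, order=3):
    """Fourier basis over a normalized time index t in [0, 1].
    Column 0 is the constant feature (condition (a)) and every entry
    lies in [-1, 1] (condition (b)). Illustrative helper only.
    """
    return np.cos(np.pi * np.outer(t, np.arange(order + 1)))

t = np.linspace(0.0, 1.0, 20)       # 20 normalized episode indices
Phi = fourier_basis(t)
assert np.all(Phi[:, 0] == 1.0)     # (a): constant feature present
assert np.all(np.abs(Phi) <= 1.0)   # (b): features are bounded
assert np.linalg.matrix_rank(Phi) == Phi.shape[1]  # (c): full column rank
```

Here the number of samples (20) exceeds the number of features (4) and no feature is repeated, so the least-squares problem is well-defined, as discussed above.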
A.1 Finite Sample Properties
In this subsection, finite-sample properties of NIS and NWIS are presented. Specifically, it is established that NIS is an unbiased estimator, whereas NWIS is a biased estimator, of the performance of a policy in a stationary MDP.
Theorem 1 (Unbiased NIS).
For every policy, the NIS estimator is an unbiased estimator of that policy's performance in a stationary MDP.
Proof.
(Alternate) Here we present an alternate proof of Theorem 1 that does not require invoking the result used in the proof above.
(28)  
(29)  
(30) 
where (a) writes the dot product as a summation, and (b) holds because the multiplicative constants are fixed values, as given in (12). Since the environment is stationary, the expected performance is the same in every episode; therefore,
(31) 
In the following, we focus on the terms inside the summation in (31). Without loss of generality, assume that, for the given matrix of features, the constant feature is in the last column. Let the sub-matrix consist of all the columns of the feature matrix except this column of ones, and let I be the identity matrix; then it can be seen that the term can be expressed as,
(32) 
Theorem 2 (Biased NWIS).
For every policy, the NWIS estimator may be a biased estimator of the policy's performance in a stationary MDP; that is, its expectation need not equal the true performance.
Proof.
We prove this result using a simple counterexample. Consider a basis function that consists of only the constant feature; then,
(42)  
(43)  
(44)  
(45) 
which is the WIS estimator. Therefore, as WIS is a biased estimator, NWIS is also a biased estimator of the policy's performance. ∎
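This reduction can be checked numerically. The snippet below is a sketch under the assumption that NWIS is the importance-weighted least-squares fit over the time features, which is what the reduction to WIS in (42)-(45) reflects; the data is synthetic, for illustration only.

```python
import numpy as np

# With a constant-only basis, importance-weighted least squares (NWIS)
# collapses to the WIS estimate, matching the counterexample above.
rng = np.random.default_rng(0)
rhos = rng.uniform(0.5, 2.0, size=50)     # per-episode importance ratios
returns = rng.normal(1.0, 0.3, size=50)   # per-episode returns

Phi = np.ones((50, 1))                    # constant-only basis
P = np.diag(rhos)                         # regression weights
w = np.linalg.solve(Phi.T @ P @ Phi, Phi.T @ P @ returns)

wis = np.sum(rhos * returns) / np.sum(rhos)
assert np.isclose(w[0], wis)              # NWIS coincides with WIS
```

Since WIS is biased for finite samples, the coincidence above makes the bias of NWIS under this basis immediate.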
A.2 Large Sample Properties
In this subsection, large-sample properties of NIS and NWIS are presented. Specifically, it is established that both NIS and NWIS are consistent estimators of the performance of a policy in a stationary MDP.
Theorem 3 (Consistent NIS).
For every policy, the NIS estimator is a consistent estimator of that policy's performance in a stationary MDP; that is, the estimate converges to the true performance as the number of episodes grows.
Proof.
Using (8),
(46)  
(47) 
Since the MDP is stationary, each element of the vector of per-episode importance sampling estimates is an unbiased estimate of the policy's performance. In other words, each element equals the true performance plus a mean-zero error term. Let the error vector collect all of these error terms. Now, using (19),
(48)  
(49)  
(50)  
(51)  
(52) 
Using Slutsky’s Theorem,
(53)  
(54) 
where the limit defining this matrix exists by Grenander's conditions. Informally, Grenander's conditions require that no feature degenerates to a sequence of zeros, that no single observation's feature dominates the sum of squares of its series, and that the feature matrix always has full rank. These conditions are easily satisfied by most popular basis functions used to create the input features. For formal definitions of these conditions, we refer the reader to Chapter 5 of Greene (2003).
In the following, we restrict our focus to the term inside the brackets in the second term of (54). The mean of that term is,
(55) 
Since the mean is zero, the variance of that term is given by,
(56) 
As each policy has a non-zero probability of taking any action in any state, the variance of the PDIS (or standard IS) estimator is bounded, and thus each element of the error vector is bounded. Further, as the features are bounded, each element of the product is also bounded. Therefore, as the number of episodes grows,
(57) 
As the mean is zero and the variance asymptotes to zero, the term converges in probability to zero as the number of episodes grows. Combining this with (54),