1 Introduction
Real-world problems are often nonstationary; that is, parts of the problem specification change over time. We desire autonomous systems that continually adapt by capturing the regularities in such changes, without the need to learn from scratch after every change. In this work, we address one form of lifelong learning for sequential decision making problems, wherein the set of possible actions (decisions) varies over time. Such a situation is omnipresent in real-world problems. For example, in robotics it is natural to add control components over the lifetime of a robot to enhance its ability to interact with the environment. In hierarchical reinforcement learning, an agent can create new options (Sutton et al., 1999) over its lifetime, which are in essence new actions. In medical decision support systems for drug prescription, new procedures and medications are continually discovered. In email marketing, new promotional emails are often added based on current trends. In product recommender systems, new products are constantly added to the stock, and in tutorial recommendation systems, new tutorials are regularly developed, thereby continuously increasing the number of available actions for a recommender engine. These examples capture the broad idea that, for an agent that is deployed in real-world settings, the possible decisions it can make change over time, and they motivate the question that we aim to answer: how do we develop algorithms that can continually adapt to such changes in the action set over the agent's lifetime?
Reinforcement learning (RL) has emerged as a successful class of methods for solving sequential decision making problems. However, its applications have been limited to settings where the set of actions is fixed. This is likely because RL algorithms are designed to solve a mathematical formalization of decision problems called Markov decision processes (MDPs) (Puterman, 2014), wherein the set of available actions is fixed. To begin addressing our lifelong learning problem, we first extend the standard MDP formulation to incorporate this aspect of changing action set size. Motivated by the regularities in real-world problems, we consider an underlying, unknown structure in the space of actions from which new actions are generated. We then theoretically analyze the difference between what an algorithm can achieve with only the actions that are available at one point in time, and the best that the algorithm could achieve if it had access to the entire underlying space of actions (and knew the structure of this space).
Leveraging insights from this theoretical analysis, we then study how the structure of the underlying action space can be recovered from interactions with the environment, and how algorithms can be developed to use this structure to facilitate lifelong learning.
As in the standard RL setting, when facing a changing action set, the parameterization of the policy plays an important role. The key consideration here is how to parameterize the policy and adapt its parameters when the set of available actions changes. To address this problem, we leverage the structure in the underlying action space to parameterize the policy such that it is invariant to the cardinality of the action set—changing the number of available actions does not require changes to the number of parameters or the structure of the policy. Leveraging the structure of the underlying action space also improves generalization by allowing the agent to infer the outcomes of actions similar to actions already taken. These advantages make our approach ideal for lifelong learning problems where the action set changes dynamically over time, and where quick adaptation to these changes, via generalization of prior knowledge about the impact of actions, is beneficial.
2 Related Works
Lifelong learning is a well-studied problem (Thrun, 1998; Ruvolo and Eaton, 2013; Silver et al., 2013; Chen and Liu, 2016). Predominantly, prior methods aim to address catastrophic forgetting problems in order to leverage prior knowledge for new tasks (French, 1999; Kirkpatrick et al., 2017; Lopez-Paz et al., 2017; Zenke et al., 2017). Alternatively, many lifelong RL methods consider learning online in the presence of changing transition dynamics or reward functions (Neu, 2013; Gajane et al., 2018). In our work, we look at a complementary aspect of the lifelong learning problem, wherein the actions available to the agent change over its lifetime.
Our work also draws inspiration from recent works which leverage action embeddings (Dulac-Arnold et al., 2015; He et al., 2015; Bajpai et al., 2018; Chandak et al., 2019; Tennenholtz and Mannor, 2019). Building upon their ideas, we present a new objective for learning structure in the action space, and show that the performance of the policy resulting from using this inferred structure has bounded suboptimality. Moreover, in contrast to their setup where the size of the action set is fixed, we consider the case of lifelong MDP, where the number of actions changes over time.
In the work perhaps most closely related to ours, Mandel et al. (2017) consider the setting where humans can provide new actions to an RL system. The goal in their setup is to minimize human effort by querying for new actions only at states where new actions are most likely to boost performance. In comparison, our setup considers the case where the new actions become available through some external, unknown, process and the goal is to build learning algorithms that can efficiently adapt to such changes in the action set.
3 Lifelong Markov Decision Process
MDPs, the standard formalization of decision making problems, are not flexible enough to encompass lifelong learning problems wherein the action set size changes over time. In this section we extend the standard MDP framework to model this setting.
In real-world problems where the set of possible actions changes, there is often underlying structure in the set of all possible actions (those that are available, and those that may become available). For example, tutorial videos can be described by feature vectors that encode their topic, difficulty, length, and other attributes; in robot control tasks, primitive locomotion actions like left, right, up, and down could be encoded by their change to the Cartesian coordinates of the robot, etc. Critically, we will not assume that the agent knows this structure, merely that it exists. If actions are viewed from this perspective, then the set of all possible actions (those that are available at one point in time, and those that might become available at any time in the future) can be viewed as a vector space, $\mathcal{E} \subseteq \mathbb{R}^d$.
To formalize the lifelong MDP, we first introduce the necessary variables that govern when and how new actions are added. Let $I_k$ be a random variable that indicates whether a new set of actions is added or not at the start of episode $k$, and let the frequency $\mathcal{F}: \mathbb{N} \rightarrow [0, 1]$ be the associated probability distribution over episode count, such that $\Pr(I_k = 1) = \mathcal{F}(k)$. Let $U_k \in 2^{\mathcal{E}}$ be the random variable corresponding to the set of actions that is added before the start of episode $k$. When $I_k = 1$, we assume that $U_k \neq \emptyset$, and when $I_k = 0$, we assume that $U_k = \emptyset$. Let $\mathcal{D}$ be the distribution of $U_k$ when $I_k = 1$, i.e., $U_k \sim \mathcal{D}$ if $I_k = 1$. Notice that while whether or not any actions are added can depend on the episode number (via $\mathcal{F}$), the distribution from which new actions are sampled, $\mathcal{D}$, is fixed and does not depend on the episode number, $k$, nor on any behavior of the agent.
Since we do not assume that the agent knows the structure associated with the actions, we instead provide actions to the agent as a set of discrete entities, $\mathcal{A}_k$. To this end, we define $\phi$ to be a map relating the underlying structure of the new actions to the observed set of discrete actions $\mathcal{A}_k$ for all $k$, i.e., if $u_k \subseteq \mathcal{E}$ is the set of actions added before episode $k$, then $\mathcal{A}_k = \{\phi(e_i) \mid e_i \in u_k\}$. Naturally, for most problems of interest, neither the underlying structure $\mathcal{E}$, nor the distribution $\mathcal{D}$, nor the frequency of updates $\mathcal{F}$, nor the relation $\phi$ is known; the agent only has access to the observed set of discrete actions.
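We now define the lifelong Markov decision process (LMDP) as $\mathcal{L} = (\mathcal{M}_0, \mathcal{E}, \mathcal{D}, \mathcal{F})$, which extends a base MDP $\mathcal{M}_0 = (\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma, d_0)$. $\mathcal{S}$ is the set of all possible states that the agent can be in, called the state set. $\mathcal{A}$ is the discrete set of actions available to the agent, and for $\mathcal{M}_0$ we define this set to be empty, i.e., $\mathcal{A} = \emptyset$. When the set of available actions changes and the agent observes a new set of discrete actions, $\mathcal{A}_k$, then $\mathcal{M}_{k-1}$ transitions to $\mathcal{M}_k$, such that $\mathcal{A}$ in $\mathcal{M}_k$ is the set union of $\mathcal{A}$ in $\mathcal{M}_{k-1}$ and $\mathcal{A}_k$. Apart from the available actions, other aspects of the LMDP remain the same throughout. An illustration of the framework is provided in Figure 1. We use $S_t \in \mathcal{S}$, $A_t \in \mathcal{A}$, and $R_t \in \mathbb{R}$ as random variables for denoting the state, action and reward at time $t \in \{0, 1, \dots\}$ within each episode $k$. The first state, $S_0$, comes from an initial distribution, $d_0$, and the reward function $\mathcal{R}$ is defined to be only dependent on the state such that $\mathcal{R}(s) = \mathbb{E}[R_t \mid S_t = s]$ for all $s \in \mathcal{S}$. We assume that $|R_t| \le R_{\max}$ for some finite $R_{\max}$. The reward discounting parameter is given by $\gamma \in [0, 1)$. $\mathcal{P}$ is the state transition function, such that for all $s, a, s', t$, the function $\mathcal{P}(s, a, s')$ denotes the transition probability $P(s' \mid s, e)$, where $a = \phi(e)$.
Footnote 1: For notational ease, (a) we overload the symbol $P$ for representing both probability mass and density; (b) we assume that the state set is finite; however, our primary results extend to MDPs with continuous states.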
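The way the available action set evolves under this definition can be sketched in a few lines of Python. This is only an illustrative simulation under stated assumptions: the arrival schedule `add_prob`, the sampler `uniform_square`, and the use of integer ids as the observed discrete actions are all hypothetical choices, not part of the original formulation.

```python
import random

def grow_action_set(num_episodes, add_prob, sample_new_actions, seed=0):
    """Simulate how the available action set of a lifelong MDP grows.

    add_prob(k)           -> probability that new actions arrive before episode k
    sample_new_actions(r) -> non-empty list of hidden action vectors (tuples)
    Both callables are hypothetical stand-ins for the unknown frequency and
    distribution described in the text.
    """
    rng = random.Random(seed)
    hidden = {}    # discrete action id -> hidden structure vector (never shown to the agent)
    history = []   # number of available actions at the start of each episode
    for k in range(num_episodes):
        if rng.random() < add_prob(k):          # indicator that actions are added
            for e in sample_new_actions(rng):   # newly sampled action structures
                hidden[len(hidden)] = e         # the agent only ever sees the integer id
        history.append(len(hidden))
    return hidden, history


def uniform_square(rng, n=3):
    # Assumed distribution: n points drawn uniformly from the unit square.
    return [(rng.random(), rng.random()) for _ in range(n)]


hidden, history = grow_action_set(
    num_episodes=20,
    add_prob=lambda k: 1.0 if k % 5 == 0 else 0.0,  # assumed schedule: every 5th episode
    sample_new_actions=uniform_square,
)
print(history)
```

The available set is the union of everything observed so far, so its size never decreases between episodes.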
In the most general case, new actions could be completely arbitrary and have no relation to the ones seen before. In such cases, there is very little hope of lifelong learning by leveraging past experience. To make the problem more feasible, we resort to a notion of smoothness between actions. Formally, we assume that transition probabilities in an LMDP are $\rho$-Lipschitz in the structure of actions, i.e., there exists $\rho > 0$ such that
$$\forall s, s', e_i, e_j: \quad \| P(s' \mid s, e_i) - P(s' \mid s, e_j) \|_1 \le \rho \, \| e_i - e_j \|_1. \tag{1}$$
For any given MDP $\mathcal{M}_k$ in $\mathcal{L}$, an agent's goal is to find a policy, $\pi_k$, that maximizes the expected sum of discounted future rewards. For any policy $\pi_k$, the corresponding state value function is $v^{\pi_k}(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_t \mid S_0 = s, \pi_k\right]$.
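As a small worked illustration of the expected sum of discounted future rewards, the following Python sketch estimates a state value by averaging discounted returns over sampled episodes. The reward sequences and the discount value are made-up inputs used only for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t for one episode's reward sequence."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def monte_carlo_value(episodes, gamma):
    """Average discounted return over sampled episodes that start in the same state."""
    returns = [discounted_return(ep, gamma) for ep in episodes]
    return sum(returns) / len(returns)


# Two hypothetical episodes with a single goal reward of 1.
episodes = [[0.0, 0.0, 1.0], [0.0, 1.0]]
value = monte_carlo_value(episodes, gamma=0.5)
print(value)  # 0.375
```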
4 Blessing of Changing Action Sets
Finding an optimal policy when the set of possible actions is large is difficult due to the curse of dimensionality. In the LMDP setting this problem might appear to be exacerbated, as an agent must additionally adapt to the changing levels of possible performance as new actions become available. This raises the natural question: as new actions become available, how much does the performance of an optimal policy change? If it fluctuates significantly, can a lifelong learning agent succeed by continuously adapting its policy, or is it better to learn from scratch with every change to the action set?
To answer this question, consider an optimal policy, $\pi_k^*$, for MDP $\mathcal{M}_k$, i.e., an optimal policy when considering only policies that use actions that are available during the $k$-th episode. We now quantify how suboptimal $\pi_k^*$ is relative to the performance of a hypothetical policy, $\mu_k^*$, that acts optimally given access to all possible actions.
Theorem 1.
In an LMDP, let denote the maximum distance in the underlying structure of the closest pair of available actions, i.e., , then
(2) 
Proof.
See Appendix A. ∎
With a bound on the maximum possible suboptimality, Theorem 1 presents an important connection between achievable performances, the nature of underlying structure in the action space, and a property of available actions in any given . Using this, we can make the following conclusion.
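The property of the available actions that Theorem 1 depends on, the maximum distance of the closest pair of available actions, can be computed directly when the action structure is known. The sketch below is an analysis aid only: it assumes access to the hidden action vectors (which the agent does not have) and assumes the L1 norm as the distance.

```python
def l1(u, v):
    """L1 distance between two vectors of equal length."""
    return sum(abs(a - b) for a, b in zip(u, v))


def epsilon_k(embeddings):
    """Largest nearest-neighbour distance among the available actions."""
    if len(embeddings) < 2:
        raise ValueError("need at least two actions")
    worst = 0.0
    for i, e_i in enumerate(embeddings):
        nearest = min(l1(e_i, e_j) for j, e_j in enumerate(embeddings) if j != i)
        worst = max(worst, nearest)
    return worst


# Hypothetical structure vectors of four available actions.
available = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (4.0, 1.0)]
eps = epsilon_k(available)
print(eps)  # 3.0, set by the isolated action at (4, 1)
```

An isolated action far from all others dominates this quantity, which is why densely covering the structure space tightens the bound.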
Corollary 1.
Let be the smallest closed set such that, . We refer to as the elementwisesupport of . If the elementwisesupport of in an LMDP is , then as the suboptimality vanishes. That is,
Proof.
See Appendix A. ∎
Through Corollary 1, we can now establish that the change in optimal performance will eventually converge to zero as new actions are repeatedly added. An intuitive way to observe this result would be to notice that every new action that becomes available indirectly provides more information about the underlying, unknown structure of the action space. However, in the limit, as the size of the available action set increases, the information provided by each new action vanishes and thus performance saturates.
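The saturation argument can be illustrated numerically. The following sketch assumes, purely for illustration, that the structure space is the unit square, that new actions are sampled uniformly from it, and that coverage is measured on a fixed grid of probe points with the L1 distance. Under these assumptions the largest distance from any probe to its nearest available action can only shrink as actions are added.

```python
import random

def l1(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


def coverage_gap(actions, probes):
    """Largest L1 distance from any probe point to its nearest available action."""
    return max(min(l1(p, a) for a in actions) for p in probes)


rng = random.Random(0)
# Assumed structure space: the unit square, probed on an 11 x 11 grid.
probes = [(i / 10, j / 10) for i in range(11) for j in range(11)]

actions, gaps = [], []
for change in range(6):
    # Each change adds a batch of uniformly sampled actions (an assumption).
    actions += [(rng.random(), rng.random()) for _ in range(2 ** change)]
    gaps.append(coverage_gap(actions, probes))
print(gaps)
```

Because the available set only grows, each entry of `gaps` is no larger than the one before it.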
Certainly, in practice, we can never have $k \rightarrow \infty$, but this result is still advantageous. Even when the underlying structure $\mathcal{E}$, the distribution $\mathcal{D}$, the change frequency $\mathcal{F}$, and the mapping relation $\phi$ are all unknown, it establishes the fact that the difference between the best performances in successive changes will remain bounded and will not fluctuate arbitrarily. This opens up new possibilities for developing algorithms that do not need to start from scratch after new actions are added, but rather can build upon their past experiences using updates to their existing policies that efficiently leverage estimates of the structure of $\mathcal{E}$ to adapt to new actions.
5 Learning with Changing Action Sets
Theorem 1 characterizes what can be achieved in principle; however, it does not specify how to achieve it, that is, how to find the optimal policy $\pi_k^*$ efficiently. Using any parameterized policy, $\pi$, which acts directly in the space of observed actions, suffers from one key practical drawback in the LMDP setting: the parameterization is deeply coupled with the number of actions that are available. Not only is the meaning of each parameter coupled with the number of actions, but often the number of parameters that the policy has is dependent on the number of possible actions. This makes it unclear how the policy should be adapted when additional actions become available. A trivial solution would be to ignore the newly available actions and continue only using the previously available actions. However, this is clearly myopic, and will prevent the agent from achieving the better long-term returns that might be possible using the new actions.
To address this parameterization problem, instead of having the policy, $\pi$, act directly in the observed action space, $\mathcal{A}$, we propose an approach wherein the agent reasons about the underlying structure of the problem in a way that makes its policy parameterization invariant to the number of actions that are available. In order to do so, we split the policy parameterization into two components. The first component corresponds to the state conditional policy responsible for making the decisions, $\beta: \mathcal{S} \times \hat{\mathcal{E}} \rightarrow [0, 1]$, where $\hat{\mathcal{E}} \subseteq \mathbb{R}^d$. The second component corresponds to $\hat{\phi}: \hat{\mathcal{E}} \times \mathcal{A} \rightarrow [0, 1]$, an estimator of the relation $\phi$, which is used to map the output of $\beta$ to an action in the set of available actions. That is, an $E_t \in \hat{\mathcal{E}}$ is sampled from $\beta(S_t, \cdot)$ and then $\hat{\phi}(E_t)$ is used to obtain the action $A_t$. Together, $\beta$ and $\hat{\phi}$ form a complete policy, and $\hat{\mathcal{E}}$ corresponds to the inferred structure in action space.
One of the prime benefits of estimating $\phi$ with $\hat{\phi}$ is that it makes the parameterization of $\beta$ invariant to the cardinality of the action set: changing the number of available actions does not require changing the number of parameters of $\beta$. Instead, only the parameterization of $\hat{\phi}$, the estimator of the underlying structure in action space, must be modified when new actions become available. We show next that the update to the parameters of $\hat{\phi}$ can be performed using supervised learning methods that are independent of the reward signal and thus typically more efficient than RL methods.
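A minimal sketch of such a two-component policy is given below. It is not the authors' implementation: the class names `DecisionPolicy` and `ActionMapper`, the linear form of the decision component, and the softmax over negative squared distances used to pick a discrete action are all assumptions made for illustration. The point it demonstrates is that adding an action changes only the mapping component.

```python
import math
import random

class ActionMapper:
    """Hypothetical estimator of the action relation: one learned vector per action."""

    def __init__(self):
        self.vectors = []  # index = discrete action id

    def add_action(self, vector):
        # Only this component grows when a new action becomes available.
        self.vectors.append(tuple(vector))
        return len(self.vectors) - 1

    def probabilities(self, e_hat, temperature=1.0):
        # Softmax over negative squared distances (an assumed functional form).
        scores = [-sum((x - y) ** 2 for x, y in zip(e_hat, v)) / temperature
                  for v in self.vectors]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        return [w / total for w in weights]


class DecisionPolicy:
    """State-conditional component with a fixed parameter count (linear, for illustration)."""

    def __init__(self, weights):
        self.weights = weights  # rows: structure dimensions, columns: state dimensions

    def act(self, state):
        return tuple(sum(w * s for w, s in zip(row, state)) for row in self.weights)


def select_action(policy, mapper, state, rng):
    e_hat = policy.act(state)             # point in the inferred structure space
    probs = mapper.probabilities(e_hat)   # distribution over available actions
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


policy = DecisionPolicy(weights=[[1.0, 0.0], [0.0, 1.0]])
mapper = ActionMapper()
mapper.add_action((0.0, 0.0))
mapper.add_action((1.0, 1.0))
n_params_before = sum(len(row) for row in policy.weights)
mapper.add_action((5.0, 5.0))  # a new action arrives
n_params_after = sum(len(row) for row in policy.weights)
action = select_action(policy, mapper, state=(5.0, 5.0), rng=random.Random(0))
print(n_params_before, n_params_after, action)
```

In this toy setup the decision component keeps four parameters before and after the change, while the mapper gains one vector for the new action.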
While our proposed parameterization of the policy using both $\beta$ and $\hat{\phi}$ has the advantages described above, the performance of $\beta$ is now constrained by the quality of $\hat{\phi}$, as in the end $\hat{\phi}$ is responsible for selecting an action from $\mathcal{A}$. Ideally we want $\hat{\phi}$ to be such that it lets $\beta$ be both: (a) invariant to the cardinality of the action set for practical reasons and (b) as expressive as a policy, $\pi$, explicitly parameterized for the currently available actions. Similar trade-offs have been considered in the context of learning optimal state embeddings for representing sub-goals in hierarchical RL (Nachum et al., 2018). For our lifelong learning setting, we build upon their method to efficiently estimate $\hat{\phi}$ in a way that provides bounded suboptimality. Specifically, we make use of an additional inverse dynamics function, $\varphi$, that takes as input two states, $s$ and $s'$, and produces as output a prediction of which $e \in \mathcal{E}$ caused the transition from $s$ to $s'$. Since the agent does not know $\phi$, when it observes a transition from $s$ to $s'$ via action $a$, it does not know which $e$ caused this transition. So, we cannot train $\varphi$ to make good predictions using the actual action, $e$, that caused the transition. Instead, we use $\hat{\phi}$ to transform the prediction of $\varphi$ from $\hat{e} \in \hat{\mathcal{E}}$ to $a \in \mathcal{A}$, and train both $\varphi$ and $\hat{\phi}$ so that this process accurately predicts which action, $a$, caused the transition from $s$ to $s'$. Moreover, rather than viewing $\varphi$ as a deterministic function mapping states $s$ and $s'$ to predictions $\hat{e}$, we define $\varphi$ to be a distribution over $\hat{\mathcal{E}}$ given two states, $s$ and $s'$.
For any given in LMDP , let and denote the two components of the overall policy and let denote the best overall policy that can be represented using some fixed . The following theorem bounds the suboptimality of .
Theorem 2.
For an LMDP , If there exists a and such that
(3) 
where and , then
Proof.
See Appendix A. ∎
By quantifying the impact has on the suboptimality of achievable performance, Theorem 2 provides the necessary constraints for estimating . At a high level, Equation (3) ensures to be such that it can be used to generate an action corresponding to any to transition. This allows to leverage and choose the required action that induces the state transition needed for maximizing performance. Thereby, following (3), suboptimality would be minimized if and are optimized to reduce the supremum of KL divergence over all and . In practice, however, the agent does not have access to all possible states, rather it has access to a limited set of samples collected from interactions with the environment. Therefore, instead of the supremum, we propose minimizing the average over all and from a set of observed transitions,
(4) 
Equation (4) suggests that would be minimized when equals , but using (4) directly in the current form is inefficient as it requires computing KL over all probable for a given and . To make it practical, we make use of the following property.
Property 1.
For some constant C, is lower bounded by
(5) 
Proof.
See Appendix B. ∎
As minimizing is equivalent to maximizing , we consider maximizing the lower bound obtained from Property 1. In this form, it is now practical to optimize (5) just by using the observed samples. As this form is similar to the objective for a variational autoencoder, the inner expectation can be efficiently optimized using the reparameterization trick (Kingma and Welling, 2013). is the prior on , and we treat it as a hyperparameter that allows the KL to be computed in closed form.
Importantly, note that this optimization procedure only requires individual transitions, , and is independent of the reward signal. Hence, at its core, it is a supervised learning procedure. This means that learning good parameters for tends to require far fewer samples than optimizing (which is an RL problem). This is beneficial for our approach because , the component of the policy where new parameters need to be added when new actions become available, can be updated efficiently. As both and are invariant to action cardinality, they do not require new parameters when new actions become available. Additional implementation level details are available in Appendix D.3.
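The ingredients of such a variational-autoencoder-style objective can be sketched as follows. This is a hedged illustration, not the paper's implementation: it assumes a diagonal Gaussian inverse-dynamics distribution, a standard normal prior (one choice for which the KL has a closed form), and a hypothetical weight `kl_weight` on the KL term. The function names are likewise invented for this example.

```python
import math
import random

def gaussian_kl_to_standard_normal(mean, log_var):
    """Closed-form KL( N(mean, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mean, log_var))


def reparameterized_sample(mean, log_var, rng):
    """Draw mean + std * noise, so the sample is a smooth function of (mean, log_var)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mean, log_var)]


def lower_bound_term(log_prob_action, mean, log_var, kl_weight=1.0):
    """Per-transition objective: log-likelihood of the taken action minus weighted KL."""
    return log_prob_action - kl_weight * gaussian_kl_to_standard_normal(mean, log_var)


rng = random.Random(0)
# Hypothetical inverse-dynamics output for one observed transition (s, a, s').
mean, log_var = [0.5, -0.5], [0.0, 0.0]
e_hat = reparameterized_sample(mean, log_var, rng)
kl = gaussian_kl_to_standard_normal(mean, log_var)
# Suppose the action mapper assigns probability 0.8 to the action actually taken.
objective = lower_bound_term(math.log(0.8), mean, log_var, kl_weight=0.1)
print(kl, objective)
```

Only the transition tuple enters this computation; no reward appears, which is what makes the update a supervised one.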
6 Algorithm
When a new set of actions, , becomes available, the agent should leverage the existing knowledge and quickly adapt to the new action set. Therefore, during every change in , the ongoing best components of the policy, and , in are carried over, i.e., and . For lifelong learning, the following property illustrates a way to organize the learning procedure so as to minimize the suboptimality in each , for all .
Property 2.
(Lifelong Adaptation and Improvement) In an LMDP, let denote the difference of performance between and the best achievable using our policy parameterization, then the overall suboptimality can be expressed as,
(6) 
where is used in the subscript to emphasize the respective MDP in .
Proof.
See Appendix C. ∎
Property 2 illustrates a way to understand the impact of and by splitting the learning process into an adaptation phase and a policy improvement phase. These two iterative phases are the crux of our algorithm for solving an LMDP . Based on this principle, we call our algorithm LAICA: lifelong adaptation and improvement for changing actions. We now briefly discuss the LAICA algorithm; a detailed description with pseudocode is presented in Appendix D.2.
Whenever new actions become available, adaptation is prone to cause a performance drop as the agent has no information about when to use the new actions, and so its initial uses of the new actions may be at inappropriate times. Following Property 1, we update so as to efficiently infer the underlying structure and minimize this drop. That is, for every , is first adapted to in the adaptation phase by adding more parameters for the new set of actions and then optimizing (5). After that, is fixed and is improved towards in the policy improvement phase, by updating the parameters of using the policy gradient theorem (Sutton et al., 2000). These two procedures are performed sequentially whenever transitions to , for all , in an LMDP . An illustration of the procedure is presented in Figure 2.
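The alternation between the two phases can be summarized by the following control-flow sketch. It is a schematic under assumptions, not the pseudocode of Appendix D.2: `adapt` and `improve` are hypothetical callables standing in for the supervised update of the action-structure estimator and the policy-gradient update of the decision component, and `episodes_per_phase` is an invented parameter.

```python
def lifelong_loop(action_batches, adapt, improve, episodes_per_phase=2):
    """Alternate an adaptation phase and a policy-improvement phase per action-set change.

    adapt(actions)   -> hypothetical supervised update of the action-structure estimator
    improve(actions) -> hypothetical policy-gradient update of the decision component
    """
    available, log = [], []
    for batch in action_batches:
        available = available + list(batch)   # the action set only grows
        for _ in range(episodes_per_phase):
            adapt(available)                   # decision component held fixed
            log.append(("adapt", len(available)))
        for _ in range(episodes_per_phase):
            improve(available)                 # estimator held fixed
            log.append(("improve", len(available)))
    return log


calls = lifelong_loop(
    action_batches=[["a0", "a1"], ["a2"]],
    adapt=lambda actions: None,     # placeholder update
    improve=lambda actions: None,   # placeholder update
    episodes_per_phase=1,
)
print(calls)
```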
7 Empirical Analysis
In this section, we aim to empirically compare the following methods,


Baseline(1): The policy is reinitialised and the agent learns from scratch after every change.

Baseline(2): New parameters corresponding to new actions are added/stacked to the existing policy (and previously learned parameters are carried forward asis).

LAICA(1): The proposed approach that leverages the structure in the action space. To act in continuous space of inferred structure, we use DPG (Silver et al., 2014) to optimize .

LAICA(2): A variant of LAICA which uses an actor-critic (Sutton and Barto, 2018) to optimize .
To demonstrate the effectiveness of our proposed method(s) on lifelong learning problems, we consider a maze environment and two domains corresponding to real-world applications, all with a large set of changing actions. For each of these domains, the full set of actions was randomly split into five equal sets. Initially, the agent only had the actions available in the first set, and after every change the next set of actions was additionally made available. In the following paragraphs we briefly outline the domains; full details are deferred to Appendix D.1.
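The evaluation protocol described above, a random split into five equal sets that are released cumulatively, can be sketched as follows. The function name, the fixed seed, and the choice to raise an error when the number of actions is not divisible by the number of stages are assumptions of this sketch.

```python
import random

def split_into_stages(actions, num_stages=5, seed=0):
    """Randomly partition actions into equal-size sets and return the cumulative action sets."""
    actions = list(actions)
    if len(actions) % num_stages != 0:
        raise ValueError("number of actions must be divisible by num_stages")
    random.Random(seed).shuffle(actions)
    size = len(actions) // num_stages
    stages = [actions[i * size:(i + 1) * size] for i in range(num_stages)]
    available, cumulative = [], []
    for stage in stages:
        available = available + stage      # previously available actions are kept
        cumulative.append(list(available))
    return cumulative


schedule = split_into_stages(range(20))
print([len(s) for s in schedule])  # [4, 8, 12, 16, 20]
```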
Maze Domain.
As a proof-of-concept, we constructed a continuous-state maze environment where the state is comprised of the coordinates of the agent's location and its objective is to reach a fixed goal state. The agent has a total of actions corresponding to displacements in different directions of different magnitudes. This domain provides a simple yet challenging testbed that requires solving a long-horizon task using a large, changing action set, in the presence of a single goal reward.
Case Study: Real-World Recommender Systems.
We consider two real-world applications of large-scale recommender systems that require decision making over multiple time steps and where the number of possible decisions varies over the lifetime of the system. Using millions of clicks collected by interaction with real users, we built lifelong learning domains for:


A web-based video-tutorial platform that has a recommendation engine to suggest a series of tutorial videos. The aim is to meaningfully engage the users in a learning activity. In total, tutorials were considered for recommendation.

A professional multimedia editing software, where sequences of tools inside the software need to be recommended. The aim is to increase user productivity and assist users in quickly achieving their end goal. In total, tools were considered for recommendation.
Results.
The plots in Figures 3 and 4 present the evaluations on the domains considered. The advantage of LAICA over Baseline(1) can be attributed to its policy parameterization. The decision-making component of the policy, $\beta$, being invariant to the action cardinality, can be readily leveraged after every change without having to be reinitialized. This demonstrates that efficiently reusing past knowledge can improve data efficiency over the approach that learns from scratch every time. Compared to Baseline(2), which also does not start from scratch and reuses the existing policy, we notice that the variants of the LAICA algorithm still perform favorably. As evident from the plots in Figures 3 and 4, while Baseline(2) does a good job of preserving the existing policy, it fails to efficiently capture the benefit of new actions. While the policy parameters in both LAICA and Baseline(2) are improved using policy gradients, the superior performance of LAICA can be attributed to the adaptation procedure incorporated in LAICA, which aims at efficiently inferring the underlying structure in the space of actions. Overall, LAICA(2) performs almost twice as well as both the baselines on all of the tasks considered. In the maze domain, even the best setting for Baseline(2) performed inconsistently: due to the sparse reward nature of the task, which only had a big positive reward on reaching the goal, it failed on certain trials, resulting in high variance.
8 Conclusion
In this work we established the lifelong MDP setup for dealing with action sets that change over time. Our proposed approach then leveraged the structure in the action space such that an existing policy can be efficiently adapted to the new set of available actions. Superior performances on both synthetic and large-scale real-world environments demonstrate the benefits of the proposed LAICA algorithm. To the best of our knowledge, this is the first work to address the problem of lifelong learning with a changing action set.
References
 Sutton et al. (1999) Richard S Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 112(1-2):181–211, 1999.
 Puterman (2014) Martin L Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014.
 Thrun (1998) Sebastian Thrun. Lifelong learning algorithms. In Learning to learn, pages 181–209. Springer, 1998.

 Ruvolo and Eaton (2013) Paul Ruvolo and Eric Eaton. ELLA: An efficient lifelong learning algorithm. In International Conference on Machine Learning, pages 507–515, 2013.
 Silver et al. (2013) Daniel L Silver, Qiang Yang, and Lianghao Li. Lifelong machine learning systems: Beyond learning algorithms. In 2013 AAAI spring symposium series, 2013.
 Chen and Liu (2016) Zhiyuan Chen and Bing Liu. Lifelong machine learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 10(3):1–145, 2016.
 French (1999) Robert M French. Catastrophic forgetting in connectionist networks. Trends in cognitive sciences, 3(4):128–135, 1999.

 Kirkpatrick et al. (2017) James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526, 2017.
 Lopez-Paz et al. (2017) David Lopez-Paz et al. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, pages 6467–6476, 2017.
 Zenke et al. (2017) Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 3987–3995. JMLR.org, 2017.
 Neu (2013) Gergely Neu. Online learning in non-stationary Markov decision processes. CoRR, 2013.
 Gajane et al. (2018) Pratik Gajane, Ronald Ortner, and Peter Auer. A sliding-window algorithm for Markov decision processes with arbitrarily changing rewards and transitions. CoRR, abs/1805.10066, 2018.
 Dulac-Arnold et al. (2015) Gabriel Dulac-Arnold, Richard Evans, Hado van Hasselt, Peter Sunehag, Timothy Lillicrap, Jonathan Hunt, Timothy Mann, Theophane Weber, Thomas Degris, and Ben Coppin. Deep reinforcement learning in large discrete action spaces. arXiv preprint arXiv:1512.07679, 2015.
 He et al. (2015) Ji He, Jianshu Chen, Xiaodong He, Jianfeng Gao, Lihong Li, Li Deng, and Mari Ostendorf. Deep reinforcement learning with a natural language action space. arXiv preprint arXiv:1511.04636, 2015.
 Bajpai et al. (2018) Aniket Nick Bajpai, Sankalp Garg, et al. Transfer of deep reactive policies for mdp planning. In Advances in Neural Information Processing Systems, pages 10965–10975, 2018.
 Chandak et al. (2019) Yash Chandak, Georgios Theocharous, James Kostas, Scott Jordan, and Philip S. Thomas. Learning action representations for reinforcement learning. International Conference on Machine Learning, 2019.
 Tennenholtz and Mannor (2019) Guy Tennenholtz and Shie Mannor. The natural language of actions. International Conference on Machine Learning, 2019.
 Mandel et al. (2017) Travis Mandel, Yun-En Liu, Emma Brunskill, and Zoran Popović. Where to add actions in human-in-the-loop reinforcement learning. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
 Nachum et al. (2018) Ofir Nachum, Shixiang Gu, Honglak Lee, and Sergey Levine. Near-optimal representation learning for hierarchical reinforcement learning. arXiv preprint arXiv:1810.01257, 2018.
 Kingma and Welling (2013) Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
 Sutton et al. (2000) Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063, 2000.
 Silver et al. (2014) David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In ICML, 2014.
 Sutton and Barto (2018) Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
 Kakade and Langford (2002) Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In ICML, volume 2, pages 267–274, 2002.
 Kearns and Singh (2002) Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time. Machine learning, 49(2-3):209–232, 2002.
 Pirotta et al. (2013) Matteo Pirotta, Marcello Restelli, Alessio Pecorino, and Daniele Calandriello. Safe policy iteration. In International Conference on Machine Learning, pages 307–315, 2013.
 Achiam et al. (2017) Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 22–31. JMLR.org, 2017.
 Devroye et al. (2017) Luc Devroye, László Györfi, Gábor Lugosi, and Harro Walk. On the measure of Voronoi cells. Journal of Applied Probability, 2017.
 Shani et al. (2005) Guy Shani, David Heckerman, and Ronen I Brafman. An MDP-based recommender system. Journal of Machine Learning Research, 6(Sep):1265–1295, 2005.
 Higgins et al. (2017) Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, volume 3, 2017.
 Konidaris et al. (2011) George Konidaris, Sarah Osentoski, and Philip S Thomas. Value function approximation in reinforcement learning using the Fourier basis. In AAAI, volume 6, page 7, 2011.
Lifelong Learning with a Changing Action Set
(Supplementary Material)
Preliminary
For the purpose of our results, we need to bound the shift in the state distribution between two policies. Techniques for doing so have been previously studied in the literature [Kakade and Langford, 2002, Kearns and Singh, 2002, Pirotta et al., 2013, Achiam et al., 2017]. Specifically, we cover this preliminary result based on the work by Achiam et al. [2017].
The discounted state distribution, for all , for a policy is given by,
(7) 
Let the shift in state distribution between any given policies and be denoted as , such that
(8)  
(9) 
For any policy , let denote the matrix corresponding to transition probabilities as a result of . Then (9) can be rewritten as,
(10)  
(11) 
To simplify (11), let and .
Then,
, and therefore (11) can be written as,
(12)  
(13)  
(14) 
Note that,
(15) 
(16) 
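To make the notion of a shift in the discounted state distribution concrete, the sketch below computes such distributions for two fixed policies on a two-state chain and measures their L1 difference. It assumes the common normalized form, $(1-\gamma)\sum_t \gamma^t \Pr(S_t = s)$, as used by Achiam et al. [2017]; the transition matrices are hypothetical and the infinite sum is truncated at a finite horizon.

```python
def discounted_state_distribution(d0, P, gamma, horizon=2000):
    """(1 - gamma) * sum_t gamma**t * Pr(S_t = s), truncated at `horizon` steps.

    d0: initial state probabilities; P[s][s2]: transition probability under a fixed policy.
    """
    n = len(d0)
    dist = [0.0] * n
    current = list(d0)
    weight = 1.0 - gamma
    for _ in range(horizon):
        for s in range(n):
            dist[s] += weight * current[s]
        # Propagate the state probabilities one step forward.
        current = [sum(current[s] * P[s][s2] for s in range(n)) for s2 in range(n)]
        weight *= gamma
    return dist


P_stay = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical policy that never leaves its state
P_move = [[0.0, 1.0], [0.0, 1.0]]  # hypothetical policy that always moves to state 1
d_stay = discounted_state_distribution([1.0, 0.0], P_stay, gamma=0.9)
d_move = discounted_state_distribution([1.0, 0.0], P_move, gamma=0.9)
shift = sum(abs(a - b) for a, b in zip(d_stay, d_move))
print(d_stay, d_move, shift)
```

In this example the two policies disagree from the first step onward, so the shift is close to 1.8, near its maximum possible value of 2.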
Appendix A SubOptimality
Theorem 1.
In an LMDP, let denote the maximum distance in the underlying structure of the closest pair of available actions, i.e., , then
(17) 
Proof.
We begin by defining to be a policy where the actions of the policy are restricted to the actions available in . That is, any action from is mapped to the closest , where is in the available action set. Notice that the best policy, , using the available set of actions is always better than or equal to , i.e., . Therefore,
(18) 
On expanding the corresponding for both the policies in (18) using (7),
(19)  
(20) 
We can then upper bound (20) by factoring out the maximum possible reward,
(21)  
(22)  
(23) 
For any action taken by the policy , let denote the action for obtained by mapping to the closest action in the available set, then expanding (23), we get,
(24)  
(25)  
(26) 
From the Lipschitz condition (1), we know that . As corresponds to the closest available action for , the maximum distance for is bounded by . Combining (26) with these two observations, we get the desired result,
(27) 
∎
Corollary 1.
Let be the smallest closed set such that, . We refer to as the elementwisesupport of . If the elementwisesupport of in an LMDP is , then as the suboptimality vanishes. That is,
Proof.
Let be independent identically distributed random vectors in . Let define a partition of in sets , such that contains all points in whose nearest neighbor among is . Each such forms a Voronoi cell. Now using the condition on full elementwise support, we know from the distribution free result by Devroye et al. [2017] that the diameter() converges to at the rate as (Theorem 4, Devroye et al. [2017]). As corresponds to the maximum distance between closest pair of points in , . Therefore, when then ; consequently and thus .
∎
Theorem 2.
For an LMDP , If there exists a and such that
(28) 
where and , then
Proof.
We begin by noting that,
(29) 
Using Theorem 1,
(30) 
Now we focus on bounding the last two terms in (30). Following steps similar to (20) and (23), they can be bounded as,
(31)  
(32)  
(33)  
(34) 
where stands for total variation distance. Using Pinsker’s inequality,
(35)  
(36) 
where, and . As condition (28) ensures that maximum KL divergence error between an actual and an action that can be induced through for transitioning from to is bounded by , we get the desired result,
(37) 
∎
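The step from total variation to KL divergence in the proof above relies on Pinsker's inequality, which states that the total variation distance is at most the square root of half the KL divergence. A quick numerical check on two made-up discrete distributions:

```python
import math

def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q > 0 wherever p > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0.0)


# Hypothetical next-state distributions under two different actions.
p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]
tv = total_variation(p, q)
bound = math.sqrt(0.5 * kl_divergence(p, q))
print(tv, bound)  # the first value does not exceed the second
```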
Appendix B Lower Bound Objective For Adaptation
(38)  
(39) 
where is a constant corresponding to the entropy term in KL that is independent of . Continuing, we take the negative on both sides,
(40)  
(41)  
(42)  
(43) 
where is the normalizing factor. As is always positive, we obtain the following lower bound,
(45)  
(46)  
(47) 
where is another constant consisting of and is independent of .
Now, let us focus on , which represents the probability of the action given the transition . Notice that is selected by only using . Therefore, given , the probability of is independent of everything else,
(48) 
Let