continuous_state_sa
Code for running experiments involving learning state abstractions for continuous state RL problems.
view repo
Can simple algorithms with a good representation solve challenging reinforcement learning problems? In this work, we answer this question in the affirmative, where we take "simple learning algorithm" to be tabular Q-Learning, the "good representations" to be a learned state abstraction, and "challenging problems" to be continuous control tasks. Our main contribution is a learning algorithm that abstracts a continuous state-space into a discrete one. We transfer this learned representation to unseen problems to enable effective learning. We provide theory showing that learned abstractions maintain a bounded value loss, and we report experiments showing that the abstractions empower tabular Q-Learning to learn efficiently in unseen tasks.
READ FULL TEXT VIEW PDFCode for running experiments involving learning state abstractions for continuous state RL problems.
Finding the right representation is critical for effective reinforcement learning (RL). A domain like Backgammon can be tractable given a simple representation (Tesauro, 1995), but replace this typical input stream with a 3D point cloud of a Backgammon board, and the problem becomes intractable (Konidaris, 2019). In this context, representation learning could be thought of as a learning process that enables faster learning in future. In this paper we explore the limits of this reasoning. Specifically, we study whether unseen continuous control problems can be solved by simple tabular RL algorithms using a state-abstraction function learned from previous problems.
Identifying useful abstractions has long been a goal of AI and RL; indeed, understanding abstraction’s role in intelligence was one of the areas of inquiry of the 1956 Dartmouth Summer AI Workshop (McCarthy et al., 1955), and has been an active area of study since (Brooks, 1991; Dayan and Hinton, 1993; Singh et al., 1995; Sutton et al., 1999; Parr and Russell, 1998; Dietterich, 2000; Li et al., 2006). We narrow this study by concentrating on state abstractions that translate a continuous state space into a small discrete one, suitable for use with “tabular” learning algorithms which are simple and well-understood (Sutton and Barto, 2018). Prior work has studied a similar setting leading to tree-based methods for abstracting continuous state spaces (Moore, 1994; Uther and Veloso, 1998; Feng et al., 2004; Menashe and Stone, 2018), algorithms for finding state abstractions suitable for transfer (Walsh et al., 2006; Cobo et al., 2011, 2012; Abel et al., 2018), and methods for learning succinct representations of continuous states (Whiteson et al., 2007; Jonschkowski and Brock, 2015).
More concretely, we introduce and analyze an algorithm for learning a state abstraction. We then transfer the learned abstraction to help perform efficient reinforcement learning in unseen problems. Notably, this algorithm is well suited to continuous domains—after training, it outputs a state abstraction that maps continuous states into a small, finite state space, suitable for use with tabular RL algorithms. We present initial analysis and support for this abstraction-learning algorithm, emphasizing its ability to enable downstream tabular RL algorithms to solve new problems not seen during the training of the abstraction.
We provide a theorem on the sample complexity of learning the abstraction function, allowing us to relate the value loss under the learned abstraction function to both the number of samples used for training the abstraction function and to the Rademacher complexity of the set of abstraction functions in the hypothesis space. Moreover, our central experimental finding is that the learned state abstraction enables tabular Q-learning to perform sample efficient reinforcement learning in problems with continuous state spaces.
We begin with background on RL and state abstraction.
We take the standard treatment of RL (Sutton and Barto, 2018)
: An agent learns to make decisions that maximize reward in an environment through interaction alone. We make the usual assumption that the environment can be modeled by a Markov Decision Process (MDP)
(Puterman, 2014).State abstraction describes methods for reducing the size of an MDP’s state space, typically by aggregating states together in the environmental MDP. With a smaller state space that preserves some characteristics, RL algorithms can often learn to make good decisions from fewer samples and with less computation.
More formally, a state abstraction is a function, , that maps each ground state, , into an abstract state, . Usually, such state abstractions are chosen so that , thus the state space is more tractable to work with (either by making exploration, planning, or other aspects of decision making easier).
When the environment’s underlying state space is continuous however, a natural family of relevant abstractions are those that translate the continuous state space into a discrete one. Such functions can dramatically simplify problems that are otherwise intractable for certain types of RL algorithms (Moore, 1994; Uther and Veloso, 1998; Lee and Lau, 2004), like traditional Q-Learning (Watkins and Dayan, 1992) and TD-Learning (Sutton, 1988).
In order to study the sample complexity of our abstraction learning algorithm we use a tool from statistical learning theory referred to as Rademacher complexity
(Bartlett and Mendelson, 2002; Mohri et al., 2018). Consider a function , and an arbitrarily large set of such functions . We define Rademacher complexity of this set, , as follows:where
, referred to as Rademacher random variables, are drawn uniformly at random from
. One could think of these variables as independent and identically distributed noise. With this perspective, the average can be thought of as the covariance between and noise, or in other words, how well matches the noise. If for any realization of noise there exists an for which this average is large, then we can accurately learn noise, and the Rademacher complexity is large.We can extend the previous Rademacher definition to vector-valued functions that map to
. Imagine a function where . Define the set of such functions . We similarly define the Rademacher complexity of as follows:where are drawn uniformly randomly from .
We now summarize relevant prior literature.
We study state abstractions that map from continuous states to discrete ones—in this sense, we offer a method for learning to discretize the states of a continuous MDP. Prior literature has introduced algorithms for the same purpose. Moore (1994)
introduced the Parti-Game algorithm, which uses a decision tree to dynamically partition a continuous state space based on the need for further exploration. That is, as data about the underlying environment is collected, state partitions are refined depending on a minimax score with respect to an adversary that prevents the learning algorithm from reaching the goal (and knows the current partitioning scheme). For Parti-Game to be applied, we must assume 1) the transition function is deterministic, 2) the MDP is goal-based and the goal state is known, and 3) a local greedy controller is available.
Feng et al. (2004) also make use of a tree-based approach—this time, kd-trees (Friedman et al., 1976)—to dynamically partition a continuous MDP’s state space into discrete regions. In contrast to Parti-Game, partitions are chosen based on value equivalence, thereby enabling a form of closure under the Bellman Equation.
Chapman and Kaelbling (1991) study tree-based partitioning as a means of generalizing knowledge in RL, leading to the development of the G algorithm. The G algorithm constructs a data-dependent tree of Q-value partitions based on which Q value can adequately summarize different regions of the state space. Over time, the tree will grow to sufficiently represent the needed distinctions in states. Further work uses decision trees of different forms to partition complex (and often continuous) state spaces into discrete models (McCallum, 1996; Uther and Veloso, 1998).
Menashe and Stone (2018) also introduced an algorithm for abstracting continuous state, with the goal of inducing a small, tractable decision problem. They present RCAST (Recursive Cluster-based Abstraction Synthesis Technique), a technique for constructing a state abstraction that maps continuous states to discrete ones. Like the other tree-based methods discussed, RCAST uses kd-trees to partition the state space. The key insight at the core of RCAST is to partition based on the ability to predict different factors that characterize the state space. This is the main departure between the approach of RCAST and our own algorithm, which is not tailored to factored representations. Krose and Van Dam (1992), who present an adaptive quantization for continuous state spaces, to be used with simple RL algorithms like TD-learning. Lee and Lau (2004) offer a similar technique based on vector quantization that performs partitioning based directly on data gathered by TD learning algorithm. Whiteson et al. (2007) introduce a method for adaptive tile coding, to be used in value function approximation. Liang et al. (2016) presented evidence that “shallow” learning algorithms can achieve competitive performance to many Deep RL algorithms in Atari games. Their approach constructs features that are well suited to the structure of Atari games, including properties like relative object and color locations. The main result of the work shows that with a well crafted set of features, even a simple learning algorithm can achieve competitive scores in Atari games.
We are not the first to explore transferring state abstractions. Indeed, Walsh et al. (2006) studied the process of transferring state abstractions across MDPs drawn from the same distribution. Abel et al. (2018)
builds on this work by introducing PAC abstractions, which are guaranteed to retain their usefulness over a distribution of tasks. Notably, similar to our main theoretical result, PAC abstractions also retain value with respect to a distribution of MDPs with high probability based on the value loss of the abstraction family
(Abel et al., 2016). The crucial difference is that, again, PAC abstractions are tailored to discrete state spaces, and do not extend to continuous ones. Cobo et al. (2011) and Cobo et al. (2012) study methods for finding state abstractions based on a demonstrator’s behavior. Similarly to our approach, their method is based on finding abstract states that can be used to predict what a demonstrator will do in those clusters. The key differentiating factor is that our algorithm targets continuous state spaces, while theirs focuses on discrete state spaces.Majeed and Hutter (2018) discussed Q-Learning convergence in non-Markov decision problems. In particular, Theorem 7 of their work reports that Q-Learning converges in some non-MDPs, under relatively mild assumptions. They go on to characterize the necessary and sufficient conditions for Q-Learning convergence, which does in fact extend to some non-MDPs. In this sense, their work builds on Theorem 4 from Li et al. (2006)
which states that Q-Learning, using a policy-based abstraction (similar to what we learn), can sometimes converge to a policy that is sub-optimal in the original problem. Indeed, in experiments, we find this to not be the case – it is still an open important question as to how certain abstractions effect both the estimation error and sample complexity of different learning algorithms.
A separate but equally relevant body of literature investigates learning state representations
in the context of deep neural networks, typically for use in deep RL. For instance,
Jonschkowski and Brock (2015) proposed learning state representations through a set of well chosen ontological priors, catered toward robotics tasks, including a simplicity prior and a causality prior (among others). These priors are then encoded into an optimization problem that seeks to jointly optimize over each of their prescribed properties. They conduct experiments in a continuous grid domain similar to Puddle World, comparing the use of all priors to each one individually, showcasing a consistent performance increase in learning with all priors. (Karl et al., 2017)developed a variational Bayes method for learning a latent state-space representation of a Markov model, given high dimensional observations. Critically, this state space is of a simple Markov model, and does not involve decision making or rewards, which are critical aspects of learning state representations in MDPs
(Oh et al., 2017). For a full survey of recent state representation schemes for deep RL, see Lesort et al. (2018).Naturally, many active areas of recent literature explore transfer for RL. For instance, work on learning to learn (Thrun and Pratt, 1998), or meta RL (Finn et al., 2017), investigate how to explicitly improve sample efficiency across a collection of tasks. Recent work in meta RL has studied how to improve model-based RL (Sæmundsson et al., 2018) and how to learn a new RL algorithm explicitly (Wang et al., 2016). Other areas of research have explored transferring shaping functions (Konidaris and Barto, 2006; Konidaris et al., 2012), successor features (Barreto et al., 2017), skills (Konidaris and Barto, 2007; Da Silva et al., 2012; Brunskill and Li, 2014), hierarchies (Wilson et al., 2007; Mehta et al., 2008; Tessler et al., 2017), behavior (Taylor and Stone, 2005), and transfer for deep RL (Higgins et al., 2017; Parisotto et al., 2015; Teh et al., 2017). For a survey of transfer in RL see Taylor and Stone (2009).
In this section, we formulate an optimization problem and propose an algorithm for learning state abstractions for MDPs with continuous states. Our approach builds on a type of abstraction studied by Li et al. (2006) in the tabular setting. Specifically, the abstraction functions attempts to cluster together the states for which the optimal policy is similar and map these states to a common abstract state. Our goal will be to learn such an abstraction function on a set of training problems, and then transfer and use this representation for reinforcement learning in unseen problems.
We define a stochastic abstraction function as a function that maps from ground states
to a probability distribution over abstract states
(Singh et al., 1995). We focus on problems where is infinite, but is finite (problems with continuous state space and an abstraction that maps to a discrete one). Our goal is to find a state abstraction that enables simple learning algorithms to perform data-efficient reinforcement learning on new tasks.To learn an abstraction function, we introduce an objective that measures the probability of trajectories provided by our learned policy . The goal will be to maximize this probability if we were to use the abstraction function and a policy, , over abstract states. We formulate the optimization problem as follows:
where:
The second sum is not a function of the optimization variables, and could be dropped:
More importantly, we consider the case where our training set consists of different MDPs. In this case, we solve for a generalization of the above optimization problem, namely:
If the solution to the optimization problem is accurate enough, then states that are clustered into a single abstract state generally have a similar policy in the MDPs used for training.
Although it is possible to jointly solve this optimization problem, for simplicity we assume is fixed and provided to the learner. We parameterize the abstraction function by a vector representing the weights of the network . We use softmax activation to ensure that outputs a probability distribution.
A good setting of can be found by performing stochastic gradient ascent on the objective above, as is standard when optimizing neural networks (LeCun et al., 2015):
In our experiments we used the Adam optimizer (Kingma and Ba, 2015) which can be thought of as an advanced extension of vanilla stochastic gradient ascent.
Given a learned state abstraction, it is natural to ask if the policy over the learned abstract state space can achieve a reasonable value. In this section we answer this question positively by providing a bound on the value loss of the abstract policy relative to the demonstrator policy.
We now outline our proof at a high level. The first step shows the abstraction learning algorithm described in Section (4) has bounded policy difference on expectation under states in the training set. We then use Rademacher complexity to generalize this policy difference to any set of states drawn from the same distribution. Given the generalization bound, we use the following lemma to conclude that the abstraction has bounded value loss.
(Corollary of Lemma 2 in Abel et al. (2019)) Consider two stochastic policies, and on state space , and a fixed probability distribution over , . If, for some :
then:
To satisfy the assumption of the lemma, we first show that the objective function can be rewritten using Kullback-Leibler (KL) divergence. For the rest of this section, we assume that each cluster assigns probability 1 to the action , allowing us to simplify notation:
Now, suppose that our training procedure has a bounded KL loss formalized below:
Using Pinsker’s inequality, we have:
That is, we have a bounded policy difference. Critically, this bound is only valid given set of states seen in the training data, but to leverage Lemma 1, we need to bound the generalization error. We get this result by introducing a theorem. Lemma below is used in the theorem.
(Corollary 4 in (Maurer, 2016)) Assume a set of vector valued functions . Assume L-Lipschitz functions . Then, the following inequality holds:
With probability at least , for any :
We build on techniques provided by Bartlett and Mendelson (2002). First, note that we have:
(1) | ||||
We can bound the expected value of :
(2) |
We do so as follows:
since and are distributed similarly.
(Due to Jansen’s inequality) | ||
(Due to Lemma (2) and being 1-Lipschitz) | ||
Further, it is easy to show that satisfies
Applying MacDiarmid’s inequality, for any :
(3) |
Combining (3) with (2), we can write:
Now recall from that:
we can conclude the proof by claiming that
with probability at least . ∎
It is clear to see that, given the above bound, Lemma 1 ensures bounded value loss with probability at least .
We now present our experimental findings. At a high level, we conducted two types of experiments:
Single Task: We collected an initial data set to be used to train the state abstraction based on MDP . Then, we evaluate the performance of tabular Q-Learning on , given this state abstraction. Since each we test with has continuous state, tabular Q-Learning cannot be applied. However, we assess the performance of tabular Q-learning given . These experiments provide an initial sanity check as to whether the state abstraction can facilitate learning at all.
Multi-Task: We next considered a collection of MDPs . We collected an initial data set of tuples from a (strict) subset of MDPs in the collection. We used to construct a state abstraction , which we then gave to Q-Learning to learn on one or many of the remaining MDPs. Critically, we evaluate Q-Learning on MDPs not seen during the training of .
For each of the two experiment types, we evaluate in three different domains, Puddle World, Lunar Lander, and Cart Pole. Open-source implementation of Lunar Lander and Cart Pole are available on the web
(Brockman et al., 2016). An implementation of Puddle World is included in our code. Full parameter settings and other experimental details are available in our anonymized code, which we make freely available for reproduction and extension.^{1}^{1}1https://github.com/anonicml2019/icml_2019_state_abstractionLearning curves for the single task experiments (top) and the transfer experiments( bottom). Each line indicates the average cumulative reward received by the agent, reported with 95% confidence intervals.
Our first experiment used the Puddle World MDP (Boyan and Moore, 1995), a continuous grid world in which states are represented by two coordinates, and . The agent is initialized at (0.25, 0.6), and is tasked with getting within .0025 of the goal location, which is placed at (1.0,1.0). Two large rectangles are placed in the domain, which produce reward for any time-step in which the agent is in either rectangle. All other transitions receive 0 reward, except for transitions into the goal state, which yield 1 reward. The agent is given four actions, up, down, left, and right. Each action moves the agent .05 in the given direction with a small amount of Gaussian noise.
In the single task Puddle World experiment, we trained the abstraction based on 4000 sampled quadruples from the puddle instance pictured in Figure (a)a with only goal active, and the training policy. The samples are drawn from , the joint uniform probability distribution over and . Notably, since the domain is continuous, the learning agent will necessarily find itself in states it did not see during training time. We experiment with tabular Q-Learning paired with the abstraction, denoted Q-Learning- (green), and Q-Learning with a linear function approximator, denoted Linear-Q (blue). We set and for both algorithms. It is worth noting that since the training of takes place in the same domain that we test Q-Learning-, we should anticipate that the resulting pair learn quite well (assuming the is capable of supporting good learning at all).
Results for the single task case are presented in Figure (d)d. The figures present the cumulative reward averaged over all 25 runs of the experiment, with confidence intervals. As expected, we find that Q-Learning-
consistently learns to navigate to the goal in only a few episodes. Conversely, the linear approach fails to reliably find the goal, but does occasionally learn to avoid running into the puddle, resulting in high variance performance.
In the multi-task Puddle World experiment, we train using three out of the four possible goals (each goal defines an independent MDP), located in the four corners. The held-out goal is chosen uniformly randomly from the four possible goals. We leave one goal out of training to be used for testing. All parameters are set as in the single task case, except that the agents now learn for 250 episodes instead of 100. The abstraction learning algorithm was given a budget of 81 abstract states.
Results are presented in Figure (g)g. Notably, the only positive reward available comes from reaching the goal, so we can conclude from the low-variance positive slope of the learning curve that Q-Learning with still learns to reach the goal reliably in only a few episodes. Moreover, Q-learning with linear function approximation fails to ever reliably find the goal.
We next experiment with Lunar Lander, pictured in Figure (b)b. The agent controls the purple ship by applying thrusters on exactly one of the left, right, bottom, or no sides of the ship. The state is characterized by 8 state variables, 2 dimensional position and velocity information, angle and angular velocity, and two boolean flags that are active only if the corresponding leg touches a surface. The agent receives +100 reward when it lands or -100 when it crashes, and an additional +10 reward for each leg of the ship touching the ground. Critically, the agent receives -.3 reward every timestep it uses a thruster for each of the three thrusters available. The default reward signal in gym implementation includes a shaped reward that gives the agent additional positive reward the closer it gets to the landing zone, positioned at (0,0) on the screen (the flat area with the flags). In each run, the mountainous terrain to the left and right of the landing zone changes.
In the single task case, we train the state abstraction based on 10,000 quadruples, sampled according to the training policy (which reliably either lands or gets high shaped reward). We then give the resulting abstraction (which maps from the continuous 8-dimensional states to a discrete state space), to tabular Q-Learning and evaluate its learning performance.
Results are presented in Figure (e)e. Notably, again, Q-Learning with learns a policy that reliably lands the ship in around 50 episodes. In contrast, the linear Q-Learning agent fails to ever learn a reasonable policy even after collecting 8 times the data. The results suggest that again, is capable of supporting sample efficient learning of a high value policy.
In the transfer case, we train similarly to the single task case. Then, we define a new instance of the Lunar problem with no shaped reward. In this new, non-shaped variant, the agent receives for every use of a thruster, for crashing, for landing safely, and anytime a leg touches the ground. This variant of Lunar is extremely challenging, and to our knowledge, no known algorithm has solved this problem from scratch.
Learning curves are presented in Figure (h)h, and a plot of the successful landing rate over time is also provided in Figure (a)a. Here we find the strongest support for the usefulness of : Q-Learning- learns to land the ship more than half of the time after around 250 episodes. By the end of learning (500 episodes), Q-Learning with has learned to land the ship four out of every five trials.
Cart Pole is a classical control problem first studied by Widrow and Smith (1964) in which an agent must learn to balance a pole upright by moving a cart at the base of the pole along a horizontal track. The state of the MDP consists of four variables: , denoting the position, velocity, angle, and angular velocity respectively. The agent is given only two actions: move left and move right. The reward signal is +1 when the pole is balanced (, and otherwise -10. When the pole moves outside of the positive reward region, the agent receives -10 reward and the MDP terminates.
We train based on a data set of 1000 quadruples collected from the training policy, with a learning rate of . Then, we give Q-Learning to learn on the same MDP for 50 episodes, with a max of 200 steps per episode. We repeat the experiment 20 times and report average learning curves and 95% confidence intervals.
Results are presented in Figure (f)f. Here we find that LinearQ is able to efficiently learn a high value policy, along with Q-Learning-. From the set of single task experiments, we conclude that is at least sufficient to support tabular learning in continuous state MDPs in which was trained.
In the multi-task case, we train with the same parameters as in the single task experiment. Then, we build a collection of 5 test MDPs where we change gravity from to each of 5.0, 6.0, 8.0, and 12.0 . We then evaluate Q-Learning- on its average performance across each of these 5 test MDPs. When learning begins, we sample one MDP from the test set and let the agent interact with the sampled MDP for 200 episodes. After the last episode, the agent samples a new MDP uniformly from the same set of 5 and learns, repeating this process for 20 rounds.
Results are presented in Figure (i)i. The learning curve is very abrupt; after only a handful of episodes, both learning algorithms again reliably learns to balance the pole, regardless of how much gravity has changed from the original problem. Here we find support for the use of a linear approximator in some settings, but naturally, as the environment becomes complex, the linear approach fails (as in the puddle and lunar experiments)
As a final experiment, we explore the effect of the size of the training data set used to train on the performance of the downstream RL task that will be used for. Our goal is to investigate how many samples are sufficient for the learned state abstraction to reliably produce good behavior in practice. In the experiment, we trained for up to samples (by increments of 500) to convergence. We then gave the resulting to Q-Learning and let it learn in Puddle World with all parameters set as in the single task Puddle experiment, and report the cumulative reward of Q-Learning’s final episode.
Results are presented in Figure (b)b. As expected, when the data set used to train only contains a single point, RL performance decays substantially, with Q-Learning receiving 0 reward on average. However, somewhat surprisingly, with only 501 data points, the computed is sufficient for Q-Learning to obtain an average cumulative reward of 0.5. Such an average score indicates that, in about half of the runs, Q-Learning- reliably found the goal, and, in the other half, it avoided the puddle.
We introduced an algorithm for state abstraction and showed that transferring the learned abstraction to new problems can enable simple RL algorithms to be effective in continuous state MDPs. We further studied our abstraction learning process through the lens of statistical learning theory and Rademacher complexity, leading to the result that our learning algorithm can represent high value abstract policies. We believe that this work takes an important step towards the problem of automatic abstraction learning in RL.
One avenue for future research extends our approach to continuous action spaces; in this case, we can no longer leverage the fact that the number of actions are finite. Moreover, learning with continuous state and action is notoriously difficult and ripe for abstraction to make a big impact. Additionally, there are open questions on learning abstractions in a lifelong setting, in which agents continually refine their abstractions after seeing new problems. Lastly, it is unknown as to how different kinds of abstractions effect the sample complexity of RL (Kakade, 2003). Thus, we foresee connections between continuous state exploration (Pazis and Parr, 2013) and learning abstractions that can lower the sample complexity of RL.
Proceedings of the AAAI Conference on Artificial Intelligence
, Cited by: Lemma 1.Proceedings of the International Conference on Machine Learning
, Cited by: §1, §3.Deep variational Bayes filters: unsupervised learning of state space models from raw data
. Proceedings of the International Conference on Learning Representations. Cited by: §3.