Two geometric input transformation methods for fast online reinforcement learning with neural nets

05/18/2018 · Sina Ghiassian et al. · University of Alberta

We apply neural nets with ReLU gates in online reinforcement learning. Our goal is to train these networks incrementally, without the computationally expensive experience replay. By studying how individual neural nodes behave in online training, we recognize that the global nature of ReLU gates can cause undesirable learning interference in each node's learning behavior. We propose reducing such interference with two efficient input transformation methods that are geometric in nature and match well the geometric property of ReLU gates. The first is tile coding, a classic binary encoding scheme originally designed for local generalization based on the topological structure of the input space. The second (EmECS) is a new method we introduce; it is based on geometric properties of convex sets and topological embedding of the input space into the boundary of a convex set. We discuss the behavior of the network when it operates on the transformed inputs. We also compare it experimentally with neural nets that do not use these input transformations, and with the classic algorithm of tile coding plus a linear function approximator; on several online reinforcement learning tasks, we show that a neural net with tile coding or EmECS achieves not only faster learning but also more accurate approximations. Our results strongly suggest that geometric input transformations of this type can be effective for interference reduction and take us a step closer to fully incremental reinforcement learning with neural nets.


1 Introduction

Reinforcement learning systems must use function approximation in order to solve complicated real-world problems. Neural nets provide an effective architecture for nonlinear function approximation [Cybenko1989], and their ability to adapt through a data-driven training process makes them powerful general function approximators for self-learning systems. Neural nets have been used since the early days of reinforcement learning [Miller, Sutton, and Werbos1990, Tesauro1995]; reinvigorated now by the advances in deep learning, they are the driving force behind most recent progress toward large-scale reinforcement learning [Mnih et al.2015, Silver et al.2016, Wang et al.2018].

On the other hand, it was also known early on that as neural nets generalize globally in training, they have a weakness: when the network gains new experience, it tends to “forget” what it has learned in the past, a phenomenon known as “catastrophic interference” [McCloskey and Cohen1989, French1999]. Indeed, since in a neural net a function is implicitly represented by the weights on the network connections, changes in a few weights during training could result in global changes of the function. In this paper we are interested in catastrophic interference as it arises in online reinforcement learning.

Solution methods have been proposed to address the catastrophic interference issue in the supervised learning context [French1999, Lee et al.2017]; however, these methods are not readily available for use in the reinforcement learning framework. Most if not all of these techniques focus on multi-task supervised learning; they are specific to the transfer learning context and are not applicable to the online reinforcement learning context with rapidly changing policies. The main reason behind this limitation is that interference reduction techniques proposed for supervised learning settings rely on the fact that the learning agent can successfully learn to make progress on a single task. As we will show later in the paper, due to severe interference, systems that use neural nets as function approximators can fail to make progress even on a single task in online reinforcement learning. Failing to solve a single task successfully, it is then out of the question to apply multi-task transfer-learning techniques to the reinforcement learning framework. An interesting research direction is to adapt these supervised learning techniques to the reinforcement learning context; however, this is not what we pursue in this work. Instead, we propose novel methods that are suitable for online reinforcement learning.

The main strategy to work around the catastrophic interference problem in reinforcement learning has been to train neural networks offline with batch data and experience replay buffers [Mnih et al.2015, Lin1993]. Experience replay uses, at each training step, a rich sample of past and newly gathered data to update the neural network weights. This strategy seems to work in practice, but it requires a lot of memory and computation and slows down training. Moreover, experience replay avoids interference at the cost of losing the advantages of online updating, which is one of the important characteristics of the reinforcement learning framework [Sutton and Barto2018].

In this work, we explore fully incremental online alternatives to experience replay for mitigating the catastrophic interference problem. To begin with, we consider a two-layer network with one hidden layer and focus on the behavior of individual nodes that use the popular ReLU gates. We recognize that with the ReLU activation function, a neural node has to respond linearly to inputs from an entire half-space, and this global nature of ReLU gates can cause undesirable learning interference in each node's learning behavior. This observation led us to propose reducing such interference with two input transformation methods. Both methods are geometric in nature, and as we will show, their geometric properties match well the geometric property of ReLU gates. As we will discuss in more detail later in the paper, both methods enable the neural nodes to respond to a local neighbourhood of their input space. This can help neural networks generalize more locally and prevent interference. While input transformation is one major approach to address the interference problem in neural nets [French1999], the two geometric methods we study in this paper have not, to our knowledge, been considered before.

The first method is tile coding [Albus1975], a classic binary encoding scheme that captures the topological structure of the input space in the codes and can help promote local generalization. We refer to the combination of tile coding with neural nets as TC-NN. We will show that compared to neural nets operating on raw inputs, TC-NN generalizes more locally, has less interference, and learns much faster. We will also show that TC-NN has advantages over the classic approach of combining tile coding with a linear function approximator (TC-Lin), especially for high-dimensional problems, in terms of function approximation capability.

The second method (EmECS) is a new method we introduce. It is based on topological embedding of the input space and geometric properties of convex sets. The idea is to embed the input space in the set of extreme points of a closed convex set, so that although with ReLU, a neural node must always respond linearly to all points from an entire half-space of the transformed input space, with respect to the original input space, it can respond only to the inputs from a small local region, thus reducing its learning interference. As we will show, EmECS can be implemented easily and efficiently, and it differs from other high-dimensional representations in that (i) it does not increase the dimensionality of the inputs by much (indeed it can work with just one extra dimension), and (ii) it can be applied on top of any pre-extracted features that are suitable for a given task. As we will also show, EmECS shares some similarities with coarse coding [Hinton, McClelland, and Rumelhart1986], of which tile coding is a special case, despite their being seemingly unrelated. Our experimental results show that with EmECS, neural nets can perform as well as TC-NN, achieving both fast learning and accurate approximations.

The rest of this paper is organized as follows. We first provide the background on nonlinear TD(λ) and Sarsa(λ). We then discuss the TC-NN and EmECS methods. We present experimental results before ending the paper with a discussion of future work. A few supporting results and detailed discussions are collected in the appendices.

2 Background: Nonlinear TD(λ) and nonlinear Sarsa(λ)

In this paper we use the TD(λ) and Sarsa(λ) methods [Sutton1988, Rummery and Niranjan1994] for solving prediction and control problems respectively. The prediction problem is that of learning the value function of a given stationary policy in a standard Markov Decision Process (MDP) with discounted or total reward criteria [Puterman1994]. Specifically, an agent interacts with an environment at discrete time steps $t = 0, 1, 2, \ldots$. If at time $t$ the agent is in state $S_t$ and selects an action $A_t$, the environment emits a reward $R_{t+1}$ and takes the agent to the next state $S_{t+1}$ according to certain probabilities that depend only on the values of $(S_t, A_t)$. We consider problems where the action space $\mathcal{A}$ is finite, and the state space $\mathcal{S}$ is either finite or a bounded subset of a Euclidean space—the problems in our experiments have continuous state spaces. A stationary policy is represented by a function $\pi(a \mid s)$, which specifies the probabilities of taking each action $a$ at a state $s$ in $\mathcal{S}$. The value function of the policy is defined by the expected sum of the discounted future rewards (or simply the expected return), $v_\pi(s) = \mathbb{E}_\pi\big[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \mid S_t = s\big]$, for all $s \in \mathcal{S}$, where $\gamma \in [0, 1]$ is the discount factor and $\mathbb{E}_\pi$ denotes taking expectation under policy $\pi$. The prediction problem for the agent is to estimate $v_\pi$. In the control problem, policies are not fixed and, through interactions with the environment, the agent needs to find an optimal policy that maximizes the expected return.

For prediction problems, we apply TD(λ) with nonlinear function approximation to update the weights $\theta$ of a neural network using a small step size $\alpha$ according to

$$\theta_{t+1} = \theta_t + \alpha\, \delta_t\, e_t. \tag{1}$$

Here $\delta_t$ is the temporal-difference error and $e_t$ the eligibility trace vector, calculated iteratively as

$$\delta_t = R_{t+1} + \gamma\, \hat{v}(S_{t+1}, \theta_t) - \hat{v}(S_t, \theta_t), \tag{2}$$
$$e_t = \gamma \lambda\, e_{t-1} + \nabla_\theta\, \hat{v}(S_t, \theta_t), \tag{3}$$

where $\hat{v}(s, \theta)$ represents the approximate value for state $s$ produced by the neural net with weights $\theta$, and $\nabla_\theta\, \hat{v}(S_t, \theta_t)$ denotes the gradient of the function at $(S_t, \theta_t)$.
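As an illustration, the following is a minimal PyTorch sketch of one fully incremental TD(λ) update implementing (1)–(3); the network sizes, step size, and function names are placeholders of our own, not the settings used in the experiments.

```python
import torch

# Minimal sketch of online TD(lambda), Eqs. (1)-(3), for a one-hidden-layer
# ReLU network. All sizes and constants below are illustrative placeholders.
net = torch.nn.Sequential(
    torch.nn.Linear(2, 100), torch.nn.ReLU(), torch.nn.Linear(100, 1))
alpha, gamma, lam = 0.01, 1.0, 0.9
traces = [torch.zeros_like(p) for p in net.parameters()]  # e_t; reset per episode

def td_lambda_step(s, r, s_next, done):
    """One online update for the transition (s, r, s_next)."""
    v = net(s).squeeze()                         # v_hat(S_t, theta_t)
    with torch.no_grad():
        v_next = 0.0 if done else net(s_next).item()
        delta = r + gamma * v_next - v.item()    # TD error, Eq. (2)
    net.zero_grad()
    v.backward()                                 # gradient of v_hat(S_t, theta_t)
    with torch.no_grad():
        for p, e in zip(net.parameters(), traces):
            e.mul_(gamma * lam).add_(p.grad)     # trace update, Eq. (3)
            p.add_(alpha * delta * e)            # weight update, Eq. (1)
```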

Sarsa(λ) is the control variant of TD(λ). When it was first proposed, it actually used neural networks as function approximators [Rummery and Niranjan1994]. Its update rules are similar to those of TD(λ) except that $\hat{v}(S_t, \theta_t)$ is replaced by $\hat{q}(S_t, A_t, \theta_t)$, the approximate value for the state-action pair produced by the neural net. The action at each time step is typically chosen in an ε-greedy way with respect to the current approximating function.

The network structures we used for prediction and control are different. For prediction, the network receives, as input, the state (or a representation of it) and outputs the approximate value for that state. For control, the input stays the same, but the network outputs multiple values, one for each action, to approximate the state-action values at that state. All neural nets in this work have a single hidden layer that uses ReLU gates and a linear output layer that does not use any gate function.
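For the control architecture just described, a minimal sketch of ε-greedy action selection over the network's multiple action-value outputs might look as follows; the input size, hidden size, and the value of ε are illustrative assumptions.

```python
import torch

# Sketch of the control architecture: one network, one output per action.
n_actions = 3
q_net = torch.nn.Sequential(
    torch.nn.Linear(2, 100), torch.nn.ReLU(), torch.nn.Linear(100, n_actions))

def epsilon_greedy(s, epsilon=0.1):       # epsilon=0.1 is a placeholder value
    """Pick a random action with probability epsilon, else the greedy one."""
    if torch.rand(1).item() < epsilon:
        return torch.randint(n_actions, (1,)).item()
    with torch.no_grad():
        return int(torch.argmax(q_net(s)))
```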

3 Tile coding plus neural networks: TC-NN

Figure 1: A continuous 2D space with one tiling over it is shown on the left. Three overlapping tilings on the 2D continuous space are shown in the middle in blue, green, and red. The generalization region for a sample point is shown on the right.

Tile coding is a form of coarse coding, in which we cover the state space with overlapping sets and encode a state $s$ by a binary string whose 1-bits indicate which sets contain $s$. These overlapping sets capture, in a coarse way, the topological structure of the state space (i.e., which points are close together and which regions are connected to each other), and the encoding carries this structural information. In tile coding the overlapping sets are hyper-rectangles; Figure 1 illustrates a simple encoding for a 2D space. The states are thus mapped to vertices of a unit cube in a higher dimensional space. Tile coding is well-suited when the physical states occupy only a small portion of the input space, and also when the state space is non-Euclidean and has a natural product structure, as in many robotics applications. For example, in Acrobot, two angular control parameters lie on a Cartesian product of two circles (a torus) and can be tile-coded efficiently.
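As an illustration, here is a minimal sketch of a tile coder for a 2D input in $[0, 1]^2$ with uniformly offset tilings; it is a toy simplification of our own, not the tile coding software used in the experiments.

```python
import numpy as np

def tile_code(x, n_tilings=3, n_tiles=4):
    """Binary features for x in [0, 1]^2: one active bit per tiling."""
    features = np.zeros(n_tilings * n_tiles * n_tiles)
    for t in range(n_tilings):
        offset = t / (n_tilings * n_tiles)   # shift each tiling slightly
        ix = min(int((x[0] + offset) * n_tiles), n_tiles - 1)
        iy = min(int((x[1] + offset) * n_tiles), n_tiles - 1)
        features[t * n_tiles * n_tiles + ix * n_tiles + iy] = 1.0
    return features
```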

Tile coding was invented by Albus [Albus1975, Albus1981]. It is the key component of his CMAC computation architecture, which is, in fact, tile coding plus a linear function approximator (TC-Lin). The nonlinear input map provided by the encoding was designed to facilitate local generalization: the result of training at a particular state generalizes locally to the “neighborhood” of that state as defined by the union of those sets that contain the state (cf. Figure 1). CMAC has been applied in control and robotics and is known, among brain-inspired computational models, as a different type of neural network, an alternative to the globally generalizing, backpropagation-type neural net [Miller, Glanz, and Kraft1990, Balakrishnan and Weil1996]. In reinforcement learning, Lin and Kim [Lin and Kim1991] proposed CMAC/TC-Lin for TD(λ). Tham [Tham1994] used it with a variety of online reinforcement learning algorithms, including Q-learning and Sarsa(λ), to solve complex robotics problems. Other successful examples of using TC-Lin with Sarsa(λ) were also shown by Sutton [Sutton1996]. (See the textbook [Sutton and Barto2018, Chapter 9.5] for an excellent introduction to tile coding and its applications in reinforcement learning.) Given the rich history of TC-Lin, our proposal to combine tile coding with a neural net may seem unorthodox at first sight. Let us now explain the merits of this TC-NN combination, as well as its differences from TC-Lin, from several perspectives.

It is true that a neural net tends to generalize globally, so in TC-NN each neural node tends to respond to a much larger area of the state space than an ideal local neighborhood as in TC-Lin. However, tile coding gives each node the ability to pick the size and shape of its activation region with respect to the original state space (see Appendix D for examples of activation regions of TC-NN from an experiment). In contrast, if the neural net works on the state space directly, every node has to respond to an entire half-space linearly. This causes interference and can slow down learning considerably, as we observed in the Acrobot problem (Figure 2). Sometimes the interference can be so severe that it prevents the network from learning at all (see Appendix B for some failure examples of neural nets with raw state inputs and with RBF features).

Figure 2: Learning curves for TC-Lin and TC-NN with the joint tile coding scheme (TCj-Lin and TCj-NN), and neural nets with raw inputs (NN). The mean and standard deviation of 30 runs are shown for each curve. TCj-NN was fast and converged to a lower final performance when λ was small. As λ got larger, different methods performed more similarly.

An advantage that TC-NN has over TC-Lin is in function approximation capability. This becomes critical as the dimensionality of the state space $\mathcal{S}$ increases. To cope with the curse of dimensionality, when $\mathcal{S}$ has a natural Cartesian product structure, one can tile-code separately each component in the product. This encoding captures the same information as tile-coding all the dimensions of $\mathcal{S}$ jointly, but is much more efficient, since the resulting code length then scales linearly with the dimensionality of $\mathcal{S}$. However, with a linear function approximator, the encoding is also tied to how TC-Lin generalizes during training and what functions TC-Lin can approximate. As the result of these strong ties, if we tile-code each dimension separately: (i) the generalization of TC-Lin becomes global, and (ii) the set of functions TC-Lin can approximate becomes limited, since it can represent only functions that are sums of functions of each component. In contrast, for TC-NN, if we use the separate tile coding scheme: (i) the neural net still has the freedom to choose regions of generalization as before, and these regions need not be as global as those in TC-Lin, and (ii) the set of functions that the neural net can approximate remains the same. The latter is because with either the separate or joint tile coding scheme, the states are mapped to vertices of a hyper-unit-cube with the same granularity, and the neural net can separate each vertex from the rest of the vertices by using a single hidden node (with ReLU) and assign a specific value to that vertex. We will show experimental results that confirm this advantage of TC-NN in the experimental results section (cf. Figure 6) and in Appendix D (cf. Figure 16), where we will also discuss this subject in a more intuitive manner.
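To make the last claim concrete, the following small check (our own construction, offered as an illustration) verifies that a single ReLU node can isolate any given vertex of the hypercube: with weights $w = 2z^\ast - 1$ and bias $b = 1 - \sum_i z^\ast_i$, the node outputs 1 at the vertex $z^\ast$ and 0 at every other vertex.

```python
import itertools
import numpy as np

d = 4
z_star = np.array([1, 0, 1, 1])            # the vertex to isolate
w, b = 2 * z_star - 1, 1.0 - z_star.sum()  # +1 on the 1-bits, -1 on the 0-bits
for z in itertools.product([0, 1], repeat=d):
    out = max(0.0, float(w @ np.array(z) + b))   # ReLU response
    # Every flipped coordinate reduces w.z by 1, so out = max(0, 1 - #flips).
    assert out == (1.0 if np.array_equal(z, z_star) else 0.0)
```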

4 Embedding into Extreme Points of a Convex Set (EmECS)

We now introduce a new input transformation method, EmECS, for reducing the learning interference of individual neural nodes with ReLU gates. This method is based on two geometric properties:

  • With ReLU, the activation region of a neural node is the open half-space

    $$\{x : \langle w, x \rangle + b > 0\} \tag{4}$$

    that corresponds to the hyperplane $\{x : \langle w, x \rangle + b = 0\}$, where $w$ is the vector of weights and $b$ the scalar bias term associated with the node.

  • For a closed convex set $C$, consider a point $\bar{x} \in C$ and the neighborhoods of $\bar{x}$ relative to $C$ (i.e., the intersections of its neighborhoods with $C$). If $\bar{x}$ is an extreme point of $C$, there is a hyperplane whose open half-space (4) contains only an (arbitrarily) small neighborhood of $\bar{x}$ relative to $C$.¹

¹ An extreme point of a convex set $C$ is one that cannot be expressed as a convex combination of other points of $C$. For a closed convex set $C$, by Straszewicz’s Theorem [Rockafellar1970, Theorem 18.6], every extreme point is the limit of some sequence of exposed points, where an exposed point of $C$ is a point through which there is a supporting hyperplane that contains no other points of $C$ [Rockafellar1970, Section 18, p. 163]. This means that for an exposed point $x$, there is a linear function achieving its maximum over $C$ uniquely at $x$. Consequently, for any extreme point $\bar{x}$, we can choose an exposed point sufficiently close to $\bar{x}$ so that, for some linear function with the property just mentioned and for some small $\epsilon > 0$, the half-space on which that linear function comes within $\epsilon$ of its maximum over $C$ contains only a small neighborhood (relative to $C$) of $\bar{x}$. As this neighborhood consists of the $\epsilon$-optimal solutions of maximizing the linear function over $C$, it can be made arbitrarily small by the choices of the exposed point and $\epsilon$. The corresponding linear function then gives the hyperplane with the desired property, thus proving our claim. The left part of Figure 3 illustrates this property of an extreme point.

If $X$ is the original input space of the neural net ($X$ can be the state space of the problem or the space of any pre-extracted features of states), our method is to embed $X$ in the set of extreme points of a closed convex set in a higher dimensional Euclidean space, and let the neural net work with the transformed inputs instead of the original inputs.

Here, by embedding, we mean a one-to-one continuous map $\phi$ whose inverse (defined on its image) is also continuous. Such a map is called a topological (or homeomorphic) embedding because it preserves topological properties [Engelking1989]. For example, if the states lie on a manifold in $X$ (say, a torus), their images under $\phi$ lie on a topologically equivalent manifold (thus also a torus), and if the states form two disconnected sets in $X$, so do their images under $\phi$. By combining this topology-preserving property of an embedding with the geometric properties of convex sets discussed earlier, we obtain the following. If we choose a closed convex set $C$ whose boundary points are all extreme points, and if we embed $X$ into the boundary of $C$, then a neural node with a ReLU gate, when applied to the transformed inputs, becomes capable of responding to only a small neighborhood of any given point in the original input space (cf. Figure 3). This explains the mechanism of our EmECS method: it enables each neural node to work locally, despite the global nature of ReLU.

Of course, having the ability to generalize locally at each node does not mean that the network always allocates a small region to each node or different regions to different nodes—indeed it is hard for such coordination between nodes to emerge automatically during training. Nonetheless, our experiments showed that EmECS can considerably improve neural nets’ learning performance.

Figure 3: On the left is a cross-section view of a hyperplane cutting out a small part of a closed convex set $C$ around an extreme point (black dot). This point corresponds to the point in the original input space $X$ shown on the right. The entire half-space above the hyperplane is the activation region of a neural node in the transformed input space. But only those boundary points of $C$ above the hyperplane (indicated by the thick black line in this cross-section view) correspond to real inputs in $X$, which form a small neighborhood (shaded area) around the point (black dot). The neural node thus responds only to inputs from that neighborhood in $X$.

We can implement EmECS efficiently. Below are a few simple examples of the embedding; in our experiments we have used (a) and (c) (which give the LPj-NN and LPs-NN algorithms in the experimental results section).

Example 1 (Some instances of maps for EmECS)

Suppose $X \subset \mathbb{R}^n$.

  • Map $X$ into an $n$-sphere of radius $r$ in $\mathbb{R}^{n+1}$ by first “lifting” the set along the $(n+1)$-th dimension and then projecting it on the sphere. Specifically, for $x \in X$, let

    $$\phi(x) = \frac{r}{\|(x, h)\|}\,(x, h), \tag{5}$$

    where $h > 0$ is a fixed height of the lift. We shall refer to this type of map as lift-and-project (LP for short).

  • Let $f$ be a continuous, strictly convex function on $\mathbb{R}^n$ (e.g., $f(x) = \|x\|^2$). Map $x$ to $(x, f(x))$. This embeds $X$ into the graph of the function $f$, and the closed convex set here is the epigraph of $f$: $\{(x, y) : x \in \mathbb{R}^n,\ y \geq f(x)\}$.

  • If $X = X_1 \times \cdots \times X_k$ where each $X_i \subset \mathbb{R}^{n_i}$, we can separately embed each $X_i$ in $\mathbb{R}^{n_i + 1}$, with a map $\phi_i$ of the form given in (a)-(b), for instance. The result is the embedding of $X$ in $\mathbb{R}^{n+k}$ given by $\phi(x) = (\phi_1(x_1), \ldots, \phi_k(x_k))$. The range of $\phi$ is a subset of extreme points of the closed convex set $C_1 \times \cdots \times C_k$, where $C_i$ is the convex set associated with the embedding $\phi_i$. Sometimes, a component space already contains the desired embedding of the state components (e.g., when the latter lie on a circle or sphere in $\mathbb{R}^{n_i}$). Then we do not need to embed $X_i$ any more and can simply take $\phi_i$ above to be the identity map.
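The following is a minimal sketch of the lift-and-project map (5) and the separate embedding scheme of (c); the default lift height h and radius r below are free parameters of our own choosing.

```python
import numpy as np

def lift_and_project(x, r=1.0, h=1.0):
    """Example 1(a): embed x in R^n into the n-sphere of radius r in R^{n+1}."""
    lifted = np.append(x, h)                      # lift along the extra dimension
    return r * lifted / np.linalg.norm(lifted)    # project onto the sphere

def lp_separate(x, dims, r=1.0, h=1.0):
    """Example 1(c): apply lift-and-project to each component space of x."""
    parts, i = [], 0
    for d in dims:                                # dims = sizes of X_1, ..., X_k
        parts.append(lift_and_project(x[i:i + d], r, h))
        i += d
    return np.concatenate(parts)                  # adds one dimension per component
```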

EmECS shares some similarities with coarse coding [Hinton, McClelland, and Rumelhart1986] despite their being seemingly unrelated. With EmECS, the activation regions of individual nodes, viewed in the original input space, resemble the receptive fields (i.e., the overlapping sets) in coarse coding. Like the latter, each activation region is connected (thanks to the embedding property), provided the hyperplane associated with the node has its normal vector point in the right direction in the transformed input space. For instance, for the lift-and-project map in Example 1(a), it suffices that the normal vector points “upwards” with respect to the extra $(n+1)$-th dimension (in our experiments we always initialize the network weights in this way). In coarse coding, receptive fields can have different sizes and shapes, but they are chosen before learning takes place. With EmECS, the activation regions of neural nodes change their locations and sizes dynamically during learning. The shapes of these regions depend partly on the embedding, so by choosing the embedding, we can have some influence over them, as in coarse coding. For example, the separate embedding scheme in Example 1(c) gives the network more freedom to produce activation regions that are asymmetric, wider in some dimensions and narrower in others, whereas a joint embedding scheme like Example 1(a) can be used to enforce symmetric or specific asymmetric shapes. (See Appendix E for illustrations and a more detailed discussion.)

Let us now compare EmECS and TC-NN. Both map the original inputs to the extreme points of a convex set—in the case of tile coding, the convex set is a hypercube and the extreme points are its vertices—and both use the topological structure of the original space to do so. A difference between them is that for EmECS the input transformation is an embedding, whereas for tile coding it is not. As a consequence, in TC-NN, an activation region of a neural node, viewed in the original input space, can (and usually does, as observed in our experiments) contain multiple disconnected components. This suggests that one may be able to further improve the performance of TC-NN by initializing the neural net in a certain way or by monitoring and “pruning” the activation regions of its nodes during training. Another difference between the two methods is in computational complexity. For TC-NN, suppose each dimension of the inputs can be tile-coded separately; then the dimensionality of the transformed inputs will still depend on the size of the original input space along each dimension. In contrast, EmECS only increases the dimensionality of the inputs by the number of component spaces that are embedded separately (cf. Example 1). So, given the same number of hidden-layer nodes, the neural net with EmECS has far fewer parameters than the TC-NN network.

5 Experimental results

In this section, we show experimentally that our proposed methods are fast and accurate. We compare them with two existing online methods: tile coding plus linear function approximation (TC-Lin) and neural networks with raw inputs (NN). We do not compare our methods to experience replay, as it is not a fully incremental online method. Our proposed methods are tile coding plus neural networks (TC-NN) and EmECS plus neural networks; from EmECS, we used the lift-and-project scheme plus neural networks (LP-NN). We append the letters j and s to TC and LP (e.g., TCj-NN or TCs-NN) to indicate whether the dimensions of the input are transformed in a joint or a separate fashion. LPj-NN uses the lift-and-project map in Example 1(a), and LPs-NN uses the separate embedding scheme in Example 1(c) with each component map being a lift-and-project map. We first use three small problems to compare TCj-NN, LPj-NN, NN, and TCj-Lin: Mountain Car prediction, Mountain Car control, and Acrobot control. All problems are on-policy, undiscounted, and episodic. We perform another set of experiments on the Mountain Car prediction problem to study the effects of transforming the dimensions of the input jointly or separately. Finally, we assess the practicality of our methods in higher dimensions by applying TCs-NN and LPs-NN to a real-world robot problem in an off-policy continuing setting. We also present (in the appendix on learning interference in online training of neural nets) the results of using RBF kernels to transform the input space, and show that not every input transformation method that uses neighborhood information or creates sparse features is effective. Implementation details of these experiments are given in Appendix A.

Figure 4: Learning curve (left) and parameter study (right) using TD(0). The top and bottom axes in the parameter study show the step size values for linear function approximation and neural nets respectively. TCj-Lin converged to a higher asymptotic performance compared to other methods. Neural networks alone (without any input transformation) slowly converged to a good final performance. The proposed methods were fast and converged to a low final performance.
Figure 5: Results on two control problems with λ = 0. The proposed methods were fast and converged to a good asymptotic performance on both problems. Neural nets alone (without any input transformation) were unable to solve the Mountain Car problem due to catastrophic interference.

The first testbed was Mountain Car in the prediction setting, involving the evaluation of a fixed policy. The policy was to push in the direction of the velocity. The neural nets used with the different input transformation methods had different numbers of inputs and hidden units. NN had 2 raw inputs: position and velocity. LPj-NN had 3 inputs: position, velocity, and its extra dimension. TCj-NN used a feature size of 80 (see Appendix A for details of tile coding and why the number of features is 80). TCj-Lin had the same number of features as TCj-NN. In this problem we carefully chose the number of hidden units to make sure all methods had almost the same number of weights in the neural net. We gave NN and LPj-NN 135 and 100 hidden units to create networks with a total of 405 and 400 weights respectively. We gave TCj-NN only 5 hidden units, which resulted in a network with 405 weights.

We ran each method under each parameter setting for 30 independent times (30 runs). Each run had 2000 episodes. We then averaged over runs to create learning curves. We also performed a parameter study over different step sizes: for each step size and each run, we computed an average over the last 5% of episodes, which produced 30 numbers – one for each run. We then computed the mean and standard error over the resulting numbers. We used our parameter study results to choose the value of the step size for the learning curves we plotted. For all methods, we chose the largest step size (and thus fastest convergence) for which the final performance was close to the best final performance of that method. We used an estimate of the root mean square value error as the error measure:

$$\sqrt{\frac{1}{|\mathcal{S}'|} \sum_{s \in \mathcal{S}'} \big(\hat{v}(s, \theta) - v_\pi(s)\big)^2}.$$

Here $\mathcal{S}'$ is a set of states formed by following $\pi$ to termination, restarting the episode, and following $\pi$ again. This was done for 10,000,000 steps, and we then sampled 500 states from the 10,000,000 states randomly. The true value $v_\pi(s)$ was simply calculated for each $s \in \mathcal{S}'$ by following $\pi$ once to the end of the episode.

Results on the Mountain Car prediction problem (Figure 4) show that NN reached a good final approximation of the value function; however, it was slow. TCj-Lin was fast, but it could not approximate the value function as accurately as the other methods. TCj-NN and LPj-NN were both fast and approximated the value function accurately. LPj-NN made the most accurate approximation.

The second testbed was Mountain Car in the control setting. We used ε-greedy Sarsa(λ) with λ = 0. The performance measure was the number of steps per episode, which is equal to the negative return. We did 30 runs. Each run had 500 episodes. NN, TCj-NN, and LPj-NN all had 800 hidden units. We used the same tile coding scheme as in the Mountain Car prediction problem.

All methods except NN learned to reach the goal successfully. NN could not solve the task with raw inputs. TCj-NN was the fastest method to achieve its best final performance, LPj-NN came second, and TCj-Lin was the slowest (see Figure 5(a)).

The third testbed was Acrobot in the control setting. We used ε-greedy Sarsa(λ) with λ = 0. The performance measure was similar to that of Mountain Car control. TCj-Lin and TCj-NN had a feature size of 256 (see Appendix A for tile coding details). TCj-NN used 4000 hidden units. NN and LPj-NN had 2000 hidden units. LPj-NN fed the neural net with 5 inputs (the original 4 dimensions plus one extra dimension). We did 30 runs. Each run had 500 episodes.

The best final performance was achieved by TCj-NN, followed by LPj-NN, then NN, and then TCj-Lin. Speed-wise, LPj-NN and TCj-NN were the fastest methods. Figure 5(b) summarizes the results.

Figure 6: Parameter studies comparing separate and joint input transformation. Transforming the input dimensions jointly or separately did not affect the performance of the proposed methods; however, it affected the performance of tile coding plus linear function approximation.

We also did a parameter study on the Mountain Car prediction problem to study the effects of transforming the dimensions of the input jointly and separately. The final performances of LPs-NN and LPj-NN were similar, and so were the final performances of TCj-NN and TCs-NN. However, the final performance of TCs-Lin was worse than that of TCj-Lin. This confirms one of our claims from Section 3: tile coding the input dimensions separately (vs. jointly) does not pose generalization restrictions (and does not affect the final performance) when combined with neural nets. However, it does pose restrictions (and affects the final performance) if combined with linear function approximation. See Figure 6 for these results for two values of λ, and see Appendix D for results on the other values of λ.

As a starting point for working in higher dimensions, we applied our methods to a real-world robot task in the off-policy setting. In this problem, a Kobuki robot wanders in a pen, learning to predict when it will bump into something if it goes forward. More details about the reward and policies can be found in Appendix A. The sensory information available to the robot to learn this task consisted of 50 RGB pixels from its camera, represented as a vector of size 150. We did 30 runs of 12000 time steps and used an estimate of the root mean square return error as the performance measure:

$$\sqrt{\frac{1}{|\mathcal{D}|} \sum_{(s, G) \in \mathcal{D}} \big(\hat{v}(s, \theta) - G\big)^2}.$$

Here $\mathcal{D}$ is a set of state and return pairs selected according to the following procedure. To sample each pair $(s, G)$, the robot followed the behavior policy for some random number of steps, sampled state $s$, and followed the target policy from $s$ to compute the true return $G$. After sampling each pair, the robot switched back to the behavior policy for a random number of time steps to get the next sample. We repeated this whole procedure 150 times to construct $\mathcal{D}$.

Figure 7: Learning curve and parameter study for our proposed methods on the robot collision task. Results show that our methods can be effective in higher dimensional spaces.

We tile coded each of the 150 numbers separately. Tile coding produced a feature vector of size 9600. LPs-NN had 300 features. Both TCs-NN and LPs-NN had 1000 hidden units. More experimental details can be found in Appendix A. Our methods worked well in this environment. The results are presented in Figure 7 (TCs-Lin and NN also performed well; their results are not shown here).

We also studied the effect of using larger values of λ. Eligibility traces have been shown to be effective in the past [Rummery and Niranjan1994, Sutton1996]. Our results (in Appendix C) confirm that larger values of λ help all methods (except those with RBF features) learn faster and more accurately. One reason may be that eligibility traces carry past information, which can help prevent interference. Eligibility traces are not computationally expensive and can be an alternative to experience replay.

6 Conclusions and future work

In this paper, we took a step towards understanding the catastrophic interference problem as it arises in online reinforcement learning. We showed that the two geometric input transformation methods, tile coding and EmECS, can help improve the online learning performance of single hidden layer neural nets with ReLU gates. Reinforcement learning systems using these transformation techniques can learn, in a fully incremental online manner, to successfully accomplish a task. While our focus was on reinforcement learning, these two methods can also be applied in supervised learning with neural nets.

Future work is to develop both methods further and understand them better. Our ongoing research includes experimenting with different embeddings for EmECS and studying their effects; testing both methods on larger problems; and using them with multilayer and hierarchical neural nets, to reduce the learning interference of individual nodes at higher hidden layers of these networks. A simple idea for combining EmECS with neural nets that have multiple hidden layers is to apply EmECS to each hidden layer. Experimenting with other activation functions, such as other variants of ReLU and sigmoidal functions, could also lead to interesting observations.

Another interesting future research direction is to investigate whether the proposed methods can improve a system that uses experience replay. Since experience replay uses batch data, it is unfair to compare it directly with online reinforcement learning algorithms. However, a more meaningful comparison could be to add experience replay to the current methods and compare the resulting methods with a system that uses experience replay to overcome interference.

Our proposed methods provide a new point of view on the problem of interference and its possible solutions, and there is a wide range of research directions that can be pursued.

Acknowledgments

The authors gratefully acknowledge funding from Alberta Innovates–Technology Futures, the Natural Sciences and Engineering Research Council of Canada, and Google DeepMind.

References

  • [Albus1975] Albus, J. S. 1975. Data storage in the Cerebellar Model Articulation Controller (CMAC). Journal of Dynamic Systems, Measurement and Control 97(3):228–233.
  • [Albus1981] Albus, J. S. 1981. Brains, Behavior, and Robotics. Peterborough, NH: BYTE Publications.
  • [Balakrishnan and Weil1996] Balakrishnan, S. N., and Weil, R. D. 1996. Neurocontrol: A literature survey. Mathematical and Computer Modelling 23(1-2):101–117.
  • [Brockman et al.2016] Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; and Zaremba, W. 2016. OpenAI Gym. arXiv preprint arXiv:1606.01540.
  • [Cybenko1989] Cybenko, G. 1989. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems 2:303–314.
  • [Engelking1989] Engelking, R. 1989. General Topology. Berlin: Heldermann Verlag, revised and completed edition.
  • [French1999] French, R. M. 1999. Catastrophic forgetting in connectionist networks: Causes, consequences and solutions. Trends in Cognitive Sciences 3(4):128–135.
  • [Hinton, McClelland, and Rumelhart1986] Hinton, G. E.; McClelland, J. L.; and Rumelhart, D. E. 1986. Distributed representations. In Parallel distributed processing: Explorations in the microstructure of cognition. Cambridge, MA: MIT Press. chapter 3, 77–109.
  • [Lee et al.2017] Lee, S.-W.; Kim, J.-H.; Jun, J.; Ha, J.-W.; and Zhang, B.-T. 2017. Overcoming catastrophic forgetting by incremental moment matching. In Advances in Neural Information Processing Systems, 4652–4662.
  • [Lin and Kim1991] Lin, C., and Kim, H. 1991. CMAC-based adaptive critic self-learning control. IEEE Transactions on Neural Networks 2(5):530–533.
  • [Lin1993] Lin, L.-J. 1993. Reinforcement learning for robots using neural networks. Technical report, Carnegie-Mellon Univ Pittsburgh PA School of Computer Science.
  • [McCloskey and Cohen1989] McCloskey, M., and Cohen, N. J. 1989. Catastrophic interference in connectionist networks: The sequential learning problem. Psychology of Learning and Motivation 24:109–165.
  • [Miller, Glanz, and Kraft1990] Miller, W. T.; Glanz, F. H.; and Kraft, L. G. 1990. CMAC: An associative neural network alternative to backpropagation. Proceedings of the IEEE 78:1561–1567.
  • [Miller, Sutton, and Werbos1990] Miller, W. T.; Sutton, R. S.; and Werbos, P. J., eds. 1990. Neural Networks for Control. Cambridge, MA: MIT Press.
  • [Mnih et al.2015] Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; Petersen, S.; Beattie, C.; Sadik, A.; Antonoglou, I.; King, H.; Kumaran, D.; Wierstra, D.; Legg, S.; and Hassabis, D. 2015. Human-level control through deep reinforcement learning. Nature 518(7540):529–533.
  • [Paszke et al.2017] Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; and Lerer, A. 2017. Automatic differentiation in PyTorch.
  • [Puterman1994] Puterman, M. L. 1994. Markov Decision Processes: Discrete Stochastic Dynamic Programming. New York: John Wiley & Sons.
  • [Rockafellar1970] Rockafellar, R. T. 1970. Convex Analysis. Princeton, NJ: Princeton University Press.
  • [Rummery and Niranjan1994] Rummery, G., and Niranjan, M. 1994. On-line Q-learning using connectionist systems. Technical Report CUED/F-INFENG/TR 166, Department of Engineering, University of Cambridge.
  • [Silver et al.2016] Silver, D.; Huang, A.; Maddison, C. J.; Guez, A.; Sifre, L.; Van Den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. 2016. Mastering the game of go with deep neural networks and tree search. Nature 529(7587):484–489.
  • [Sutton and Barto2018] Sutton, R. S., and Barto, A. G. 2018. Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 2nd edition.
  • [Sutton1988] Sutton, R. S. 1988. Learning to predict by the methods of temporal differences. Machine Learning 3:9–44.
  • [Sutton1996] Sutton, R. S. 1996. Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Advances in Neural Information Processing Systems (NIPS) 8.
  • [Tesauro1995] Tesauro, G. 1995. Temporal difference learning and TD-Gammon. Communications of the ACM 38(3):58–68.
  • [Tham1994] Tham, C. K. 1994. Modular On-Line Function Approximation for Scaling up Reinforcement Learning. Ph.D. Dissertation, University of Cambridge.
  • [Wang et al.2018] Wang, T.; Liao, R.; Ba, J.; and Fidler, S. 2018. NerveNet: Learning structured policy with graph neural networks. In The 6th International Conference on Learning Representations (ICLR).

Appendix A Experimental details

For implementing neural networks in different settings we used the PyTorch software [Paszke et al.2017].

Mountain Car prediction:

The original Mountain Car problem has a 2-dimensional state space: position and velocity. The position varies between −1.2 and 0.5, and the velocity varies between −0.07 and 0.07. There are three actions in each state: full throttle forward, full throttle backward, and no throttle. The car starts near the bottom of the hill at a random position (uniformly chosen between −0.6 and −0.4). The reward is −1 for each time step before reaching the goal state at the top of the hill, when the position becomes larger than 0.5.

Our fixed-policy Mountain Car testbed used the same environment as the original, only with a fixed policy. The policy was to push in the direction of the velocity.

We applied four methods along with nonlinear TD(λ) to approximate the value function of the policy. Each method had different parameters. TCj-Lin and TCj-NN both used 5 tilings, each of which had 4 × 4 tiles (giving the feature size of 80 mentioned in Section 5). NN used normalized raw inputs as its input to the neural net.

LPj-NN normalized the input space and then lift-and-projected it (as explained in Example 1(a)) to generate transformed features for the neural network. For this method we used a fixed radius in the lift-and-project map. After performing the lift-and-project, we shifted the origin in the extra dimension so that the hyperplanes associated with the hidden units of the neural network are close to the transformed inputs. This made it easier for each node to respond to a specific part of the transformed space. For LPj-NN and LPs-NN we made sure that the normal vectors of the hyperplanes created by the neural net weights were initialized such that they pointed “upwards” with respect to the extra dimensions (this was done for all experiments).
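One simple way to enforce this initialization (a sketch of our own, assuming a PyTorch linear layer) is to take the absolute value of the weight components on the extra lift dimensions:

```python
import torch

def point_normals_up(layer, extra_dims):
    """Make each hidden node's normal vector point 'upwards' by forcing the
    weight components on the extra (lift) dimensions to be nonnegative."""
    with torch.no_grad():
        layer.weight[:, extra_dims] = layer.weight[:, extra_dims].abs()

hidden = torch.nn.Linear(3, 100)   # e.g., LPj-NN on Mountain Car: 3 inputs
point_normals_up(hidden, [2])      # index 2 is the extra lift dimension
```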

The network’s weights for all methods were initialized by drawing from a normal distribution with mean 0 and a small standard deviation. We initialized the biases the same way.

RBF and sparse RBF features:

We also used RBF kernels in the Mountain Car prediction problem to transform the input features. RBF features are created using the following formula:

$$\phi_i(s) = \exp\!\left(-\frac{\|s - c_i\|^2}{2\sigma^2}\right),$$

where $\phi_i(s)$ is the $i$-th RBF feature, $c_i$ is the center, and $\sigma$ is the width. In our experiments we used 50 and 100 centers. These centers were chosen uniformly at random in the state space. We used two different widths (with respect to the normalized input). (The results for one of the widths are not presented in the paper because they were always worse than the results for the other.) The neural nets were initialized as before. For each state in the state space, its RBF features were first computed with respect to all of the centers (which creates a feature vector of size 50 or 100 in our case) and then used as the input to the neural net.

We also used another version of RBF which we call sparsified RBF. In this version, we found the features that were smaller than a threshold and set them to 0. This process made the input sparse. We refer to this method in figures as SRBF.
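A minimal sketch of these two feature constructions follows; the Gaussian form above and the threshold value below are our assumptions.

```python
import numpy as np

def rbf_features(s, centers, width):
    """Gaussian RBF features of state s with respect to the given centers."""
    d2 = np.sum((centers - s) ** 2, axis=1)   # squared distances to all centers
    return np.exp(-d2 / (2.0 * width ** 2))

def sparsified_rbf(s, centers, width, threshold=0.1):
    """SRBF: zero out features below a threshold (0.1 is a placeholder)."""
    f = rbf_features(s, centers, width)
    f[f < threshold] = 0.0
    return f
```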

Mountain Car control:

We again used the classic Mountain Car domain, but this time in the control setting. We used the same numbers of tiles and tilings as in the prediction experiment. LPj-NN also used the same parameters as in the prediction case. Weights and biases of the networks for all methods were initialized with a zero-mean normal distribution with a small standard deviation. In this problem, if an episode took more than 1000 steps, it was terminated manually.

Acrobot:

The Acrobot is similar to a gymnast. Its goal is to swing its feet above the bar it hangs from, in order to end the episode successfully. A state consists of two angles and two angular velocities. There are three actions: positive torque, negative torque, and no torque. The reward is −1 for each time step before the episode ends.

We used the Acrobot problem from OpenAI Gym [Brockman et al.2016]. We created the tile-coded features for this problem using the freely available tile coding software from Richard Sutton’s web page. We used multiple tilings over all 4 dimensions of Acrobot jointly, with a memory size of 256 (giving the feature size of 256 mentioned in Section 5). All neural networks were initialized the same way as in the Mountain Car control problem. LPj-NN used the same parameters as in the previous problems. If an episode took more than a fixed number of steps, it was terminated manually.

Robot prediction problem:

Here we provide the details on the robot prediction problem (which we also refer to as the collision task). The robot wants to learn when it will bump into something if it goes forward. If the robot’s sensors detected a bump, the reward became 1 and the discount factor γ became 0 (terminating the return at the bump); otherwise the reward was 0 and γ took a constant value less than 1. The target policy was to go forward in all states. The behavior policy was to go forward 90% of the time and turn left 10% of the time.

We used the same tile coding software that we used for Acrobot. To tile code each of the 150 pixel values, we used several tilings with a memory size of 64 per input. The size of the final feature vector was 150 × 64 = 9600.

LPs-NN used a fixed radius in the lift-and-project map. After lifting-and-projecting the input, we shifted the origin in every extra dimension, as in the previous experiments. We initialized all weights and biases for the neural networks of all methods with a zero-mean normal distribution with a small standard deviation.

Appendix B Learning interference in online training of neural nets

On the Mountain Car problem, neural nets operating on the raw state inputs suffered severe learning interference during online training. In most cases, they failed to learn at all; occasionally, they succeeded, but only after a long period of training. A sample of six learning curves is plotted in Figure 8 to illustrate this behavior. Each curve corresponds to one run of Sarsa(λ). The network parameters, including the hidden-layer size and the step size, are the same for all six curves.

Figure 8: Learning interference occurred in online training of neural nets with raw state inputs on the Mountain Car problem. Top row: three failed runs; bottom row: three partly successful runs.

We also observed learning interference in neural nets that operate on RBF features, on the Mountain Car control problem. Figure 9 plots the learning curves of six sample runs of Sarsa(λ), where the neural net operates on RBF features (generated as described in Appendix A, with respect to the normalized inputs), and the step size used is one of the best step sizes based on our parameter study. A similar sample of six learning curves for neural nets with sparsified RBF features is plotted in Figure 10, where the step size used to obtain these curves is again one of the best choices according to our parameter study, and sparsified RBF features are created from RBF features by replacing small values with 0 (see Appendix A for details).

Figure 9: Sample learning curves of neural nets with RBF features, on the Mountain Car problem. Top row: troublesome runs; bottom row: better runs.
Figure 10: Sample learning curves of neural nets with sparsified RBF features, on the Mountain Car problem. Top row: troublesome runs; bottom row: better runs.
Figure 11: Parameter study for neural nets with RBF or sparsified RBF (SRBF) features, on the Mountain Car problem. The performance of LPj-NN is plotted for comparison.

Figure 11 shows the learning performance of neural nets for a range of step-size parameters, when they operate on RBF or sparsified RBF (SRBF) features. (For the details of how we did this type of parameter study, see the experimental results section.) As can be seen, with RBF/SRBF features, neural nets suffered severe learning interference for both small and large step-size values.

The behavior of neural nets with RBF features shown above sharply contrasted with that of LPj-NN (which used only 3 input features). Note that in both cases, the input transformations involved are topological embeddings. The difference is that RBF features lack the other important aspect of EmECS: with radial basis functions, the inputs are not mapped to extreme points of a convex set. Instead they lie on a low-dimensional surface which can be so curvy that it becomes hard for individual neural nodes with ReLU gates, whose activation regions are half-spaces determined by hyperplanes, to separate a local region on this curvy surface from other parts of the surface (cf. Figure 12(a)).

Similarly, the behavior of neural nets with sparsified RBF features is in sharp contrast with that of TC-NN. Here, in both cases, the input transformations create sparse features based on the geometric structure of the state space. Again, the difference is that the sparsified RBF features do not lie among extreme points of a convex set, whereas the tile-coded features do, and this, we believe, has helped reduce learning interference and considerably improved the performance of the neural net.

Although it is hard to analyze the complex behavior of the entire network, to partially verify that what we just pointed out is a major reason why these neural nets with RBF/SRBF features failed, we plotted in Figure 12 the response functions of several nodes, after training these neural nets on the Mountain Car control problem. The majority of the node response functions we found are global. Even the relatively local ones tend to have strong responses to multiple spots in the state space that are far apart from one another. We plotted some of them in the two columns on the left side of the figure.

These node response functions from neural nets with RBF/SRBF features can be contrasted with the node response functions of neural nets with tile coding or EmECS shown by Figure 14 in the appendix on comparison of TC-Lin and TC-NN and Figure 18 in the appendix on comparison between EmECS and coarse coding.

(a) RBF-NN
(b) SRBF-NN
Figure 12: Response functions of neural nodes visualized on the original state space, for two neural nets operating on RBF or SRBF features, after training them on the Mountain Car task. The horizontal (vertical) axis in the plots is position (velocity). The color scheme is such that if we normalize the responses to be between −1 and 1, dark red is 1, green is 0, and dark blue is −1.

Appendix C Larger values of the trace parameter λ

In this appendix we provide the experimental results for larger values of the trace parameter λ, for the three prediction and control tasks that we considered in Section 5. More specifically, we provide the results for two larger values of λ and for four algorithms: tile coding jointly with neural networks (TCj-NN), lift-and-project jointly with neural networks (LPj-NN), tile coding with linear function approximation (TCj-Lin), and neural networks with raw inputs (NN). The results are shown in Figure 13. We also performed experiments with an additional value of λ, the results of which are not shown here because they were similar to those presented.

As can be seen from these figures, although all methods got better in terms of speed and final performance, NN was still slower than the other methods across all tasks. In Mountain Car prediction, LPj-NN and TCj-NN maintained their superior performance as larger values of λ were used. In Mountain Car control, all methods had a similar final performance for each specific value of λ, except for NN, which could not solve the task. However, as λ increased, each method achieved faster learning and better final performance compared to when it used a smaller λ. In the Acrobot task, TCj-NN outperformed the other methods in terms of speed and final performance for the smaller values of λ; for the largest λ, the performances of all methods were close to each other.

Figure 13: Learning curves and parameter studies for two larger values of λ.

Appendix D Comparison of TC-Lin and TC-NN

Here, we continue our discussion from Section 3 of the paper. We first show experimentally how TC-Lin and TC-NN differ. We then provide an intuitive explanation of why TCs-Lin suffers from generalization restrictions whereas TCs-NN does not. After that, we use neural node response functions (similar to the ones in Appendix B) to explain why TCs-NN performs as well as TCj-NN.

We compare TCs-Lin, TCj-Lin, TCs-NN, and TCj-NN for all values of λ on the Mountain Car prediction problem. (Results for LPs-NN and LPj-NN are also shown.) Our results in Figure 16 show that for all values of λ, TCs-Lin is worse than TCj-Lin in terms of asymptotic performance. TCs-Lin’s range of step sizes for which it converges is also smaller than TCj-Lin’s. However, TCs-NN and TCj-NN (and also LPs-NN and LPj-NN) achieve the same performance for the same step-size values. These results show that TC-NN outperforms TC-Lin for all values of λ.

When we tile code different dimensions of the input space separately and combine them with linear function approximation, the resulting method (TCs-Lin) has a rather critical limitation: it cannot take into account how features from different dimensions interact with each other. For example, in a continuous 2D space, it cannot express that a feature from the first dimension is good only when the feature from the second dimension has a specific value. Whenever TCs-Lin updates the weight corresponding to a feature $i$ from the first dimension, it generalizes across $i$ and all features of the second dimension, which might not be ideal. TCj-Lin solves this problem by coding both dimensions at the same time and capturing the neighborhood information in both dimensions. Neural nets generalize differently and therefore do not encounter the aforementioned problem. A neural net with ReLU gates can choose to generalize for feature $i$ from the first dimension when feature $j$ from the second dimension has a specific value or, in the case of tile coding, when feature $j$ is absent (equal to zero), even if the input dimensions are tile coded separately. This is because in a fully connected network, all the features are gathered together at each node, and each node can respond to the features of its choice. The conclusion is that when the input space is tile coded separately, a linear function approximation method has a limitation that neural nets do not have.
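As a concrete illustration of this limitation (our own example, not from the original experiments): with separately coded dimensions, every function a linear approximator can represent is additively separable,

$$\hat{v}(x, y) = \sum_i w_i\, \phi_i(x) + \sum_j u_j\, \psi_j(y) = g(x) + h(y),$$

and a target as simple as $v(x, y) = xy$ on $[0, 1]^2$ is outside this class: $v(1,1) - v(1,0) - v(0,1) + v(0,0) = 1$, whereas any $g(x) + h(y)$ yields $0$ for this combination.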

Now let us focus on the neural node response functions for TCs-NN and TCj-NN and compare them to those of NN. Figure 14 shows some response functions for TCj-NN and TCs-NN. We followed the same procedure as in Appendix B to create these figures. The figures show that the response functions for TCj-NN and TCs-NN are similar, meaning that whether the input is tile coded separately or jointly, the neural net still generalizes between states in a similar fashion.

Figure 14: Response functions for tile coding plus neural nets on the Mountain Car control problem.
Figure 15: Response functions for NN with raw inputs on the Mountain Car control problem.
Figure 16: Parameter studies for different values of λ.

One important characteristic of TCj-NN and TCs-NN is that the nodes tend to focus their responses on a connected region. The reason is that when the inputs are tile coded, they are mapped to vertices of a hypercube in a higher dimension. Although the state space does not preserve its shape in this transformation, states that are close in the original space tend to have a small Hamming distance between their binary representations and are therefore still close to each other on the hypercube. This means that the nodes respond to neighborhoods that are close in the original space. However, there can still exist nodes that respond to different regions of the space (see the top right plot for TCj-NN in Figure 14).
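Using the toy `tile_code` sketch from Section 3 (an illustration of ours, not the software used in the experiments), one can check this directly: nearby states share most of their bits, distant states do not.

```python
import numpy as np

# Requires the tile_code sketch from Section 3.
a = tile_code(np.array([0.50, 0.50]))
b = tile_code(np.array([0.52, 0.51]))   # a nearby state
c = tile_code(np.array([0.95, 0.10]))   # a distant state
print(int(np.sum(a != b)), int(np.sum(a != c)))  # small vs. large Hamming distance
```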

When the response functions of TCj-NN and TCs-NN are compared to the ones for NN in Figure 15, one can observe that the response of a single node in NN is linear within its activation region, while TCj-NN and TCs-NN approximate a more complex function that is not linear with respect to the original input space within their activation regions.

Appendix E Comparison between EmECS and coarse coding

In Section 4 we briefly discussed some similarities between EmECS and coarse coding and how the choice of the embedding in EmECS can influence the shapes of the activation regions of nodes in the neural nets when they operate on transformed inputs. This appendix is a longer version of that discussion, with more details and illustrations.

In terms of local generalization, EmECS is similar in spirit to coarse coding, despite their being seemingly unrelated. As explained in Section 3 and illustrated in Figure 17, in coarse coding we cover the state space with overlapping sets (circles, in this case), referred to as receptive fields (of the corresponding features). Observations at a state $s$ will activate the three shaded receptive fields that contain $s$. The union of these fields delineates the region in which generalization occurs, and this region is connected and composed of “nearby” states.

When we apply EmECS, what resemble the generalization regions in coarse coding are the activation regions of each neural node, viewed in the original input space. Because of the embedding property, these activation regions are also connected (as in coarse coding), if the hyperplanes associated with the nodes have their normal vectors point in the right directions in the transformed input space, as we mentioned in Section 4. For instance, with the lift-and-project map in Example 1(a), it suffices that these normal vectors point “upwards” with respect to the extra $(n+1)$-th dimension, i.e., the $(n+1)$-th component of every normal vector is nonnegative. Similarly, if we use the separate embedding scheme described in Example 1(c) with lift-and-project maps, it suffices that those normal-vector components associated with the extra dimensions are nonnegative.

If the physical state space is not all of $X$ but a connected subspace of $X$, then a connected activation region in $X$, when it is too large, can still contain disconnected sets of physical states. However, by the geometric property of extreme points discussed at the beginning of Section 4 and by the embedding property, EmECS ensures that as the activation region (viewed in $X$) becomes sufficiently small, it will contain only a connected subset of physical states.

If a node is activated by an input and the network is updated with gradient-descent using that input and a small step size, the activation region as well as the response of the node within that region will be modified slightly. Thus when carrying out generalization, both coarse coding and the neural nodes in EmECS respect the topological structure of the original input space, although coarse coding does that in a coarser way.

Figure 17: Similarities between receptive fields in coarse coding and activation regions in EmECS. Top: receptive fields (in this case, circles) in coarse coding. Bottom: activation regions (ellipses, in this case) of three neural nodes in EmECS, viewed in the original input space. Each node responds with greater intensity (darker shading) to inputs lying deeper inside its activation region.

In coarse coding, receptive fields can have different sizes and shapes. So can the activation regions in EmECS, as illustrated schematically in Figure 17. However, the receptive fields in coarse coding are chosen in advance, before learning takes place, whereas in EmECS, the activation regions change their locations and sizes dynamically during learning. Their shapes depend partly on the embedding, so by choosing the embedding, we can have influence over them, as in coarse coding.

As a simple example, if we scale the coordinates of $X$ to create the space $X'$ and then embed $X'$ using the lift-and-project map in Example 1(a), then viewed in the original space $X$, the activation regions will be asymmetric, wider in some dimensions and narrower in others, and their shapes will also depend on where they are located. As another example, the separate embedding scheme in Example 1(c) gives the network more freedom to produce activation regions that are asymmetric, whereas a joint embedding scheme like Example 1(a) can be used to enforce symmetric or specific asymmetric shapes.

Finally, to give a sense of what these activation regions actually look like when the neural nets have been trained on a task, we plotted in Figure 18 the response functions of a few neural nodes on the original input space, after the neural nets solved the Mountain Car control problem using Sarsa(λ). Two types of maps are used to apply EmECS in this experiment. The first is the lift-and-project map given in Example 1(a); the corresponding algorithm is denoted LPj-NN. The second is the separate embedding scheme given in Example 1(c), where we embed the two dimensions of the state space separately and take the component maps to be lift-and-project maps of the form in Example 1(a). The corresponding algorithm is denoted LPs-NN.

The node response functions for LPj-NN and LPs-NN are plotted in Figure 18. The activation regions have different shapes under LPj-NN and LPs-NN. They are located across the state space, each focusing on some part of the space. However, instead of being localized to small neighborhoods of particular states, they tend to be spread out.

Figure 18: Response functions of neural nodes visualized on the original state space, for two trained neural nets operating on inputs transformed by EmECS with joint (LPj) and separate (LPs) embedding schemes. The task is Mountain Car; the horizontal (vertical) axis in the plots is position (velocity). The color scheme is such that if dark red represents 1, then green is 0 and dark blue is −1.

As to why these neural nets prefer large activation regions for each node, one explanation could be that each node was initialized with a random global activation region, and it was hard for the nodes to coordinate with each other to shrink their activation regions to small neighborhoods of specific states. Another possible reason is that having many small activation regions makes it harder to approximate the function values well at the boundaries of those regions, and it is actually easier for the neural nets to get good approximations by having a spread-out response from each node. Further investigation is needed to better understand this behavior of neural nets.