Reinforcement learning was originally developed for Markov Decision Processes (MDPs). It allows an agent to learn a policy to maximize a possibly delayed reward signal in a stochastic environment and guarantees convergence to an optimal policy, provided that the agent can sufficiently experiment and the environment in which it is operating is Markovian.
In many real-world problems, however, the agent cannot directly perceive the full state of its environment and must make decisions based on incomplete observations of the system state. This partial observability introduces uncertainty about the true environment state and renders the problem non-Markovian from the agent’s point of view. One way to deal with partially observable environments is to equip the agent with a memory of past observations and actions in order to help it discover what the current state of the environment is. This memory can be implemented in a variety of ways, including explicit history windows [9, 10], but this article focuses only on reinforcement learning using recurrent neural networks for function approximation. Unlike basic feed-forward networks, recurrent neural networks can contain cyclic connections between neurons. These cycles give rise to dynamic temporal behavior, which can function as an internal memory that allows these networks to model values associated with sequences of observations [7, 4, 1]. This paper aims at comparing different recurrent neural architectures when used to model value functions in a reinforcement learning context.
The next section provides necessary background on reinforcement learning and the recurrent network architectures compared in this paper. Section 3 describes the experimental setup and environments used for the comparison. The empirical results are provided in Section 4. Finally, we conclude in Section 5.
2 Background
Discrete-time reinforcement learning consists of an agent that repeatedly senses observations of its environment and performs actions. After each action $a_t$, the environment changes state to $s_{t+1}$ and the agent receives a reward $r_{t+1}$ and an observation $o_{t+1}$. The agent has no knowledge of the functions that produce them and has to interact with its environment in order to learn a policy $\pi(o_t, a_t)$ that gives the probability of taking each of the actions for any given observation. The optimal policy $\pi^*$ is the one that, when followed by the agent, maximizes the cumulative discounted reward $\sum_t \gamma^t r_t$, with $\gamma$ the discount factor.
When the reward received by the agent depends solely on its current observation and action, the problem is reduced to a Markov decision process and is said to be completely observable (the agent can assume that $s_t = o_t$ without losing learning abilities). Partially observable Markov decision problems occur when the reward does not depend only on $o_t$, but on a state $s_t$ whose dynamics still obey some underlying MDP, but which the agent cannot observe directly. In this case, $o_t = f(s_t)$, with $f$ an unknown one-way function that is part of the environment.
2.1 Q-Learning and Advantage Learning
Q-Learning estimates the $Q$ function, which maps each state-action pair to the expected, optimal cumulative discounted reward reachable by taking action $a_t$ given observation $o_t$. At each time step, the agent observes $o_t$, takes action $a_t$ and observes $r_{t+1}$ and $o_{t+1}$. Equation 1 is used to update the $Q$ function after each time step, with $\alpha$ the learning factor.
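As an illustration of the update described above, the following minimal sketch applies the standard tabular Q-Learning rule, $Q(o_t, a_t) \leftarrow Q(o_t, a_t) + \alpha \, (r_{t+1} + \gamma \max_a Q(o_{t+1}, a) - Q(o_t, a_t))$. The function name, the dictionary-based table and the parameter values are assumptions made for this example. They are not taken from the experiments, which use neural networks instead of a table.

```python
from collections import defaultdict


def q_learning_update(Q, o, a, r, o_next, actions, alpha=0.1, gamma=0.9):
    """Apply one tabular Q-Learning update in place and return the new value.

    alpha and gamma are illustrative values, not the ones used in the paper.
    """
    # Value of the best action available from the next observation.
    best_next = max(Q[(o_next, b)] for b in actions)
    # Temporal-difference error between the bootstrapped target and the estimate.
    td_error = r + gamma * best_next - Q[(o, a)]
    Q[(o, a)] += alpha * td_error
    return Q[(o, a)]


# Hypothetical usage on an empty table: every value starts at 0.
Q = defaultdict(float)
actions = ["up", "down", "left", "right"]
new_value = q_learning_update(Q, o=(0, 0), a="right", r=-1.0,
                              o_next=(1, 0), actions=actions)
```

Starting from an all-zero table, a single step with a reward of -1 moves the updated entry a fraction $\alpha$ of the way towards the target.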
Advantage Learning [6] is related to Q-Learning, but artificially decreases the value of non-optimal actions. This widens the difference between the value of the optimal action and the other ones, which allows learning to converge more easily even if the values are approximated (using function approximation). Equation 3 is used to update the Advantage values at each time step. The smaller $\kappa$ is, the wider the gap between the optimal and non-optimal actions becomes.
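The effect of the scaling factor can be sketched with a tabular version of the Advantage Learning update in the form commonly attributed to Harmon and Baird [6]: the target is $V(o_t) + (r_{t+1} + \gamma V(o_{t+1}) - V(o_t)) / \kappa$, with $V(o) = \max_a A(o, a)$. The function name, the table representation and all parameter values below are assumptions made for this example.

```python
from collections import defaultdict


def advantage_update(A, o, a, r, o_next, actions,
                     alpha=0.1, gamma=0.9, kappa=0.5):
    """Apply one tabular Advantage Learning update in place.

    With kappa = 1 the rule reduces to the Q-Learning update; smaller
    values of kappa amplify the gap between optimal and other actions.
    """
    v = max(A[(o, b)] for b in actions)            # V(o_t)
    v_next = max(A[(o_next, b)] for b in actions)  # V(o_{t+1})
    target = v + (r + gamma * v_next - v) / kappa
    A[(o, a)] += alpha * (target - A[(o, a)])
    return A[(o, a)]


actions = [0, 1]
# Same transition, two values of kappa (illustrative values only).
as_q_learning = advantage_update(defaultdict(float), "o0", 0, -1.0, "o1",
                                 actions, kappa=1.0)
widened = advantage_update(defaultdict(float), "o0", 0, -1.0, "o1",
                           actions, kappa=0.5)
```

On this single non-optimal step, halving $\kappa$ doubles the decrease applied to the action value.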
In very large or even continuous environments, exact representation of the Q-function (or Advantage function) is no longer possible. In these cases a function approximation architecture is needed to represent the target function. It has been shown, however, that on-line Q-Learning can diverge, or converge very slowly, when used in combination with function approximation. One solution to this problem is to learn the $Q$-values off-line. The method used in this paper is the neural fitted Q iteration described in [11], an adaptation of fitted Q iteration [5] using neural networks. The agent interacts with its environment using a fixed policy until it reaches the goal or a maximum number of time steps has elapsed, and collects samples of the form $(o_t, a_t, r_{t+1}, o_{t+1})$. After a number of episodes have been run, the model is trained in batch on the collected data. The model maps sequences of observations to action values.
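The batch step of such a fitted method can be sketched as follows: the stored samples are turned into regression targets using a frozen copy of the model, and the model is only trained afterwards. This is a simplified illustration and not the exact procedure of [11]. The `done` flag, the `predict` callable and the discount value are assumptions introduced for this example.

```python
def fitted_q_targets(samples, predict, gamma=0.9):
    """Turn stored transitions into (observation, action, target) triples.

    samples: list of (o, a, r, o_next, done) tuples collected with a fixed policy.
    predict: callable returning the list of action values for an observation;
             it stands for the model as it was before the batch update.
    """
    dataset = []
    for o, a, r, o_next, done in samples:
        # No bootstrapping from the next observation once the episode has ended.
        bootstrap = 0.0 if done else gamma * max(predict(o_next))
        dataset.append((o, a, r + bootstrap))
    return dataset


def dummy_predict(observation):
    """Hypothetical frozen model that returns constant action values."""
    return [0.0, 1.0]


samples = [("o0", 1, -1.0, "o1", False), ("o1", 0, 10.0, "goal", True)]
targets = fitted_q_targets(samples, dummy_predict)
```

The resulting triples are what a regression model would be fitted on at the end of each batch of episodes.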
The next subsections describe the different recurrent network architectures that we consider in this paper to represent the target functions.
2.2 Long Short Term Memory
An LSTM [7] cell stores a value. An output gate allows the cell to modulate its output strength, while an input gate controls the intensity of the input signal that is continuously added to the cell’s content. A forget gate, when set to zero, clears the content of the cell. Equations 5 to 7 show how the values of the gates are computed. Equations 8 and 9 show how to compute the value of the memory cell, and Equation 10 shows the output of an LSTM cell.
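For reference, a common formulation of an LSTM layer without peephole connections is given below. The symbol names are assumptions chosen for this sketch: $x_t$ is the input, $h_t$ the output, $c_t$ the cell content, $g^{in}_t$, $g^{f}_t$ and $g^{out}_t$ the input, forget and output gates, $\sigma$ the logistic function and $\odot$ the element-wise product.

```latex
\begin{align*}
g^{in}_t    &= \sigma(W_{in} x_t + U_{in} h_{t-1} + b_{in}) \\
g^{f}_t     &= \sigma(W_{f} x_t + U_{f} h_{t-1} + b_{f}) \\
g^{out}_t   &= \sigma(W_{out} x_t + U_{out} h_{t-1} + b_{out}) \\
\tilde{c}_t &= \tanh(W_{c} x_t + U_{c} h_{t-1} + b_{c}) \\
c_t         &= g^{f}_t \odot c_{t-1} + g^{in}_t \odot \tilde{c}_t \\
h_t         &= g^{out}_t \odot \tanh(c_t)
\end{align*}
```

In this form, a forget gate equal to zero removes $c_{t-1}$ from the cell update, which matches the clearing behavior described above.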
In [1], Bakker proposes a neural network architecture tailored for reinforcement learning. The network has one input neuron per observation variable, and one output neuron per action. A softmax layer transforms the output of the neural network to a probability distribution over the actions. The neural network itself consists of an LSTM layer and a simple tanh layer working in parallel: the input of the network is fed to the LSTM and tanh layer, both these layers are connected to the output, the output of the tanh layer is connected to the input of the LSTM layer and the output of the LSTM layer is connected to the input of the tanh layer (see Figure 1). This article uses a simpler version of the network: the input is connected to a tanh layer, which is in turn connected to an LSTM layer, which is connected to the output. Both the tanh layer and the LSTM layer contain 100 neurons (or LSTM cells).
The models themselves are built on Keras (http://keras.io/), a Python library providing neural network primitives based on Theano [2]. Keras provides LSTM, GRU, MUT and dense fully-connected weighted layers (among others). Layers can be assembled either in a stack or in a directed acyclic graph. The connection scheme in [1] makes the network layer graph cyclic, and hence impossible to build using the current version of Keras.
2.3 Gated Recurrent Unit
GRU has been introduced recently and follows a design completely different from LSTM [3, 4]. Instead of storing a value in a memory cell and updating it using input and forget gates, a GRU unit computes a candidate activation based on its input, and then produces an output that is a blend of its past output and the candidate activation. Equations 11 and 12 show how the Z (modulation) and R (reset) gates are computed. Equations 13 and 14 show how the input is mixed with the last activation in order to produce the candidate activation, and Equation 15 shows how the last activation and the candidate activation are mixed to produce the new activation.
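A common formulation of the GRU, following the convention of Chung et al. [4], is given below as a reference. The weight names are assumptions made for this sketch: $z_t$ is the modulation gate, $r_t$ the reset gate, $\tilde{h}_t$ the candidate activation and $h_t$ the new activation.

```latex
\begin{align*}
z_t         &= \sigma(W_z x_t + U_z h_{t-1} + b_z) \\
r_t         &= \sigma(W_r x_t + U_r h_{t-1} + b_r) \\
\tilde{h}_t &= \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h) \\
h_t         &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{align*}
```

Some presentations swap the roles of $z_t$ and $1 - z_t$ in the last line. The two conventions are equivalent up to a relabeling of the gate.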
Józefowicz et al. [8] observed that GRU and LSTM are very different from each other, and wondered whether other recurrent neural architectures could be used. In order to discover them, they developed a genetic algorithm that evaluated thousands of recurrent neural architectures. Once the experiment was finished, they identified three architectures that performed as well as or better than LSTM and GRU on their test vectors: MUT1, MUT2 and MUT3.
This paper only considers MUT1, which produced the best results in preliminary experiments. Equations 16 and 17 show how to compute the value of the Z and R gates, Equations 18 and 19 show how to compute the candidate activation, and Equation 20 shows that the output of MUT1 uses the same type of mixing as the one used by GRU.
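For reference, MUT1 as defined by Józefowicz et al. [8] can be written as follows, with the candidate activation and the mixing step given on a single line. The notation follows that publication and may differ from the equation numbering used in this paper.

```latex
\begin{align*}
z       &= \sigma(W_{xz} x_t + b_z) \\
r       &= \sigma(W_{xr} x_t + W_{hr} h_t + b_r) \\
h_{t+1} &= \tanh(W_{hh} (r \odot h_t) + \tanh(x_t) + b_h) \odot z + h_t \odot (1 - z)
\end{align*}
```

Unlike in GRU, the $z$ gate depends only on the input, and the input enters the candidate activation through $\tanh(x_t)$ without a weight matrix, which requires the input and the activation to have the same dimension.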
3 Experimental Setup
In order to keep training time manageable, the neural networks are trained to associate values with the last 10 observations, instead of the complete history. LSTM, GRU and MUT1 are able to associate values to arbitrarily long sequences of inputs, but Keras requires all the sequences on which it is trained to have the same length (possibly with padding).
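Building fixed-length training sequences from episodes of varying length can be sketched as below. Padding on the left with all-zero observation vectors is an assumption made for this example, as are the function name and its parameters.

```python
def pad_history(history, width, length=10):
    """Return the last `length` observations, left-padded with zero vectors.

    history: list of observation vectors (lists of floats) seen so far.
    width:   size of one observation vector, used to build the padding.
    """
    recent = [list(obs) for obs in history[-length:]]
    padding = [[0.0] * width for _ in range(length - len(recent))]
    return padding + recent


# Hypothetical episode that is only three time steps old.
short_episode = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
window = pad_history(short_episode, width=2)
```

Every call returns a sequence of exactly `length` vectors, so sequences taken at different time steps can be stacked into a single training array.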
Training has to be done carefully, because one does not want the model to forget past experiences when a new batch of episodes is learned. The network has been configured to perform 2 training epochs on the data, using a batch size of 10 (batches of 10 samples are used to compute an average gradient when performing backpropagation). The small number of epochs prevents the model from overfitting specific episodes.
3.1 Environments
Three environments are used to evaluate the neural network models. The first one is a simple fully-observable grid world with an initial position, a goal and an obstacle (see Figure 2). The agent can observe its $(x, y)$ coordinates. It receives a reward at each time step, a different one if it hits a wall or the obstacle, and another when it reaches the goal.
The second environment is based on the same grid world as the first one, but the agent can only observe its $x$ coordinate. The $y$ coordinate is masked to zero.
The last environment is also based on the grid world, but the agent can only observe its orientation (whether it is facing up, down, left or right, expressed as a 0 to 3 integer number) and the distance between it and the wall in front of it. This agent-centric environment is very close to what actual robots can experience.
The “stochastic” variant of the experiments uses a random initial position for every episode. The agent can sense its initial coordinate at the first time step, even in otherwise partially observable environments. (Some experiments have been re-run without this hint, with no change in the results: the agent learns to look left, then up, and uses those observations as its initial position.)
The observations of the agent, which consist of integer numbers, are encoded using a one-hot encoding so that they are more easily processed by neural networks. For instance, the y coordinate of the grid world can take values from 0 to 4, each of which is encoded as a vector of five binary values containing a single 1 (the value 2 becomes $(0, 0, 1, 0, 0)$, for example). For the grid world, the neural networks therefore have 15 input neurons.
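The encoding can be sketched as follows. The split of the 15 inputs into 10 values for the x coordinate and 5 for the y coordinate is an assumption derived from the figures given above, and the function names are hypothetical.

```python
def one_hot(value, size):
    """Encode an integer in [0, size) as a list holding a single 1.0."""
    if not 0 <= value < size:
        raise ValueError("value out of range for one-hot encoding")
    return [1.0 if i == value else 0.0 for i in range(size)]


def encode_observation(x, y, x_size=10, y_size=5):
    """Concatenate the one-hot codes of both coordinates (sizes are assumed)."""
    return one_hot(x, x_size) + one_hot(y, y_size)


encoded_y = one_hot(2, 5)
network_input = encode_observation(3, 2)
```

With these sizes, the concatenated vector has 15 entries, two of which are set to 1.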
3.2 Experiments
Each experiment consists of 5000 episodes of a maximum of 500 time steps. During the episodes, the neural network is not trained on any new data, but updated values are computed and stored in a list. After every batch of 10 episodes, the neural networks are trained on the stored values, as described in [11] and shown in Figure 3.
The experiments themselves consist of trying to reach the goal in one of the environments described in Section 3.1. Each experiment is run 15 times for each combination of the following parameters:
Value iteration: Q-Learning and Advantage learning, , and
Neural network architecture: feed-forward perceptron with a single hidden layer (nnet), LSTM (lstm), GRU (gru) and MUT1 (mut1)
World: gridworld (gw), partially observable gridworld (po) and agent-centric gridworld (ac)
Fixed initial position and random initial position
Softmax action selection with a temperature of 0.5
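Softmax action selection, the last item of the list above, can be sketched as follows. The function names are hypothetical, and subtracting the maximum before exponentiating is a standard numerical precaution that does not change the probabilities.

```python
import math
import random


def softmax_probabilities(values, temperature=0.5):
    """Turn action values into selection probabilities (Boltzmann distribution)."""
    scaled = [v / temperature for v in values]
    highest = max(scaled)
    # Shifting by the maximum avoids overflow in exp() for large values.
    exps = [math.exp(s - highest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def select_action(values, temperature=0.5, rng=random):
    """Sample an action index according to the softmax probabilities."""
    probabilities = softmax_probabilities(values, temperature)
    return rng.choices(range(len(values)), weights=probabilities, k=1)[0]


probabilities = softmax_probabilities([1.0, 0.0])
action = select_action([1.0, 0.0, -1.0, 0.5])
```

A lower temperature concentrates the probability mass on the highest-valued action, while a higher temperature makes the selection closer to uniform.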
4 Empirical Results
Each experiment (see Section 3.2) is run 15 times. The first time step at which the agent is able to maintain an average reward (over the next 1000 time steps) above a fixed threshold, with a standard deviation of less than 20, is called the learning time. The best average reward obtained during a 1000-time-step window is called the learning performance.
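Both measures can be computed from a recorded sequence of rewards as sketched below. The reward threshold is left as a parameter, the use of the population standard deviation is an assumption, and the function names are hypothetical. The straightforward loop recomputes each window from scratch, which is simple but not optimized.

```python
from statistics import mean, pstdev


def learning_time(rewards, threshold, window=1000, max_std=20.0):
    """Return the first index whose following window meets both criteria.

    Returns None when no window has a mean above `threshold` together
    with a standard deviation below `max_std`.
    """
    for t in range(len(rewards) - window + 1):
        chunk = rewards[t:t + window]
        if mean(chunk) > threshold and pstdev(chunk) < max_std:
            return t
    return None


def learning_performance(rewards, window=1000):
    """Return the best average reward over any window of the given length."""
    return max(mean(rewards[t:t + window])
               for t in range(len(rewards) - window + 1))


# Small illustrative trace with a short window so that it is easy to check.
trace = [-50.0] * 5 + [10.0] * 10
t_learn = learning_time(trace, threshold=0.0, window=3)
best = learning_performance(trace, window=3)
```

In the illustrative trace, the first window made only of the higher reward starts at index 5, which is the value returned as the learning time.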
Table 1 shows the learning time of the different neural networks for all the experiment configurations. Best results are emphasized in bold. Results are displayed in a mean/stddev format.
Advantage Learning leads to smaller learning times and standard deviations than Q-Learning in all worlds except the partially observable grid world. Figure 4 shows the behavior of the Q and Advantage learning algorithms in the partially observable grid world. Q-Learning allows faster convergence with a smaller standard deviation.
When using a fixed initial position, GRU learns faster than any other network. The difference in learning speed between GRU and LSTM is statistically significant for (gw, Advantage), (gw, Q-Learning), (po, Q-Learning) and (ac, Q-Learning) (p-values of 0.003, 0.008, 0.0003 and 0.0003, respectively), but not for (po, Advantage) and (ac, Advantage) (p-values of 0.118 and 0.140, respectively).
When using a random initial position, GRU is the only model allowing learning in all the environments when Advantage Learning is used. LSTM and GRU give comparable results in the partially observable worlds, with no statistically significant difference between them.
Agents using MUT1 as a function approximator nearly always manage to learn a good enough policy in partially observable worlds, but they need a large number of episodes to do so. However, plain perceptron-based agents do not manage to learn a policy at all in these worlds, which shows that MUT1 allows better learning in partially observable worlds than a simple non-recurrent neural network. (The only exception is the partially observable grid world using Q-Learning and random initial positions, where the agent learns to go left, then randomly goes up and down until the goal is reached by chance.)
Table 2 shows the learning performance of the different neural networks, with the highest values highlighted in bold. Results are displayed in a mean/stddev format.
The feed-forward neural network always achieves the best scores in the grid world, followed by GRU, then LSTM, and finally MUT1. GRU always outperforms the other network architectures in the partially observable worlds. In these worlds, GRU is statistically significantly better than LSTM in all cases (p-value less than 0.0001) except in (random initial, ac, Q-Learning) (p-value of 0.682).
5 Conclusion
LSTM, GRU and MUT1 have been compared on simple reinforcement learning problems. It has been shown that agents using LSTM and GRU for approximating Q or Advantage values perform significantly better than the ones using MUT1, obtaining higher rewards and learning faster.
GRU and LSTM provide comparable performance, with GRU often being significantly better than LSTM. LSTM is never significantly better than GRU. When considering the rewards received by the agents once they have learned, and not the time required for learning, GRU always achieves better results than LSTM.
This shows that using GRU instead of LSTM should be considered when tackling reinforcement learning problems. Moreover, on the machine used for the experiments, the simpler GRU cell (compared to LSTM) allowed the GRU-based agents to complete their 5000 episodes approximately twice as fast as LSTM-based agents.
References
- [1] B. Bakker. Reinforcement Learning with Long Short-Term Memory. In Advances in Neural Information Processing Systems 14, pages 1475–1482, 2001.
- [2] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, and Y. Bengio. Theano: a CPU and GPU Math Expression Compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), 2010.
- [3] K. Cho, B. van Merrienboer, Ç. Gülçehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1724–1734, 2014.
- [4] J. Chung, Ç. Gülçehre, K. Cho, and Y. Bengio. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. CoRR, abs/1412.3555, 2014.
- [5] D. Ernst, P. Geurts, and L. Wehenkel. Tree-Based Batch Mode Reinforcement Learning. Journal of Machine Learning Research, 6:503–556, 2005.
- [6] M. E. Harmon and L. C. Baird III. Multi-player residual advantage learning with general function approximation. Wright Laboratory, WL/AACF, Wright-Patterson Air Force Base, OH, 1996.
- [7] S. Hochreiter and J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, 1997.
- [8] R. Józefowicz, W. Zaremba, and I. Sutskever. An Empirical Exploration of Recurrent Network Architectures. In Proceedings of the 32nd International Conference on Machine Learning, pages 2342–2350, 2015.
- [9] L.-J. Lin and T. Mitchell. Reinforcement Learning with Hidden States. In From animals to animats 2: Proceedings of the second international conference on simulation of adaptive behavior, volume 2, page 271. MIT Press, 1993.
- [10] A. K. McCallum. Learning to use selective attention and short-term memory in sequential tasks. In From animals to animats 4: Proceedings of the fourth international conference on simulation of adaptive behavior, volume 4, page 315. MIT Press, 1996.
- [11] M. A. Riedmiller. Neural Fitted Q Iteration - First Experiences with a Data Efficient Neural Reinforcement Learning Method. In Machine Learning: ECML 2005, 16th European Conference on Machine Learning, pages 317–328, 2005.
- [12] O. Tange. GNU Parallel - The Command-Line Power Tool. ;login: The USENIX Magazine, 36(1):42–47, Feb 2011.
- [13] C. J. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.