1 Introduction
Reinforcement learning was originally developed for Markov Decision Processes (MDPs). It allows an agent to learn a policy to maximize a possibly delayed reward signal in a stochastic environment and guarantees convergence to an optimal policy, provided that the agent can sufficiently experiment and the environment in which it is operating is Markovian.
In many real-world problems, however, the agent cannot directly perceive the full state of its environment and must make decisions based on incomplete observations of the system state. This partial observability introduces uncertainty about the true environment state and renders the problem non-Markovian from the agent’s point of view. One way to deal with partially observable environments is to equip the agent with a memory of past observations and actions in order to help it discover what the current state of the environment is. This memory can be implemented in a variety of ways, including explicit history windows [9, 10], but this article focuses only on reinforcement learning using recurrent neural networks for function approximation. Unlike basic feedforward networks, recurrent neural networks can contain cyclic connections between neurons. These cycles give rise to dynamic temporal behavior, which can function as an internal memory that allows these networks to model values associated with sequences of observations [7, 4, 1]. This paper aims at comparing different recurrent neural architectures when used to model value functions in a reinforcement learning context.
The next section provides necessary background on reinforcement learning and the recurrent network architectures compared in this paper. Section 3 describes the experimental setup and environments used for the comparison. The empirical results are provided in Section 4. Finally, we conclude in Section 5.
2 Background
Discrete-time reinforcement learning consists of an agent that repeatedly senses observations of its environment and performs actions. After each action $a_t$, the environment changes state to $s_{t+1}$ and the agent receives a reward $r_{t+1}$ and an observation $o_{t+1}$. The agent has no knowledge of the transition function $T$ and the reward function $R$ and has to interact with its environment in order to learn a policy $\pi(a_t \mid o_t)$, which gives the probability distribution of taking each of the actions for any given observation. The optimal policy $\pi^*$ is the one that, when followed by the agent, maximizes the cumulative discounted reward $\sum_t \gamma^t r_t$, with $\gamma$ the discount factor.
When the reward received by the agent depends solely on its current observation and action, the problem reduces to a Markov decision process and is said to be completely observable (the agent can assume that $s_t = o_t$ without losing learning abilities). Partially observable Markov decision problems occur when the reward does not depend only on $o_t$, but on a state $s_t$ whose dynamics still obey some underlying MDP but which the agent cannot observe directly. In this case, $o_t = f(s_t)$, with $f$ an unknown one-way function that is part of the environment.
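As a small illustration of the cumulative discounted reward that the optimal policy maximizes, the following sketch sums a finite list of rewards weighted by powers of the discount factor. The function name `discounted_return` and the example discount factor are illustrative assumptions and are not taken from the experiments.

```python
def discounted_return(rewards, gamma):
    """Return sum over t of gamma**t * rewards[t] for a finite episode."""
    total = 0.0
    # Walk backwards so that each step is a single multiply-add.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Three rewards of 1 with a discount factor of 0.5: 1 + 0.5 + 0.25.
example = discounted_return([1.0, 1.0, 1.0], 0.5)
```

With a discount factor strictly below 1, rewards that arrive later contribute less to the total, which is why the agent prefers reaching the goal quickly.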
2.1 Q-Learning and Advantage Learning
Q-Learning [13] and Advantage Learning [6] allow an agent to learn a policy that converges to the optimal policy given an infinite amount of time and in discrete domains.
Q-Learning estimates the $Q$ function, which maps each state-action pair to the expected, optimal cumulative discounted reward reachable by taking action $a_t$ given observation $o_t$. At each time step, the agent observes $o_t$, takes action $a_t$ and observes $r_{t+1}$ and $o_{t+1}$. Equation 1 is used to update the Q function after each time step, with $\alpha$ the learning factor.
(1)
(2) 
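A minimal tabular sketch of a Q-Learning update, assuming the standard temporal-difference form in which the current estimate is moved towards the reward plus the discounted best value of the next observation. The function name, the dictionary-based table and the default values of `alpha` and `gamma` are illustrative assumptions.

```python
from collections import defaultdict


def q_learning_update(q, obs, action, reward, next_obs, actions,
                      alpha=0.1, gamma=0.9):
    """Apply one tabular Q-Learning update in place and return the TD error."""
    # Value of the best action available after the transition.
    best_next = max(q[(next_obs, a)] for a in actions)
    td_error = reward + gamma * best_next - q[(obs, action)]
    # Move the current estimate a fraction alpha towards the target.
    q[(obs, action)] += alpha * td_error
    return td_error


q_table = defaultdict(float)  # unseen (observation, action) pairs start at 0
q_learning_update(q_table, obs=0, action=1, reward=1.0, next_obs=1, actions=[0, 1])
```

In the experiments the table is replaced by a neural network, so this sketch only illustrates the update rule itself.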
Advantage Learning [6] is related to Q-Learning, but artificially decreases the value of non-optimal actions. This widens the difference between the value of the optimal action and the other ones, which allows learning to converge more easily even if the values are approximated (using function approximation). Equation 3 is used to update the Advantage values at each time step [1]. The smaller $\kappa$ is, the wider the gap between the optimal and non-optimal actions becomes.
(3)  
(4) 
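A minimal tabular sketch of an Advantage Learning update, assuming the commonly used form in which the temporal-difference term is divided by a scaling factor (called `kappa` here) before being added to the value of the best current action. The function name, the table layout and the default parameter values are illustrative assumptions.

```python
from collections import defaultdict


def advantage_update(adv, obs, action, reward, next_obs, actions,
                     alpha=0.1, gamma=0.9, kappa=0.5):
    """Apply one tabular Advantage Learning update in place; return the target."""
    # The value of an observation is the advantage of its best action.
    value = max(adv[(obs, a)] for a in actions)
    next_value = max(adv[(next_obs, a)] for a in actions)
    # Dividing by kappa < 1 amplifies the (negative) term of non-optimal actions.
    target = value + (reward + gamma * next_value - value) / kappa
    adv[(obs, action)] += alpha * (target - adv[(obs, action)])
    return target


adv_table = defaultdict(float)
advantage_update(adv_table, obs=0, action=1, reward=1.0, next_obs=1, actions=[0, 1])
```

With `kappa=1.0` the target reduces to the Q-Learning target, which makes the relation between the two algorithms explicit.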
In very large or even continuous environments, exact representation of the Q-function (or Advantage function) is no longer possible. In these cases a function approximation architecture is needed to represent the target function. It has been shown, however, that online Q-Learning can diverge, or converge very slowly, when used in combination with function approximation [11]. One solution to this problem is to learn the values offline. The method used in this paper is the neural fitted Q iteration described in [11], an adaptation of fitted Q iteration [5] using neural networks. The agent interacts with its environment using a fixed policy until it reaches the goal or a maximum number of time steps has elapsed, and collects samples of the form $(o_t, a_t, r_{t+1}, o_{t+1})$. After a number of episodes have been run, the model is trained in batch on the collected data. The model maps sequences of observations to action values.
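The batch step of a fitted Q iteration can be sketched as follows: the collected transitions are turned into regression targets using the current model, and the model is then retrained on those targets. The sample layout (including the `terminal` flag), the `predict` callable and the function name are assumptions made for illustration, not the exact procedure of [11].

```python
def fitted_q_targets(samples, predict, gamma=0.9):
    """Turn collected transitions into (obs, action, target) training triples.

    samples: iterable of (obs, action, reward, next_obs, terminal) tuples.
    predict: callable mapping an observation to a list of action values.
    """
    triples = []
    for obs, action, reward, next_obs, terminal in samples:
        # Terminal transitions have no future value to bootstrap from.
        bootstrap = 0.0 if terminal else gamma * max(predict(next_obs))
        triples.append((obs, action, reward + bootstrap))
    return triples
```

Because the targets are computed from a frozen copy of the model, the regression problem solved in each batch is an ordinary supervised learning problem.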
The next subsections describe the different recurrent network architectures that we consider in this paper to represent the target functions.
2.2 Long Short-Term Memory
An LSTM [7] cell stores a value. An output gate allows the cell to modulate its output strength, while an input gate controls the intensity of the input signal that is continuously added to the cell’s content. A forget gate, when set to zero, clears the content of the cell. Equations 5 to 7 show how the values of the gates are computed. Equations 8 and 9 show how to compute the value of the memory cell, and Equation 10 shows the output of an LSTM cell.
(5)  
(6)  
(7)  
(8)  
(9)  
(10) 
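The gate, memory cell and output computations described above can be sketched for a single cell with scalar input and state. This assumes the standard LSTM formulation with a forget gate; the parameter layout (one `(w_x, w_h, bias)` triple per gate) and the function names are illustrative assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One time step of a single LSTM cell with scalar input and state.

    params maps "i", "f", "o" and "c" to (w_x, w_h, bias) triples.
    Returns the new output h and the new cell content c.
    """
    def pre_activation(name):
        w_x, w_h, bias = params[name]
        return w_x * x + w_h * h_prev + bias

    i = sigmoid(pre_activation("i"))          # input gate
    f = sigmoid(pre_activation("f"))          # forget gate
    o = sigmoid(pre_activation("o"))          # output gate
    c_tilde = math.tanh(pre_activation("c"))  # candidate cell content
    c = f * c_prev + i * c_tilde              # forget old content, add new
    h = o * math.tanh(c)                      # gated output
    return h, c
```

A forget gate close to zero makes `c` depend only on the new candidate, which corresponds to clearing the content of the cell.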
In [1], Bakker proposes a neural network architecture tailored for reinforcement learning. The network has one input neuron per observation variable, and one output neuron per action. A softmax layer transforms the output of the neural network into a probability distribution over the actions. The neural network itself consists of an LSTM layer and a simple tanh layer working in parallel: the input of the network is fed to both the LSTM and tanh layers, both layers are connected to the output, the output of the tanh layer is connected to the input of the LSTM layer, and the output of the LSTM layer is connected to the input of the tanh layer (see Figure 1). This article uses a simpler version of the network: the input is connected to a tanh layer, which is in turn connected to an LSTM layer, which is connected to the output. Both the tanh layer and the LSTM layer contain 100 neurons (or LSTM cells).^{1}
^{1} The models themselves are built on Keras (http://keras.io/), a Python library providing neural network primitives based on Theano [2]. Keras provides LSTM, GRU, MUT and dense fully-connected weighted layers (among others). Layers can be assembled either in a stack or in a directed acyclic graph. The connection scheme in [1] makes the network layer graph cyclic, and hence impossible to build using the current version of Keras.
2.3 Gated Recurrent Unit
GRU has been introduced recently and follows a design completely different from LSTM [3, 4]. Instead of storing a value in a memory cell and updating it using input and forget gates, a GRU unit computes a candidate activation based on its input, and then produces an output that is a blend of its past output and the candidate activation. Equations 11 and 12 show how the Z (modulation) and R (reset) gates are computed. Equations 13 and 14 show how the input is mixed with the last activation in order to produce the candidate activation, and Equation 15 shows how the last activation and the candidate activation are mixed to produce the new activation.
(11)  
(12)  
(13)  
(14)  
(15) 
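The GRU computations described above can be sketched for a single unit with scalar input and state. This assumes the formulation of [4], in which the gate Z weights the candidate activation and its complement weights the previous activation; other presentations swap the two roles. The parameter layout and function names are illustrative assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gru_step(x, h_prev, params):
    """One time step of a single GRU unit with scalar input and state.

    params maps "z", "r" and "h" to (w_x, w_h, bias) triples.
    """
    def pre_activation(name, h):
        w_x, w_h, bias = params[name]
        return w_x * x + w_h * h + bias

    z = sigmoid(pre_activation("z", h_prev))  # modulation gate
    r = sigmoid(pre_activation("r", h_prev))  # reset gate
    # The reset gate scales how much of the past activation reaches the candidate.
    h_tilde = math.tanh(pre_activation("h", r * h_prev))
    # Blend the previous activation with the candidate activation.
    return (1.0 - z) * h_prev + z * h_tilde
```

Unlike the LSTM cell, the unit has no separate memory content: its output is its only state.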
2.4 MUT1
Józefowicz et al. observed that GRU and LSTM are very different from each other, and wondered whether other recurrent neural architectures could be used. In order to discover them, they developed a genetic algorithm that evaluated thousands of recurrent neural architectures. Once the experiment was finished, they identified three architectures that performed as well as or better than LSTM and GRU on their test vectors: MUT1, MUT2 and MUT3 [8].
This paper only considers MUT1, which produced the best results in preliminary experiments. Equations 16 and 17 show how to compute the value of the Z and R gates, Equations 18 and 19 show how to compute the candidate activation, and Equation 20 shows that the output of MUT1 uses the same type of mixing as the one used by GRU.
(16)  
(17)  
(18)  
(19)  
(20) 
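A sketch of one MUT1 step for a single unit with scalar input and state, assuming the formulation published by Józefowicz et al.: the Z gate depends on the input only, and the input enters the candidate activation through a tanh without a weight. The parameter layout and function names are illustrative assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def mut1_step(x, h_prev, params):
    """One time step of a single MUT1 unit with scalar input and state.

    params["z"] is (w_x, bias), params["r"] is (w_x, w_h, bias)
    and params["h"] is (w_h, bias).
    """
    w_xz, b_z = params["z"]
    w_xr, w_hr, b_r = params["r"]
    w_hh, b_h = params["h"]

    z = sigmoid(w_xz * x + b_z)                  # gate computed from the input only
    r = sigmoid(w_xr * x + w_hr * h_prev + b_r)  # reset gate
    # Candidate activation: reset-gated past activation plus squashed input.
    h_tilde = math.tanh(w_hh * (r * h_prev) + math.tanh(x) + b_h)
    # Same kind of blending between candidate and past activation as in GRU.
    return z * h_tilde + (1.0 - z) * h_prev
```

In a layer with several units, the unweighted `tanh(x)` term requires the input and the hidden state to have the same size, or an extra projection of the input.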
3 Experimental Setup
In order to keep training time manageable, the neural networks are trained to associate values with the last 10 observations, instead of the complete history. LSTM, GRU and MUT1 are able to associate values to arbitrarily long sequences of inputs, but Keras requires all the sequences on which it is trained to have the same length (possibly with padding).
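Building fixed-length training sequences from a growing history can be sketched as follows. Left-padding with a zero value and the function name `last_observations` are illustrative assumptions; with one-hot encoded observations the padding element would be an all-zero vector.

```python
def last_observations(history, length=10, pad=0):
    """Return the last `length` observations, left-padded with `pad`."""
    window = list(history[-length:])
    # Short histories (start of an episode) are padded to the fixed length.
    return [pad] * (length - len(window)) + window


early = last_observations([1, 2, 3], length=5)        # [0, 0, 1, 2, 3]
late = last_observations(list(range(12)), length=10)  # observations 2 to 11
```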
Training has to be done carefully, because one does not want the model to forget past experiences when a new batch of episodes is learned. The network has been configured to perform 2 training epochs on the data, using a batch size of 10 (batches of 10 samples are used to compute an average gradient when performing backpropagation). The small number of epochs prevents the model from overfitting specific episodes.
3.1 Environments
Three environments are used to evaluate the neural network models. The first one is a simple fully observable grid world with the initial position at , the goal at and an obstacle at (see Figure 2). The agent can observe its coordinates. It receives a reward of at each time step, if it hits a wall or the obstacle, and when it reaches the goal.
The second environment is based on the same grid world as the first one, but the agent can only observe its $x$ coordinate. The $y$ coordinate is masked to zero.
The last environment is also based on the grid world, but the agent can only observe its orientation (whether it is facing up, down, left or right, expressed as an integer from 0 to 3) and the distance between it and the wall in front of it. This agent-centric environment is very close to what actual robots can experience.
The “stochastic” variant of the experiments uses a random initial position for every episode. The agent can sense its initial coordinate at the first time step, even in otherwise partially observable environments.^{2}
^{2} Some experiments have been rerun without this hint, with no change in the results. The agent learns to look left, then up, and uses those observations as initial position.
The observations of the agent, which consist of integer numbers, are encoded using a one-hot encoding so that they are more easily processed by neural networks. For instance, the y coordinate of the grid world can take values from 0 to 4, which are encoded as $(1, 0, 0, 0, 0)$ to $(0, 0, 0, 0, 1)$. For the grid world, the neural networks therefore have 15 input neurons.
3.2 Experiments
Each experiment consists of 5000 episodes of a maximum of 500 time steps. During the episodes, the neural network is not trained on any new data, but values are computed based on and stored in a list. After every batch of 10 episodes, the neural networks are trained on the values, as described in [11] and shown in Figure 3.
The experiments themselves consist of trying to reach the goal in one of the environments described in Section 3.1. Each experiment is run 15 times for each combination of the following parameters:

Value iteration: Q-Learning and Advantage Learning, , and

Neural network architecture: feedforward perceptron with a single hidden layer (nnet), LSTM (lstm), GRU (gru) and MUT1 (mut1)

World: gridworld (gw), partially observable gridworld (po) and agentcentric gridworld (ac)

Fixed initial position and random initial position

Softmax action selection with a temperature of 0.5
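Softmax action selection with a temperature can be sketched as follows: action values are divided by the temperature, exponentiated and normalized, and an action is then sampled from the resulting distribution. The function names are illustrative assumptions; only the temperature of 0.5 comes from the parameter list above.

```python
import math
import random


def softmax_probabilities(values, temperature=0.5):
    """Return the softmax distribution over action values."""
    # Subtracting the maximum keeps the exponentials in a safe numeric range.
    top = max(values)
    exps = [math.exp((v - top) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def select_action(values, temperature=0.5, rng=random):
    """Sample an action index according to the softmax distribution."""
    probabilities = softmax_probabilities(values, temperature)
    return rng.choices(range(len(values)), weights=probabilities, k=1)[0]
```

A lower temperature concentrates the probability mass on the highest-valued action, while a higher temperature makes the selection closer to uniform.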
4 Empirical Results
Each experiment (see Section 3.2) is run 15 times. The first time step at which the agent is able to maintain an average reward (over the next 1000 time steps) of more than with a standard deviation less than 20 is called the learning time. The best average reward obtained during a 1000-time-step window is called the learning performance.
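The two measures defined above can be sketched as follows, given the sequence of rewards obtained by an agent. The reward threshold is left as a parameter, the use of the population standard deviation is an assumption, and the function names are hypothetical.

```python
import statistics


def learning_time(rewards, threshold, window=1000, max_std=20.0):
    """First index t whose following `window` rewards have a mean above
    `threshold` and a standard deviation below `max_std`, or None."""
    for t in range(len(rewards) - window + 1):
        chunk = rewards[t:t + window]
        if (statistics.mean(chunk) > threshold
                and statistics.pstdev(chunk) < max_std):
            return t
    return None


def learning_performance(rewards, window=1000):
    """Best average reward over any window of `window` consecutive rewards."""
    means = [statistics.mean(rewards[t:t + window])
             for t in range(len(rewards) - window + 1)]
    return max(means) if means else None
```

This direct implementation recomputes the statistics for every window, which is simple but quadratic in the worst case; running sums would be preferable for long reward sequences.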








Table 1 shows the learning time of the different neural networks for all the experiment configurations. Best results are emphasized in bold. Results are displayed in a mean/stddev format.
Advantage Learning leads to smaller learning times and standard deviations than Q-Learning in all worlds except the partially observable grid world. Figure 4 shows the behavior of the Q-Learning and Advantage Learning algorithms in the partially observable grid world, where Q-Learning allows faster convergence with a smaller standard deviation.
When using a fixed initial position, GRU learns faster than any other network. The difference in learning speed between GRU and LSTM is statistically significant for (gw, Advantage), (gw, Q-Learning), (po, Q-Learning) and (ac, Q-Learning) (p-values of 0.003, 0.008, 0.0003 and 0.0003, respectively), but not for (po, Advantage) and (ac, Advantage) (p-values of 0.118 and 0.140, respectively).
When using a random initial position, GRU is the only model allowing learning in all the environments when Advantage Learning is used. LSTM and GRU give comparable results in the partially observable worlds, with no statistically significant difference between them.
Agents using MUT1 as a function approximator nearly always manage to learn a good enough policy in partially observable worlds, but they need a large number of episodes to do so. However, plain perceptron-based agents do not manage to learn a policy at all in these worlds,^{3} which shows that MUT1 allows better learning in partially observable worlds than a simple non-recurrent neural network.
^{3} Except in the partially observable grid world using Q-Learning and random initial positions, where the agent learns to go left, then randomly go up and down until the goal is reached by chance.
Table 2 shows the learning performance of the different neural networks, with the highest values highlighted in bold. Results are displayed in a mean/stddev format.
The feedforward neural network always achieves the best scores in the grid world, followed by GRU, then LSTM, and finally MUT1. GRU always outperforms the other network architectures in the partially observable worlds. In these worlds, GRU is statistically significantly better than LSTM in all cases (p-value less than 0.0001) except in (random initial, ac, Q-Learning) (p-value of 0.682).
5 Conclusion
LSTM, GRU and MUT1 have been compared on simple reinforcement learning problems. It has been shown that agents using LSTM and GRU for approximating Q or Advantage values perform significantly better than the ones using MUT1, obtaining higher rewards and learning faster.
GRU and LSTM provide comparable performance, with GRU often being significantly better than LSTM. LSTM is never significantly better than GRU. When considering the rewards received by the agents once they have learned, and not the time required for learning, GRU always achieves better results than LSTM.
This shows that using GRU instead of LSTM should be considered when tackling reinforcement learning problems. Moreover, on the machine used for the experiments, the simpler GRU cell (compared to LSTM) allowed the GRU-based agents to complete their 5000 episodes approximately twice as fast as LSTM-based agents.
References
[1] B. Bakker. Reinforcement Learning with Long Short-Term Memory. In Advances in Neural Information Processing Systems 14, pages 1475–1482, 2001.
[2] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, and Y. Bengio. Theano: a CPU and GPU Math Expression Compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), 2010.
[3] K. Cho, B. van Merrienboer, Ç. Gülçehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1724–1734, 2014.
[4] J. Chung, Ç. Gülçehre, K. Cho, and Y. Bengio. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. CoRR, abs/1412.3555, 2014.
[5] D. Ernst, P. Geurts, and L. Wehenkel. Tree-Based Batch Mode Reinforcement Learning. Journal of Machine Learning Research, 6:503–556, 2005.
[6] M. E. Harmon and L. C. Baird III. Multi-player residual advantage learning with general function approximation. Wright Laboratory, WL/AACF, Wright-Patterson Air Force Base, OH, 1996.
[7] S. Hochreiter and J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, 1997.
[8] R. Józefowicz, W. Zaremba, and I. Sutskever. An Empirical Exploration of Recurrent Network Architectures. In Proceedings of the 32nd International Conference on Machine Learning, pages 2342–2350, 2015.
[9] L.-J. Lin and T. Mitchell. Reinforcement Learning with Hidden States. In From Animals to Animats 2: Proceedings of the Second International Conference on Simulation of Adaptive Behavior, volume 2, page 271. MIT Press, 1993.
[10] A. K. McCallum. Learning to use selective attention and short-term memory in sequential tasks. In From Animals to Animats 4: Proceedings of the Fourth International Conference on Simulation of Adaptive Behavior, volume 4, page 315. MIT Press, 1996.
[11] M. A. Riedmiller. Neural Fitted Q Iteration - First Experiences with a Data Efficient Neural Reinforcement Learning Method. In Machine Learning: ECML 2005, 16th European Conference on Machine Learning, pages 317–328, 2005.
[12] O. Tange. GNU Parallel - The Command-Line Power Tool. ;login: The USENIX Magazine, 36(1):42–47, Feb 2011.
[13] C. J. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.