1 Introduction
We study the problem of using reinforcement learning (RL) to optimize a controller represented as a recurrent neural network (RNN). RNNs are attractive because they accumulate sequential observations into a latent representation of state, and thus naturally handle partially observable environments such as dialog, robot navigation, and autonomous vehicle control.
Among the many methods for RL optimization (Sutton and Barto, 1998), we adopt the policy gradient approach (Williams, 1992). Policy gradient approaches are a natural fit for recurrent neural networks because both make updates via stochastic gradient descent. They also have strong convergence characteristics compared to value-function methods such as Q-learning, which can diverge when using function approximation (Precup et al., 2001). Finally, the form of the policy makes it straightforward to also train the model from expert trajectories (i.e., training dialogs), which are often available in real-world settings.
Despite these advantages, in practice policy gradient methods are often sample inefficient, which is limiting in real-world settings where exploratory interactions (i.e., conducting dialogs) can be expensive.
The contribution of this paper is to present a family of new methods for increasing the sample efficiency of policy gradient methods, where the policy is represented as a recurrent neural network (RNN). Specifically, we make two changes to the standard policy gradient approach. First, we estimate a second RNN, the value network, which predicts the expected future reward the policy will attain in the current state; during updates, the value network reduces the error (variance) in the gradient step, at the expense of additional computation for maintaining the value network. Second, we add experience replay to both networks, allowing more gradient steps to be taken per dialog.
2 Preliminaries
In a reinforcement learning problem, an agent interacts with a stateful environment to maximize a numeric reward signal. Concretely, at a timestep $t$, the agent takes an action $a_t$, is awarded a real-valued reward $r_t$, and receives an observation vector $o_t$. The goal of the agent is to choose actions to maximize the discounted sum of future rewards, called the return, $G$. In an episodic problem, the return at a timestep $t$ is:
$$G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k \tag{1}$$
where $T$ is the terminal timestep, and $\gamma$ is a discount factor, $0 \le \gamma \le 1$.
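As an illustration of the return in (1), the following minimal sketch computes the discounted return at every timestep of one finished episode with a single backward pass. The function name `discounted_returns`, the list-based reward format, and the example values are assumptions made for this sketch, not part of the original system.

```python
def discounted_returns(rewards, gamma):
    """Return a list whose entry t is the discounted sum of rewards from t onward.

    rewards[t] is the reward received at timestep t of one finished episode
    (an assumed format); gamma is the discount factor.
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backward so each return reuses the one after it:
    # G_t = r_t + gamma * G_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Hypothetical 3-step episode with a single reward at the end.
example_returns = discounted_returns([0.0, 0.0, 1.0], gamma=0.5)
```

The backward recursion costs one pass over the episode, rather than re-summing the remaining rewards at every timestep.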
In this paper, we consider policies represented as a recurrent neural network (RNN). Internally the RNN maintains a vector representing a latent state $h$, and the latent state begins in a fixed state $h_0$. At each timestep $t$, an RNN takes as input an observation vector $o_t$, updates its internal state according to a differentiable function $h_t = f(h_{t-1}, o_t; \theta)$, and outputs a distribution over actions according to a differentiable function $g(h_t; \theta)$, where $\theta$ denotes the parameters of both functions. $f$ and $g$ can be chosen to implement long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997), gated recurrent unit (GRU) (Cho et al., 2014), or other recurrent (or non-recurrent) models. $\pi(a_t \mid h_t; \theta)$ denotes the output of the RNN at timestep $t$.
Past work has established a principled method for updating the parameters of the policy with RL (Williams, 1992; Sutton et al., 2000; Peters and Schaal, 2006) using stochastic gradient descent:
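$$\theta \leftarrow \theta + \alpha \sum_{t=1}^{T} \nabla_\theta \log \pi(a_t \mid h_t; \theta)\, G_t \tag{2}$$
While this update is unbiased, in practice it has high variance and is slow to converge. Williams (1992) and Sutton et al. (2000) showed that this update can be rewritten as
$$\theta \leftarrow \theta + \alpha \sum_{t=1}^{T} \nabla_\theta \log \pi(a_t \mid h_t; \theta)\, (G_t - b) \tag{3}$$
where $\alpha$ is a step size and $b$ is a baseline, which can be an arbitrary function of states visited in the dialog. Note that this update assumes that actions are drawn from the same policy parameterized by $\theta$ (i.e., this is an on-policy update).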
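To make the baseline update (3) concrete, here is a minimal sketch of one gradient step. It swaps the recurrent policy for a small tabular softmax policy so that the gradient of the log-probability can be written by hand; this substitution, along with the names `policy_gradient_step`, `theta`, and `alpha`, is an assumption of the sketch and not the paper's implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_gradient_step(theta, episode, returns, baseline, alpha):
    """Apply one baseline-corrected policy gradient step to a tabular softmax policy.

    theta: dict mapping a state to its list of action logits (assumed format).
    episode: list of (state, action) pairs; returns: the return at each timestep.
    """
    # Accumulate the whole-episode gradient first, so every term is
    # evaluated at the same parameters.
    grads = {s: [0.0] * len(logits) for s, logits in theta.items()}
    for (state, action), g in zip(episode, returns):
        probs = softmax(theta[state])
        for a, p in enumerate(probs):
            # Derivative of log pi(action | state) w.r.t. the logit of a.
            dlogp = (1.0 if a == action else 0.0) - p
            grads[state][a] += dlogp * (g - baseline)
    return {s: [x + alpha * gx for x, gx in zip(theta[s], grads[s])]
            for s in theta}


# Hypothetical one-step episode: action 1 was taken and earned return 1.0.
new_theta = policy_gradient_step({"s": [0.0, 0.0]}, [("s", 1)], [1.0],
                                 baseline=0.0, alpha=0.1)
```

When the baseline equals the observed return, the step is zero: the parameters only move when an outcome differs from what the baseline predicts.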
Throughout the paper, we also use importance sampling ratios that enable us to perform off-policy updates. Assume that some behavior policy, parameterized by $\theta'$, is used to generate dialogs, and may in general be different from the target policy, parameterized by $\theta$, that we wish to optimize. At timestep $t$, we define the importance sampling ratio as
$$\rho_t = \frac{\pi(a_t \mid h_t; \theta)}{\pi(a_t \mid h_t; \theta')} \tag{4}$$
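A per-timestep importance sampling ratio compares the probability the target policy assigns to the action actually taken against the probability the behavior policy assigned to it. The sketch below computes these ratios for one stored dialog; the function name and the assumption that both policies' action distributions are available as plain lists are illustrative only.

```python
def importance_ratios(target_probs, behavior_probs, actions):
    """Compute the ratio target / behavior for the action taken at each timestep.

    target_probs[t] and behavior_probs[t] are the action distributions of the
    two policies at timestep t (assumed to be stored as lists); actions[t] is
    the index of the action that the behavior policy actually took.
    """
    ratios = []
    for t, action in enumerate(actions):
        behavior_p = behavior_probs[t][action]
        if behavior_p <= 0.0:
            # The behavior policy cannot have taken an action it assigns
            # zero probability to, so the stored data must be inconsistent.
            raise ValueError("behavior policy assigned zero probability "
                             "to a logged action at timestep %d" % t)
        ratios.append(target_probs[t][action] / behavior_p)
    return ratios


# Hypothetical two-step dialog with two actions per step.
example_ratios = importance_ratios(
    target_probs=[[0.8, 0.2], [0.5, 0.5]],
    behavior_probs=[[0.5, 0.5], [0.5, 0.5]],
    actions=[0, 1],
)
```

A ratio above 1 means the target policy favors the logged action more than the behavior policy did; when both policies are identical every ratio is 1 and the update reduces to the on-policy case.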
3 Methods
3.1 Benchmarks
Before introducing our methods, we first describe our two benchmarks. The first uses (2) directly. The second uses (3), with computed as an estimate of the average return of :
(5) 
where is a window of most recent episodes (dialogs), and the weight of each dialog is . To compute ratios using (4), the policy that generated the data is , and the current (ie, target) policy is .
3.2 Method 1: State value function as baseline
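Our first method modifies parameter update (3) by using a per-timestep baseline:
$$\theta \leftarrow \theta + \alpha \sum_{t=1}^{T} \nabla_\theta \log \pi(a_t \mid h_t; \theta)\, \bigl(G_t - V(h_t; w)\bigr) \tag{6}$$
where $V(h_t; w)$ is an estimate of the value of starting from $h_t$, and $w$ parameterizes $V$. This method allows a gradient step to be taken in light of the value of the current state. We implement $V$ as a second RNN, and update its parameters at the end of each dialog using supervised learning with step size $\beta$, as
$$w \leftarrow w + \beta \sum_{t=1}^{T} \bigl(G_t - V(h_t; w)\bigr)\, \nabla_w V(h_t; w) \tag{7}$$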
Note that this update is also on-policy, since the policy generating the episode is the same as the policy for which we want to estimate the value.
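The supervised value update can be sketched as a regression of the value estimate toward the observed return at each timestep. For brevity the sketch below uses a linear value function over hand-supplied feature vectors instead of a second RNN; that substitution, and the names `value_update`, `advantages`, and `beta`, are assumptions of this example.

```python
def predict(w, x):
    """Linear value estimate: dot product of weights and features."""
    return sum(wi * xi for wi, xi in zip(w, x))


def value_update(w, features, returns, beta):
    """One end-of-dialog regression step of the value estimate toward the returns.

    features[t] is the feature vector at timestep t (standing in for the
    value network's state); returns[t] is the observed return from t.
    """
    grad = [0.0] * len(w)
    for x, g in zip(features, returns):
        error = g - predict(w, x)
        # For a linear model the gradient of the estimate w.r.t. w is x.
        for i, xi in enumerate(x):
            grad[i] += error * xi
    return [wi + beta * gi for wi, gi in zip(w, grad)]


def advantages(w, features, returns):
    """Return minus value estimate: the per-timestep term that scales the policy gradient."""
    return [g - predict(w, x) for x, g in zip(features, returns)]


# Hypothetical two-step dialog with one-hot features.
w = value_update([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [1.0, 2.0], beta=0.5)
adv = advantages(w, [[1.0, 0.0], [0.0, 1.0]], [1.0, 2.0])
```

As the value estimate improves, the advantages shrink toward zero on typical outcomes, which is the variance reduction the per-timestep baseline is meant to provide.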
3.3 Method 2: Experience replay for value network
Method 2 increases learning speed further by reusing past dialogs to better estimate the value function. Since the policy changes after each dialog, past dialogs are off-policy with respect to the current policy, so a correction to (7) is needed. Precup et al. (2001) showed that the following off-policy update is equal to the on-policy update, in expectation:
Our second method takes one step with the on-policy update (7) using the last dialog, and one or more steps with the off-policy update using dialogs sampled from recent dialogs.
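The replay schedule described here (one on-policy step on the newest dialog, then several off-policy steps on recent dialogs) can be sketched with a bounded buffer. The class `DialogReplay`, the function `update_schedule`, uniform sampling, and the choice to add the newest dialog to the buffer before sampling are all assumptions of this sketch; the text does not specify them.

```python
import random
from collections import deque


class DialogReplay:
    """Keeps only the most recent `capacity` dialogs."""

    def __init__(self, capacity, seed=0):
        self.buffer = deque(maxlen=capacity)  # oldest dialogs fall off the left
        self.rng = random.Random(seed)

    def add(self, dialog):
        self.buffer.append(dialog)

    def sample(self):
        # Uniform sampling over recent dialogs (an assumption).
        return self.rng.choice(list(self.buffer))


def update_schedule(replay, new_dialog, num_replay_steps):
    """Return the (dialog, is_on_policy) pairs to train on after one new dialog."""
    replay.add(new_dialog)
    steps = [(new_dialog, True)]  # one on-policy step on the newest dialog
    for _ in range(num_replay_steps):
        # Replayed dialogs are treated as off-policy and would need
        # importance sampling corrections in the actual update.
        steps.append((replay.sample(), False))
    return steps


replay = DialogReplay(capacity=2)
for name in ["d1", "d2", "d3"]:
    steps = update_schedule(replay, name, num_replay_steps=3)
```

In a full training loop, each pair returned by `update_schedule` would be passed to the on-policy or off-policy update as indicated by its flag.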
3.4 Method 3: Experience replay for policy network
Our third method improves over the second method by applying experience replay to the policy network. Specifically, Degris et al. (2012) show that samples of the following expectation, taken under the behavior policy, can be used to estimate the gradient of the policy network representing the target policy:
(8) 
We do not have access to , but since , we can state the following offpolicy update for the policy network:
Method 3 first applies Method 2, then updates the policy network by taking one step with the on-policy update (6), followed by one or more steps with the off-policy update using samples from recent dialogs.
4 Problem 1: dialog system
To test our approach, we created a dialog system for initiating phone calls to a contact in an address book, taken from the Microsoft internal employee directory. Full details are given in Williams and Zweig (2016); briefly, in the dialog system, there are three entity types – name, phonetype, and yesno. A contact’s name may have synonyms (“Michael”/“Mike”), and a contact may have more than one phone type (e.g., “work”, “mobile”), which in turn have synonyms (e.g., “cell” for “mobile”).
To run large numbers of RL experiments, we then created a stateful goal-directed simulated user, where the goal was sometimes not covered by the dialog system, and where the behavior was stochastic – for example, the simulated user usually answers questions, but can instead ignore the system, provide additional information, or give up. The user simulation was parameterized with around 10 probabilities.
We defined the reward as for successfully completing the task, and otherwise. was used to incentivize the system to complete dialogs faster rather than slower. For the policy network, we defined and to implement an LSTM with 32 hidden units, with a dense output layer with a softmax activation. The value network was identical in structure except it had a single output with a linear activation. We used a batch size of 1, so we update both networks after completion of a single dialog. We used Adadelta with stepsize and . Dialogs took between 3 and 10 timesteps. Every 10 dialogs, the policy was frozen and run for 1000 dialogs to measure average task completion.
5 Problem 2: lunar lander
To test generality and provide for reproducibility, we sought to evaluate on a publicly available dialog task. However, to our knowledge none exists, and so we instead applied our method to a public but non-dialog RL task, called “Lunar Lander”, from the OpenAI Gym (https://gym.openai.com/envs/LunarLanderv2). This domain has a continuous (fully observable) state space in 8 dimensions (e.g., x-y coordinates, velocities, leg-touchdown sensors), and 4 discrete actions (controlling 3 engines). The reward is +100 for a safe landing in the designated area, and −100 for a crash. Using the engines also incurs a cost, as explained at the link above. Episodes finish when the spacecraft crashes or lands. We used .
Since this domain is fully observable, we chose the state-update and output functions of the policy network to correspond to a fully connected neural network with 2 hidden layers, followed by a softmax normalization. We further chose ReLU activations and 16 hidden units, based on limited initial experimentation. The value network has the same architecture except for the output layer, which has a single node with linear activation. We used a batch size of 10 episodes, as we found that divergence appears frequently with a batch of 1. We used a stepsize of and used the Adam algorithm (Kingma and Ba, 2014) with its default parameters. Methods 2 and 3 perform 5 off-policy updates per on-policy update for the value network. Method 3 performs 3 off-policy updates per on-policy update for the policy network.
Results are in Figure 3, and show a similar increase in sample efficiency as in the dialog task.
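For reference, the forward pass of a policy network of the shape described for this task (8 inputs, 2 fully connected hidden layers with ReLU, softmax over 4 actions) can be sketched without any deep learning library. The sketch assumes 16 units in each hidden layer, uniform random initialization, and zero biases; none of these details is confirmed by the text, and training is omitted.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Random weights (n_out rows of n_in values) and zero biases; init scheme is an assumption."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(layer, x):
    """Affine map: one output per weight row."""
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    return [max(0.0, v) for v in values]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_forward(layers, observation):
    """Map an 8-dimensional observation to a distribution over 4 actions."""
    hidden = relu(dense(layers[0], observation))
    hidden = relu(dense(layers[1], hidden))
    return softmax(dense(layers[2], hidden))


rng = random.Random(0)
layers = [init_layer(8, 16, rng), init_layer(16, 16, rng), init_layer(16, 4, rng)]
action_probs = policy_forward(layers, [0.0] * 8)
```

A value network of the same shape would replace the final softmax layer with a single linear output.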
6 Related work
Since neural networks naturally lend themselves to policy gradient-style updates, much past work has adopted this broad approach. However, most work has studied the fully observable case, whereas we study the partially observable case. For example, AlphaGo (Silver et al., 2016) applies policy gradients (among other methods), but Go is fully observable via the state of the board.
Several papers that study fully observable RL are related to our work in other ways. Degris et al. (2012) investigate off-policy policy gradient updates, but are limited to linear models. Our use of experience replay is also off-policy optimization, but we apply (recurrent) neural networks. Like our work, Fatemi et al. (2016) also estimate a value network, use experience replay to optimize that value network, and evaluate on a conversational system task. However, unlike our work, they do not use experience replay in the policy network, their networks rely on an external state tracking process to render the state fully observable, and they learn feed-forward networks rather than recurrent networks.
Hausknecht and Stone (2015) apply RNNs to partially observable RL problems, but adopt a Q-learning approach rather than a policy gradient approach. Whereas policy gradient methods have strong convergence properties, Q-learning can diverge, and we observed this when we attempted to optimize Q represented as an LSTM on our dialog problem. Also, a policy network can be pre-trained directly from (near) expert trajectories using classical supervised learning, and in real-world applications these trajectories are often available (Williams and Zweig, 2016).
7 Conclusions
We have introduced 3 methods for increasing sample efficiency in policy gradient RL. In a dialog task with partially observable state, our best method improved sample efficiency by about a third. On a second, fully observable task, we observed a similar gain in sample efficiency, despite using a different network architecture, activation function, and optimizer. This result suggests that the method is robust to variation in task and network design, and thus it seems promising that it will generalize to other dialog domains as well. In future work we will apply the method to a dialog system with real human users.
References
 Cho et al. [2014] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
 Chollet [2015] François Chollet. Keras. https://github.com/fchollet/keras, 2015.
 Degris et al. [2012] Thomas Degris, Martha White, and Richard S Sutton. Off-policy actor-critic. arXiv preprint arXiv:1205.4839, 2012.
 Fatemi et al. [2016] Mehdi Fatemi, Layla El Asri, Hannes Schulz, Jing He, and Kaheer Suleman. Policy networks with two-stage training for dialogue systems. In SIGDIAL 2016, 2016.
 Hausknecht and Stone [2015] Matthew J. Hausknecht and Peter Stone. Deep recurrent Q-learning for partially observable MDPs. CoRR, abs/1507.06527, 2015. URL http://arxiv.org/abs/1507.06527.
 Hochreiter and Schmidhuber [1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
 Jozefowicz et al. [2015] Rafal Jozefowicz, Wojciech Zaremba, and Ilya Sutskever. An empirical exploration of recurrent network architectures. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 2342–2350, 2015.
 Kingma and Ba [2014] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Peters and Schaal [2006] Jan Peters and Stefan Schaal. Policy gradient methods for robotics. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2219–2225. IEEE, 2006.
 Precup et al. [2001] Doina Precup, Richard S Sutton, and Sanjoy Dasgupta. Off-policy temporal-difference learning with function approximation. In ICML, pages 417–424, 2001.
 Silver et al. [2016] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529:484–489, 2016.
 Sutton and Barto [1998] R Sutton and A Barto. Reinforcement Learning: an Introduction. MIT Press, 1998.
 Sutton et al. [2000] Richard S. Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems (NIPS) 12, Denver, USA, pages 1057–1063, 2000.
 Theano Development Team [2016] Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, abs/1605.02688, May 2016. URL http://arxiv.org/abs/1605.02688.
 Williams and Zweig [2016] Jason D. Williams and Geoffrey Zweig. End-to-end LSTM-based dialog control optimized with supervised and reinforcement learning. CoRR, abs/1606.01269, 2016. URL http://arxiv.org/abs/1606.01269.
 Williams [1992] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(23), 1992.