Sample-efficient Deep Reinforcement Learning for Dialog Control

12/18/2016 ∙ by Kavosh Asadi, et al. ∙ Brown University ∙ Microsoft

Representing a dialog policy as a recurrent neural network (RNN) is attractive because it handles partial observability, infers a latent representation of state, and can be optimized with supervised learning (SL) or reinforcement learning (RL). For RL, a policy gradient approach is natural, but is sample inefficient. In this paper, we present 3 methods for reducing the number of dialogs required to optimize an RNN-based dialog policy with RL. The key idea is to maintain a second RNN which predicts the value of the current policy, and to apply experience replay to both networks. On two tasks, these methods reduce the number of dialogs/episodes required by about a third, vs. standard policy gradient methods.




1 Introduction

We study the problem of using reinforcement learning (RL) to optimize a controller represented as a recurrent neural network (RNN). RNNs are attractive because they accumulate sequential observations into a latent representation of state, and thus naturally handle partially observable environments, such as dialog, and also robot navigation, autonomous vehicle control, and others.

Among the many methods for RL optimization (Sutton and Barto, 1998), we adopt the policy gradient approach (Williams, 1992). Policy gradient approaches are a natural fit for recurrent neural networks because both make updates via stochastic gradient descent. They also have strong convergence characteristics compared to value-function methods such as Q-learning, which can diverge when using function approximation (Precup et al., 2001). Finally, the form of the policy makes it straightforward to also train the model from expert trajectories (i.e., training dialogs), which are often available in real-world settings.

Despite these advantages, in practice policy gradient methods are often sample inefficient, which is limiting in real-world settings where exploratory interactions (i.e., conducting dialogs) can be expensive.

The contribution of this paper is to present a family of new methods for increasing the sample efficiency of policy gradient methods, where the policy is represented as a recurrent neural network (RNN). Specifically, we make two changes to the standard policy gradient approach. First, we estimate a second RNN, a value network, which predicts the expected future reward the policy will attain in the current state; during updates, the value network reduces the error (variance) in the gradient step, at the expense of additional computation for maintaining the value network. Second, we add experience replay to both networks, allowing more gradient steps to be taken per dialog.

This paper is organized as follows. The next section reviews the policy gradient approach, Section 3 presents our methods, Sections 4 and 5 present results on two tasks, Section 6 covers related work, and Section 7 briefly concludes.

2 Preliminaries

In a reinforcement learning problem, an agent interacts with a stateful environment to maximize a numeric reward signal. Concretely, at a timestep $t$, the agent takes an action $a_t$, is awarded a real-valued reward $r_t$, and receives an observation vector $\mathbf{o}_t$. The goal of the agent is to choose actions to maximize the discounted sum of future rewards, called the return, $G$. In an episodic problem, the return at a timestep $t$ is:

$$G_t = \sum_{i=t}^{T} \gamma^{\,i-t}\, r_i \tag{1}$$

where $T$ is the terminal timestep, and $\gamma$ is a discount factor, $0 \le \gamma \le 1$.
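As a small illustration of the episodic return, the sketch below computes the return at every timestep of one dialog from its list of rewards. It assumes the reward list is ordered by timestep and uses the recursion that each return equals the current reward plus the discounted return of the next timestep; the function name `discounted_returns` is hypothetical.

```python
def discounted_returns(rewards, gamma):
    """Return the discounted return at every timestep of one episode.

    rewards[t] is the reward at timestep t; gamma is the discount factor.
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each return reuses the one after it:
    # G_t = r_t + gamma * G_{t+1}, with the return after the last step being 0.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```

With a discount factor below 1, a reward that arrives only at the end of a dialog contributes less to the return of earlier timesteps, so shorter successful dialogs score higher.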

In this paper, we consider policies represented as a recurrent neural network (RNN). Internally the RNN maintains a vector representing a latent state $\mathbf{s}$, and the latent state begins in a fixed state $\mathbf{s}_0$. At each timestep $t$, an RNN takes as input an observation vector $\mathbf{o}_t$, updates its internal state according to a differentiable function $\mathbf{s}_t = f(\mathbf{s}_{t-1}, \mathbf{o}_t; \theta)$, and outputs a distribution over actions according to a differentiable function $g(\mathbf{s}_t; \theta)$, where $\theta$ parameterizes the functions. $f$ and $g$ can be chosen to implement long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997), gated recurrent unit (Cho et al., 2014), or other recurrent (or non-recurrent) models. $\pi(a_t \mid \mathbf{s}_t; \theta)$ denotes the output of the RNN at timestep $t$.
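To make the roles of the state-update and output functions concrete, here is a minimal sketch of a recurrent policy. It is an assumption-laden stand-in: it uses a plain tanh (Elman-style) recurrence rather than an LSTM or GRU, and the class name `TinyRecurrentPolicy`, the weight initialization, and the dimensions are all hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class TinyRecurrentPolicy:
    """Hypothetical Elman-style stand-in for a recurrent policy (not an LSTM)."""

    def __init__(self, obs_dim, hidden_dim, num_actions, seed=0):
        rng = random.Random(seed)

        def init(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.w_in = init(hidden_dim, obs_dim)        # observation -> latent state
        self.w_rec = init(hidden_dim, hidden_dim)    # previous state -> latent state
        self.w_out = init(num_actions, hidden_dim)   # latent state -> action logits
        self.hidden_dim = hidden_dim
        self.reset()

    def reset(self):
        # The latent state starts from a fixed vector at the start of each dialog.
        self.state = [0.0] * self.hidden_dim

    def step(self, observation):
        # State update: new state depends on the previous state and the observation.
        pre = [a + b for a, b in zip(matvec(self.w_in, observation),
                                     matvec(self.w_rec, self.state))]
        self.state = [math.tanh(x) for x in pre]
        # Output: a probability distribution over actions.
        return softmax(matvec(self.w_out, self.state))
```

Because the latent state carries over between calls to `step`, the same observation can yield different action distributions at different points in a dialog, which is what lets the policy cope with partial observability.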

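Past work has established a principled method for updating the parameters of the policy with RL via stochastic gradient descent (Williams, 1992; Sutton et al., 2000; Peters and Schaal, 2006):

$$\theta \leftarrow \theta + \alpha \sum_{t=1}^{T} \nabla_{\theta} \log \pi(a_t \mid \mathbf{s}_t; \theta)\, G_t \tag{2}$$

where $\alpha$ is a step size and the sum runs over the timesteps of a dialog $d$. While this update is unbiased, in practice it has high variance and is slow to converge. Williams (1992) and Sutton et al. (2000) showed that this update can be re-written as

$$\theta \leftarrow \theta + \alpha \sum_{t=1}^{T} \nabla_{\theta} \log \pi(a_t \mid \mathbf{s}_t; \theta)\, (G_t - b) \tag{3}$$

where $b$ is a baseline, which can be an arbitrary function of states visited in dialog $d$. Note that this update assumes that actions are drawn from the same policy parameterized by $\theta$; i.e., this is an on-policy update.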
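The effect of a baseline can be checked numerically. The sketch below assumes a softmax output layer and works with gradients with respect to the logits only (a real implementation would backpropagate these through the RNN); the function names are hypothetical. It also verifies the reason a baseline leaves the update unbiased: the expected gradient contribution of any action-independent baseline is zero.

```python
def grad_log_softmax(probs, action):
    """Gradient of log pi(action) with respect to the softmax logits."""
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]


def reinforce_logit_gradients(step_probs, actions, returns, baseline=0.0):
    """Per-timestep ascent directions: grad log pi(a_t) scaled by (G_t - baseline)."""
    grads = []
    for probs, action, g in zip(step_probs, actions, returns):
        advantage = g - baseline
        grads.append([advantage * d for d in grad_log_softmax(probs, action)])
    return grads


def expected_baseline_term(probs, baseline):
    """Expectation over actions of grad log pi(a) * baseline.

    This is zero for any baseline that does not depend on the action,
    which is why subtracting it changes the variance but not the mean.
    """
    total = [0.0] * len(probs)
    for action, p in enumerate(probs):
        for i, d in enumerate(grad_log_softmax(probs, action)):
            total[i] += p * baseline * d
    return total
```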

Throughout the paper, we also use importance sampling ratios that enable us to perform off-policy updates. Assume that some behavior policy $\mu$ is used to generate dialogs, and may in general be different from the target policy $\pi$ we wish to optimize. At timestep $t$, we define the importance sampling ratio as

$$\rho_t = \frac{\pi(a_t \mid \mathbf{s}_t; \theta)}{\mu(a_t \mid \mathbf{s}_t)} \tag{4}$$
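In code, the importance sampling ratio only needs the probability each policy assigned to the action that was actually taken. The sketch below assumes those probabilities were recorded per timestep; the helper names are hypothetical, and the whole-dialog product is included only as one common way to turn per-timestep ratios into a single weight.

```python
def importance_ratios(target_probs, behavior_probs):
    """Per-timestep ratio: target-policy probability over behavior-policy probability.

    Both lists hold the probability of the action actually taken at each timestep.
    """
    return [p / q for p, q in zip(target_probs, behavior_probs)]


def dialog_weight(ratios):
    """Product of the per-timestep ratios over one dialog."""
    product = 1.0
    for ratio in ratios:
        product *= ratio
    return product
```

When the behavior and target policies coincide, every ratio is 1 and the dialog weight is 1, which recovers the on-policy case.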


3 Methods

3.1 Benchmarks

Before introducing our methods, we first describe our two benchmarks. The first uses (2) directly. The second uses (3), with $b$ computed as an estimate of the average return of $\pi$:


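$$b = \frac{\sum_{k \in K} w_k\, G_1^{(k)}}{\sum_{k \in K} w_k} \tag{5}$$

where $K$ is a window of most recent episodes (dialogs), $G_1^{(k)}$ is the return of dialog $k$ from its first timestep, and the weight of each dialog is $w_k = \prod_{t} \rho_t^{(k)}$. To compute ratios using (4), the policy that generated the data is $\mu$, and the current (i.e., target) policy is $\pi$.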
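A windowed baseline of this kind might be sketched as follows. The normalization by the sum of weights (a weighted average) is an assumption, as are the class name `WindowedBaseline` and its interface; the caller is assumed to supply one importance weight per stored dialog, recomputed under the current policy.

```python
from collections import deque


class WindowedBaseline:
    """Weighted average of the returns of the most recent dialogs (hypothetical helper)."""

    def __init__(self, window_size):
        # deque(maxlen=...) drops the oldest dialog once the window is full.
        self.returns = deque(maxlen=window_size)

    def add(self, dialog_return):
        self.returns.append(dialog_return)

    def value(self, weights):
        """weights[k] is the importance weight of the k-th dialog in the window."""
        if len(weights) != len(self.returns):
            raise ValueError("one weight per stored dialog is required")
        total = sum(weights)
        if total == 0.0:
            return 0.0  # no usable dialogs: fall back to a zero baseline
        return sum(w * g for w, g in zip(weights, self.returns)) / total
```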

3.2 Method 1: State value function as baseline

Our first method modifies parameter update (3) by using a per-timestep baseline:


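$$\theta \leftarrow \theta + \alpha \sum_{t=1}^{T} \nabla_{\theta} \log \pi(a_t \mid \mathbf{s}_t; \theta)\, \bigl(G_t - V(\mathbf{s}_t; \mathbf{w})\bigr) \tag{6}$$

where $V(\mathbf{s}_t; \mathbf{w})$ is an estimate of the value of $\pi$ starting from $\mathbf{s}_t$, and $\mathbf{w}$ parameterizes $V$. This method allows a gradient step to be taken in light of the value of the current state. We implement $V$ as a second RNN, and update its parameters $\mathbf{w}$ at the end of each dialog using supervised learning, as

$$\mathbf{w} \leftarrow \mathbf{w} + \alpha \sum_{t=1}^{T} \nabla_{\mathbf{w}} V(\mathbf{s}_t; \mathbf{w})\, \bigl(G_t - V(\mathbf{s}_t; \mathbf{w})\bigr) \tag{7}$$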


Note that this update is also on-policy since the policy generating the episode is the same as the policy for which we want to estimate the value.
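The supervised value update can be illustrated with a much simpler function approximator. The sketch below substitutes a linear value function over hand-supplied feature vectors for the second RNN, purely to show the regression toward observed returns; `value_step` and its arguments are hypothetical names.

```python
def value_step(weights, features, returns, step_size):
    """One on-policy regression step toward the observed returns of a single dialog.

    features[t] is the feature vector at timestep t and returns[t] the return from t.
    A linear model V(x) = weights . x stands in for the value network.
    """
    gradient = [0.0] * len(weights)
    for x, g in zip(features, returns):
        prediction = sum(w * xi for w, xi in zip(weights, x))
        error = g - prediction
        for i, xi in enumerate(x):
            # For a linear model, the gradient of V with respect to the weights is x.
            gradient[i] += error * xi
    return [w + step_size * gi for w, gi in zip(weights, gradient)]
```

Repeated steps on the same dialog move the prediction toward the observed return, which is the behavior the per-timestep baseline relies on.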

3.3 Method 2: Experience replay for value network

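Method 2 increases learning speed further by reusing past dialogs to better estimate $V$. Since the policy changes after each dialog, past dialogs are off-policy with respect to $\pi$, so a correction to (7) is needed. Precup et al. (2001) showed that the following off-policy update is equal to the on-policy update, in expectation:

$$\mathbf{w} \leftarrow \mathbf{w} + \alpha \Bigl(\prod_{i=1}^{T} \rho_i\Bigr) \sum_{t=1}^{T} \nabla_{\mathbf{w}} V(\mathbf{s}_t; \mathbf{w})\, \bigl(G_t - V(\mathbf{s}_t; \mathbf{w})\bigr) \tag{8}$$

Our second method takes a step with (7), with $d$ as the last dialog, and one or more steps with (8), where $d$ is sampled from recent dialogs.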
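A replayed value update might look like the sketch below. Two things are assumptions rather than details taken from the text: the correction is applied as a single whole-dialog product of importance ratios, and a linear value function again stands in for the value RNN. The dictionary keys and function names are hypothetical.

```python
def value_gradient(weights, features, returns):
    """Summed regression gradient for a linear value function over one dialog."""
    gradient = [0.0] * len(weights)
    for x, g in zip(features, returns):
        error = g - sum(w * xi for w, xi in zip(weights, x))
        for i, xi in enumerate(x):
            gradient[i] += error * xi
    return gradient


def replay_value_step(weights, dialog, current_action_probs, step_size):
    """Off-policy value step on a stored dialog.

    dialog["behavior_probs"][t] is the probability the old policy gave the action
    taken at timestep t; current_action_probs[t] is the current policy's
    probability for that same action.
    """
    weight = 1.0
    for p_now, p_then in zip(current_action_probs, dialog["behavior_probs"]):
        weight *= p_now / p_then  # importance ratio for this timestep
    gradient = value_gradient(weights, dialog["features"], dialog["returns"])
    return [w + step_size * weight * g for w, g in zip(weights, gradient)]
```

If the policy has not changed since the dialog was collected, the weight is 1 and the step is identical to the on-policy one; dialogs the current policy would be unlikely to produce are down-weighted.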

3.4 Method 3: Experience replay for policy network

Our third method improves over the second method by applying experience replay to the policy network. Specifically, Degris et al. (2012) show that samples of the following expectation, which is under behavior policy $\mu$, can be used to estimate the gradient of the policy network representing the policy $\pi$:

$$\mathbb{E}_{\mu}\bigl[\rho_t\, \nabla_{\theta} \log \pi(a_t \mid \mathbf{s}_t; \theta)\, Q^{\pi}(\mathbf{s}_t, a_t)\bigr] \tag{9}$$


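We do not have access to $Q^{\pi}$, but since $Q^{\pi}(\mathbf{s}_t, a_t) = \mathbb{E}_{\pi}[G_t \mid \mathbf{s}_t, a_t]$, we can state the following off-policy update for the policy network:

$$\theta \leftarrow \theta + \alpha \sum_{t=1}^{T} \rho_t\, \nabla_{\theta} \log \pi(a_t \mid \mathbf{s}_t; \theta)\, G_t \tag{10}$$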

Method 3 first applies Method 2, then updates the policy network by taking one step with (6), followed by one or more steps with (10) using samples from recent dialogs.
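The order of updates after each dialog can be sketched as a small scheduling function. The four update callbacks are hypothetical placeholders for the on-policy and off-policy steps of the two networks, sampling uniformly from the recent dialogs is an assumption, and the default replay counts of 5 and 3 are the ones reported for the lunar lander experiments.

```python
import random


def method3_updates(latest, recent, value_on, value_off, policy_on, policy_off,
                    value_replays=5, policy_replays=3, rng=random):
    """Run the per-dialog update schedule: value network first, then policy network.

    latest is the dialog just collected; recent is a list of stored past dialogs.
    """
    # Method 2: one on-policy value step, then replayed off-policy value steps.
    value_on(latest)
    if recent:
        for _ in range(value_replays):
            value_off(rng.choice(recent))
    # Method 3 adds the same pattern for the policy network.
    policy_on(latest)
    if recent:
        for _ in range(policy_replays):
            policy_off(rng.choice(recent))
```

Updating the value network before the policy network means the policy step uses the freshest available baseline.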

4 Problem 1: dialog system

To test our approach, we created a dialog system for initiating phone calls to a contact in an address book, taken from the Microsoft internal employee directory. Full details are given in Williams and Zweig (2016); briefly, in the dialog system, there are three entity types: name, phonetype, and yesno. A contact’s name may have synonyms (“Michael”/“Mike”), and a contact may have more than one phone type (e.g., “work”, “mobile”), which in turn have synonyms (e.g., “cell” for “mobile”).

To run large numbers of RL experiments, we then created a stateful goal-directed simulated user, where the goal was sometimes not covered by the dialog system, and where the behavior was stochastic – for example, the simulated user usually answers questions, but can instead ignore the system, provide additional information, or give up. The user simulation was parameterized with around 10 probabilities.

We defined the reward as for successfully completing the task, and otherwise. A discount factor $\gamma < 1$ was used to incentivize the system to complete dialogs faster rather than slower. For the policy network, we defined the state-update function $f$ and output function $g$ to implement an LSTM with 32 hidden units, with a dense output layer with a softmax activation. The value network was identical in structure except it had a single output with a linear activation. We used a batch size of 1, so we updated both networks after completion of a single dialog. We used Adadelta with stepsize and . Dialogs took between 3 and 10 timesteps. Every 10 dialogs, the policy was frozen and run for 1000 dialogs to measure average task completion.
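The evaluation protocol (freeze the policy every 10 training dialogs and run 1000 evaluation dialogs) can be sketched as below. The two callbacks are hypothetical: `train_one_dialog` is assumed to run one dialog and apply the updates, and `run_eval_dialog` is assumed to run one dialog with the frozen policy and return whether the task succeeded.

```python
def train_with_periodic_eval(num_training_dialogs, train_one_dialog, run_eval_dialog,
                             eval_every=10, eval_dialogs=1000):
    """Return a learning curve as (training dialogs so far, average task success)."""
    curve = []
    for n in range(1, num_training_dialogs + 1):
        train_one_dialog()
        if n % eval_every == 0:
            # No updates happen during evaluation, so the policy is effectively frozen.
            successes = sum(1 for _ in range(eval_dialogs) if run_eval_dialog())
            curve.append((n, successes / eval_dialogs))
    return curve
```

Averaging such curves over many independent runs gives plots of the kind shown in the figures below.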

Figure 1: Number of dialogs vs. average task success over 200 runs for the dialog task.
Figure 2: Number of dialogs vs. variance in task success among 200 runs.

Figures 1 and 2 show mean and variance for task completion over 200 independent runs. Compared to the benchmarks, our methods require about one third fewer dialogs to attain asymptotic performance, and have lower variance.

5 Problem 2: lunar lander

To test generality and provide for reproducibility, we sought to evaluate on a publicly available dialog task. However, to our knowledge none exists, and so we instead applied our method to a public but non-dialog RL task, called “Lunar Lander”, from the OpenAI gym. This domain has a continuous (fully-observable) state space in 8 dimensions (e.g., x-y coordinates, velocities, leg-touchdown sensors), and 4 discrete actions (controlling 3 engines). The reward is +100 for a safe landing in the designated area, and -100 for a crash. Using the engines also incurs a cost, as explained in the link below. Episodes finish when the spacecraft crashes or lands. We used .

Since this domain is fully observable, we chose definitions of the state-update function $f$ and output function $g$ in the policy network corresponding to a fully connected neural network with 2 hidden layers, followed by a softmax normalization. We further chose ReLU activations and 16 hidden units, based on limited initial experimentation. The value network has the same architecture except for the output layer, which has a single node with linear activation. We used a batch size of 10 episodes, as we found that divergence appears frequently with a batch of 1. We used a stepsize of and used the Adam algorithm (Kingma and Ba, 2014) with its default parameters. Methods 2 and 3 perform 5 off-policy updates for each on-policy update of the value network. Method 3 performs 3 off-policy updates for each on-policy update of the policy network.
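For reference, the forward pass of such a feed-forward policy is sketched below in plain Python. It assumes 16 units in each of the two hidden layers (the text does not say whether 16 is per layer or in total), zero-initialized biases, and a uniform weight initialization; all function names are hypothetical and training is omitted.

```python
import math
import random


def dense(weights, biases, inputs):
    """Affine layer: one output per row of weights."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    return [max(0.0, v) for v in values]


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(rng, n_out, n_in):
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def build_policy(rng, obs_dim=8, hidden=16, num_actions=4):
    """Two hidden layers followed by an output layer with one logit per action."""
    return [init_layer(rng, hidden, obs_dim),
            init_layer(rng, hidden, hidden),
            init_layer(rng, num_actions, hidden)]


def action_probs(layers, observation):
    """Map an 8-dimensional observation to a distribution over the 4 actions."""
    hidden = observation
    for weights, biases in layers[:-1]:
        hidden = relu(dense(weights, biases, hidden))
    weights, biases = layers[-1]
    return softmax(dense(weights, biases, hidden))
```

A value network with the same structure would replace the final softmax layer with a single linear output.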

Figure 3: Number of episodes vs. average return over 200 runs for the lunar lander task.

Results are in Figure 3, and show a similar increase in sample efficiency as in the dialog task.

6 Related work

Since neural networks naturally lend themselves to policy gradient-style updates, much past work has adopted this broad approach. However, most work has studied the fully observable case, whereas we study the partially observable case. For example, AlphaGo (Silver et al., 2016) applies policy gradients (among other methods), but Go is fully observable via the state of the board.

Several papers that study fully-observable RL are related to our work in other ways. Degris et al. (2012) investigate off-policy policy gradient updates, but are limited to linear models. Our use of experience replay is also off-policy optimization, but we apply (recurrent) neural networks. Like our work, Fatemi et al. (2016) also estimate a value network, use experience replay to optimize that value network, and evaluate on a conversational system task. However, unlike our work, they do not use experience replay in the policy network, their networks rely on an external state tracking process to render the state fully-observable, and they learn feed-forward networks rather than recurrent networks.

Hausknecht and Stone (2015) apply RNNs to partially observable RL problems, but adopt a Q-learning approach rather than a policy gradient approach. Whereas policy gradient methods have strong convergence properties, Q-learning can diverge, and we observed this when we attempted to optimize Q represented as an LSTM on our dialog problem. Also, a policy network can be pre-trained directly from (near-) expert trajectories using classical supervised learning, and in real-world applications these trajectories are often available (Williams and Zweig, 2016).

7 Conclusions

We have introduced 3 methods for increasing sample efficiency in policy-gradient RL. In a dialog task with partially observable state, our best method improved sample efficiency by about a third. On a second fully-observable task, we observed a similar gain in sample efficiency, despite using a different network architecture, activation function, and optimizer. This result shows that the method is robust to variation in task and network design, and thus it seems promising that it will generalize to other dialog domains as well. In future work we will apply the method to a dialog system with real human users.