Deal or No Deal? End-to-End Learning for Negotiation Dialogues

by Mike Lewis, et al.
Georgia Institute of Technology

Much of human dialogue occurs in semi-cooperative settings, where agents with different goals attempt to agree on common decisions. Negotiations require complex communication and reasoning skills, but success is easy to measure, making this an interesting task for AI. We gather a large dataset of human-human negotiations on a multi-issue bargaining task, where agents who cannot observe each other's reward functions must reach an agreement (or a deal) via natural language dialogue. For the first time, we show it is possible to train end-to-end models for negotiation, which must learn both linguistic and reasoning skills with no annotated dialogue states. We also introduce dialogue rollouts, in which the model plans ahead by simulating possible complete continuations of the conversation, and find that this technique dramatically improves performance. Our code and dataset are publicly available (



1 Introduction

Figure 1: A dialogue in our Mechanical Turk interface, which we used to collect a negotiation dataset.

Intelligent agents often need to cooperate with others who have different goals, and typically use natural language to agree on decisions. Negotiation is simultaneously a linguistic and a reasoning problem, in which an intent must be formulated and then verbally realised. Such dialogues contain both cooperative and adversarial elements, and require agents to understand, plan, and generate utterances to achieve their goals (Traum et al., 2008; Asher et al., 2012).

We collect the first large dataset of natural language negotiations between two people, and show that end-to-end neural models can be trained to negotiate by maximizing the likelihood of human actions. This approach is scalable and domain-independent, but does not model the strategic skills required for negotiating well. We further show that models can be improved by training and decoding to maximize reward instead of likelihood—by training with self-play reinforcement learning, and using rollouts to estimate the expected reward of utterances during decoding.

To study semi-cooperative dialogue, we gather a dataset of 5808 dialogues between humans on a negotiation task. Users were shown a set of items with a value for each, and asked to agree how to divide the items with another user who has a different, unseen, value function (Figure 1).

We first train recurrent neural networks to imitate human actions. We find that models trained to maximise the likelihood of human utterances can generate fluent language, but make comparatively poor negotiators that are overly willing to compromise. We therefore explore two methods for improving the model’s strategic reasoning skills, both of which attempt to optimise for the agent’s goals rather than simply imitating humans:

Firstly, instead of training to optimise likelihood, we show that our agents can be considerably improved using self play, in which pre-trained models practice negotiating with each other in order to optimise performance. To avoid the models diverging from human language, we interleave reinforcement learning updates with supervised updates. For the first time, we show that end-to-end dialogue agents trained using reinforcement learning outperform their supervised counterparts in negotiations with humans.

Secondly, we introduce a new form of planning for dialogue called dialogue rollouts, in which an agent simulates complete dialogues during decoding to estimate the reward of utterances. We show that decoding to maximise the reward function (rather than likelihood) significantly improves performance against both humans and machines.

Analysing the performance of our agents, we find evidence of sophisticated negotiation strategies. For example, we find instances of the model feigning interest in a valueless issue, so that it can later ‘compromise’ by conceding it. Deceit is a complex skill that requires hypothesising the other agent’s beliefs, and is learnt relatively late in child development (Talwar and Lee, 2002). Our agents have learnt to deceive without any explicit human design, simply by trying to achieve their goals.

The rest of the paper proceeds as follows: §2 describes the collection of a large dataset of human-human negotiation dialogues. §3 describes a baseline supervised model, which we then show can be improved by goal-based training (§4) and decoding (§5). §6 measures the performance of our models and humans on this task, and §7 gives a detailed analysis and suggests future directions.

2 Data Collection

Crowd Sourced Dialogue
Agent 1 Input: 3xbook value=1 2xhat value=3 1xball value=1
Agent 2 Input: 3xbook value=2 2xhat value=1 1xball value=2
Dialogue:
Agent 1: I want the books and the hats, you get the ball
Agent 2: Give me a book too and we have a deal
Agent 1: Ok, deal
Agent 2: <choose>
Agent 1 Output: 2xbook 2xhat
Agent 2 Output: 1xbook 1xball

Perspective: Agent 1
Input: 3xbook value=1 2xhat value=3 1xball value=1
Dialogue:
write: I want the books and the hats, you get the ball
read: Give me a book too and we have a deal
write: Ok, deal
read: <choose>
Output: 2xbook 2xhat

Perspective: Agent 2
Input: 3xbook value=2 2xhat value=1 1xball value=2
Dialogue:
read: I want the books and the hats, you get the ball
write: Give me a book too and we have a deal
read: Ok, deal
write: <choose>
Output: 1xbook 1xball
Figure 2: Converting a crowd-sourced dialogue (left) into two training examples (right), from the perspective of each user. The perspectives differ on their input goals, output choice, and in special tokens marking whether a statement was read or written. We train conditional language models to predict the dialogue given the input, and additional models to predict the output given the dialogue.

2.1 Overview

To enable end-to-end training of negotiation agents, we first develop a novel negotiation task and curate a dataset of human-human dialogues for this task. This task and dataset follow our proposed general framework for studying semi-cooperative dialogue. Initially, each agent is shown an input specifying a space of possible actions and a reward function which will score the outcome of the negotiation. Agents then sequentially take turns of either sending natural language messages, or selecting that a final decision has been reached. When one agent selects that an agreement has been made, both agents independently output what they think the agreed decision was. If conflicting decisions are made, both agents are given zero reward.

2.2 Task

Our task is an instance of multi-issue bargaining (Fershtman, 1990), and is based on DeVault et al. (2015). Two agents are both shown the same collection of items, and instructed to divide them so that each item is assigned to one agent.

Each agent is given a different randomly generated value function, which gives a non-negative value for each item. The value functions are constrained so that: (1) the total value for a user of all items is 10; (2) each item has non-zero value to at least one user; and (3) some items have non-zero value to both users. These constraints enforce that it is not possible for both agents to receive a maximum score, and that no item is worthless to both agents, so the negotiation will be competitive. After 10 turns, we allow agents the option to complete the negotiation with no agreement, which is worth 0 points to both users. We use 3 item types (books, hats, balls), and between 5 and 7 total items in the pool. Figure 1 shows our interface.
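The constraints above can be checked mechanically. The following is a minimal sketch, not the authors' generator: it assumes integer values, at least one item of every type, and a simple rejection sampler, none of which are specified in the text. The function names are hypothetical.

```python
import random
from itertools import product

ITEM_TYPES = ("book", "hat", "ball")
TOTAL_VALUE = 10


def value_functions(counts, total=TOTAL_VALUE):
    """Enumerate every non-negative integer value function worth `total` overall."""
    return [
        values
        for values in product(range(total + 1), repeat=len(counts))
        if sum(c * v for c, v in zip(counts, values)) == total
    ]


def is_valid_scenario(counts, values_a, values_b, total=TOTAL_VALUE):
    """Check the three constraints listed above for one scenario."""
    for values in (values_a, values_b):
        if sum(c * v for c, v in zip(counts, values)) != total:
            return False  # (1) the full pool must be worth 10 to each user
    pairs = list(zip(values_a, values_b))
    if any(a == 0 and b == 0 for a, b in pairs):
        return False  # (2) no item may be worthless to both users
    return any(a > 0 and b > 0 for a, b in pairs)  # (3) some item is contested


def sample_scenario(rng, max_tries=10_000):
    """Rejection-sample a scenario; the sampling distribution is an assumption."""
    for _ in range(max_tries):
        counts = tuple(rng.randint(1, 4) for _ in ITEM_TYPES)
        if not 5 <= sum(counts) <= 7:
            continue  # the pool holds between 5 and 7 items
        candidates = value_functions(counts)
        values_a, values_b = rng.choice(candidates), rng.choice(candidates)
        if is_valid_scenario(counts, values_a, values_b):
            return counts, values_a, values_b
    raise RuntimeError("no valid scenario found")


# The scenario shown in Figure 2 satisfies all three constraints.
print(is_valid_scenario((3, 2, 1), (1, 3, 1), (2, 1, 2)))
```

Enumerating the value functions first keeps the rejection loop short, because only the pairwise constraints (2) and (3) are left to chance.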

2.3 Data Collection

We collected a set of human-human dialogues using Amazon Mechanical Turk. Workers were paid $0.15 per dialogue, with a $0.05 bonus for maximal scores. We only used workers based in the United States with a 95% approval rating and at least 5000 previous HITs. Our data collection interface was adapted from that of Das et al. (2016).

We collected a total of 5808 dialogues, based on 2236 unique scenarios (where a scenario is the available items and values for the two users). We held out a test set of 252 scenarios (526 dialogues). Holding out test scenarios means that models must generalise to new situations.

3 Likelihood Model

[Figure 3 panels: (a) Supervised Training; (b) Decoding, and Reinforcement Learning. Each panel shows an Input Encoder and an Output Decoder.]
Figure 3: Our model: tokens are predicted conditioned on previous words and the input, then the output is predicted using attention over the complete dialogue. In supervised training (3(a)), we train the model to predict the tokens of both agents. During decoding and reinforcement learning (3(b)), some tokens are sampled from the model, but some are generated by the other agent and are only encoded by the model.

We propose a simple but effective baseline model for the conversational agent, in which a sequence-to-sequence model is trained to produce the complete dialogue, conditioned on an agent’s input.

3.1 Data Representation

Each dialogue is converted into two training examples, showing the complete conversation from the perspective of each agent. The examples differ on their input goals, output choice, and whether utterances were read or written.

Training examples contain an input goal $g$, specifying the available items and their values, a dialogue $x$, and an output decision $o$ specifying which items each agent will receive. Specifically, we represent $g$ as a list of six integers corresponding to the count and value of each of the three item types. Dialogue $x$ is a list of tokens $x_{0..T}$ containing the turns of each agent interleaved with symbols marking whether a turn was written by the agent or their partner, terminating in a special token indicating one agent has marked that an agreement has been made. Output $o$ is six integers describing how many of each of the three item types are assigned to each agent. See Figure 2.
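As an illustration of this representation, the sketch below converts the dialogue from Figure 2 into one example per agent. It is a sketch under stated assumptions: whitespace tokenisation, the interleaved (count, value) layout of the goal, and the own-counts-then-partner-counts ordering of the output are guesses, and `to_training_examples` is a hypothetical helper.

```python
def to_training_examples(goals, turns, outputs):
    """Build one training example per agent from a single crowd-sourced dialogue.

    goals:   {agent: [count, value, count, value, count, value]}
    turns:   list of (agent, text) pairs in the order they were spoken
    outputs: {agent: [items that agent receives, one count per item type]}
    """
    examples = {}
    for me in goals:
        (partner,) = [agent for agent in goals if agent != me]
        tokens = []
        for speaker, text in turns:
            # Mark each turn as written by this agent or read from the partner.
            tokens.append("write:" if speaker == me else "read:")
            tokens.extend(text.split())  # whitespace tokenisation is an assumption
        examples[me] = {
            "input": goals[me],
            "dialogue": tokens,
            # Six integers: own counts followed by the partner's counts (assumed order).
            "output": outputs[me] + outputs[partner],
        }
    return examples


figure2 = to_training_examples(
    goals={1: [3, 1, 2, 3, 1, 1], 2: [3, 2, 2, 1, 1, 2]},
    turns=[
        (1, "I want the books and the hats, you get the ball"),
        (2, "Give me a book too and we have a deal"),
        (1, "Ok, deal"),
        (2, "<choose>"),
    ],
    outputs={1: [2, 2, 0], 2: [1, 0, 1]},
)
print(figure2[2]["dialogue"][:3])
```

The two examples share the same words and differ only in the goal, the output ordering, and the read/write markers.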

3.2 Supervised Learning

We train a sequence-to-sequence network to generate an agent’s perspective of the dialogue conditioned on the agent’s input goals (Figure 3(a)).

The model uses 4 recurrent neural networks, implemented as GRUs (Cho et al., 2014): $\text{GRU}_w$, $\text{GRU}_g$, $\text{GRU}_{\overrightarrow{o}}$, and $\text{GRU}_{\overleftarrow{o}}$.

The agent’s input goals $g$ are encoded using $\text{GRU}_g$. We refer to the final hidden state as $h^g$. The model then predicts each token $x_t$ from left to right, conditioned on the previous tokens and $h^g$. At each time step $t$, $\text{GRU}_w$ takes as input the previous hidden state $h_{t-1}$, previous token $x_{t-1}$ (embedded with a matrix $E$), and input encoding $h^g$. Conditioning on the input at each time step helps the model learn dependencies between language and goals.

$$h_t = \text{GRU}_w(h_{t-1}, [E x_{t-1}, h^g]) \tag{1}$$

The token at each time step is predicted with a softmax, which uses weight tying with the embedding matrix $E$ (Mao et al., 2015):

$$p_\theta(x_t \mid x_{0..t-1}, g) \propto \exp(E^\top h_t) \tag{2}$$

Note that the model predicts both agents’ words, enabling its use as a forward model in Section 5.

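At the end of the dialogue, the agent outputs a set of tokens $o$ representing the decision. We generate each output conditionally independently, using a separate classifier for each. The classifiers share a bidirectional $\text{GRU}_o$ and an attention mechanism (Bahdanau et al., 2014) over the dialogue, and additionally condition on the input goals.

$$h^{\overrightarrow{o}}_t = \text{GRU}_{\overrightarrow{o}}(h^{\overrightarrow{o}}_{t-1}, [E x_t, h_t]) \tag{3}$$

$$h^{\overleftarrow{o}}_t = \text{GRU}_{\overleftarrow{o}}(h^{\overleftarrow{o}}_{t+1}, [E x_t, h_t]) \tag{4}$$

$$h^o_t = [h^{\overleftarrow{o}}_t, h^{\overrightarrow{o}}_t] \tag{5}$$

$$h^a_t = W[\tanh(W' h^o_t)] \tag{6}$$

$$\alpha_t = \frac{\exp(w \cdot h^a_t)}{\sum_{t'} \exp(w \cdot h^a_{t'})} \tag{7}$$

$$h^s = \tanh(W^s [h^g, \sum_t \alpha_t h_t]) \tag{8}$$

The output tokens are predicted using softmax:

$$p_\theta(o_i \mid x_{0..T}, g) \propto \exp(W^{o_i} h^s) \tag{9}$$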
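The model is trained to minimize the negative log likelihood of the token sequence $x_{0..T}$ conditioned on the input goals $g$, and of the outputs $o$ conditioned on $x$ and $g$. The two terms are weighted with a hyperparameter $\alpha$:

$$L(\theta) = -\sum_{x,g} \sum_t \log p_\theta(x_t \mid x_{0..t-1}, g) - \alpha \sum_{x,g,o} \sum_j \log p_\theta(o_j \mid x_{0..T}, g) \tag{10}$$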

Unlike the Neural Conversational Model (Vinyals and Le, 2015), our approach shares all parameters for reading and generating tokens.
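For a single training example, the supervised objective of this section reduces to a sum of two negative log likelihoods. The sketch below is only a worked instance of that arithmetic: `supervised_loss` is a hypothetical name, and the probabilities are made-up numbers standing in for model outputs.

```python
import math


def supervised_loss(token_probs, output_probs, alpha=0.5):
    """Negative log likelihood of the dialogue tokens plus the weighted output term.

    token_probs:  model probability of each reference token given its prefix
    output_probs: model probability of each reference output choice
    alpha:        weight on the output term (0.5 is the value reported in Section 6.1)
    """
    token_nll = -sum(math.log(p) for p in token_probs)
    output_nll = -sum(math.log(p) for p in output_probs)
    return token_nll + alpha * output_nll


# Two tokens predicted with probability 0.5 and one output with probability 0.25.
print(supervised_loss([0.5, 0.5], [0.25]))
```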

3.3 Decoding

During decoding, the model must generate an output token $x_t$ conditioned on dialogue history $x_{0..t-1}$ and input goals $g$, by sampling from $p_\theta$:

$$x_t \sim p_\theta(x_t \mid x_{0..t-1}, g) \tag{11}$$

If the model generates a special end-of-turn token, it then encodes a series of tokens output by the other agent, until its next turn (Figure 3(b)).

The dialogue ends when either agent outputs a special end-of-dialogue token. The model then outputs a set of choices $o$. We choose each item independently, but enforce consistency by checking the solution is in a feasible set $O$:

$$o^* = \operatorname*{argmax}_{o \in O} \prod_i p_\theta(o_i \mid x_{0..T}, g) \tag{12}$$

In our task, a solution is feasible if each item is assigned to exactly one agent. The space of solutions is small enough to be tractably enumerated.
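Because the feasible set is small, the constrained choice of output can be done by brute force. The sketch below enumerates every split of the item pool and scores it as a product of independent classifier probabilities. The layout of `probs` (own counts first, then the partner's) and the name `best_feasible_output` are assumptions made for illustration.

```python
from itertools import product
from math import prod


def best_feasible_output(counts, probs):
    """Pick the most probable output in which every item goes to exactly one agent.

    counts[i]:   number of items of type i in the pool
    probs[j][k]: classifier probability that output slot j takes the value k;
                 slots 0..n-1 are the agent's own counts, slots n..2n-1 the partner's
    """
    best, best_score = None, -1.0
    for mine in product(*(range(c + 1) for c in counts)):
        theirs = tuple(c - m for c, m in zip(counts, mine))
        output = mine + theirs  # feasible by construction
        score = prod(probs[j][k] for j, k in enumerate(output))
        if score > best_score:
            best, best_score = output, score
    return best, best_score


# One item of each of two types. Taking the per-slot argmax would give the
# first item to both agents, which is infeasible.
toy_probs = [[0.1, 0.9], [0.6, 0.4], [0.2, 0.8], [0.3, 0.7]]
print(best_feasible_output((1, 1), toy_probs))
```

In the toy case the independent per-slot choices conflict, so the enumeration falls back to the best consistent split.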

4 Goal-based Training

[Figure 4 example]
Dialogue history: read: You get one book and I’ll take everything else.
Candidate responses: write: Great deal, thanks! / write: No way, I need all 3 hats
Simulation of rest of dialogue: read: Ok, fine / read: I’ll give you 2 / read: No problem / read: Any time / choose: 3x hat / choose: 2x hat / choose: 1x book / choose: 1x book

Figure 4: Decoding through rollouts: The model first generates a small set of candidate responses. For each candidate it simulates the future conversation by sampling, and estimates the expected future reward by averaging the scores. The system outputs the candidate with the highest expected reward.

Supervised learning aims to imitate the actions of human users, but does not explicitly attempt to maximise an agent’s goals. Instead, we explore pre-training with supervised learning, and then fine-tuning against the evaluation metric using reinforcement learning. Similar two-stage learning strategies have been used previously (e.g. Li et al. (2016); Das et al. (2017)).

During reinforcement learning, an agent $A$ attempts to improve its parameters from conversations with another agent $B$. While the other agent $B$ could be a human, in our experiments we used our fixed supervised model that was trained to imitate humans. The second model is fixed as we found that updating the parameters of both agents led to divergence from human language. In effect, agent $A$ learns to improve by simulating conversations with the help of a surrogate forward model.

Agent $A$ reads its goals $g$ and then generates tokens $x_{0..n}$ by sampling from $p_\theta$. When $A$ generates an end-of-turn marker, it then reads in tokens $x_{n+1..m}$ generated by agent $B$. These turns alternate until one agent emits a token ending the dialogue. Both agents then output a decision $o$ and collect a reward from the environment (which will be 0 if they output different decisions). We denote the subset of tokens generated by $A$ as $X^A$ (e.g. tokens with incoming arrows in Figure 3(b)).

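After a complete dialogue has been generated, we update agent $A$’s parameters based on the outcome of the negotiation. Let $r^A$ be the score agent $A$ achieved in the completed dialogue, $T$ be the length of the dialogue, $\gamma$ be a discount factor that rewards actions at the end of the dialogue more strongly, and $\mu$ be a running average of completed dialogue rewards so far. (As all rewards are non-negative, we instead re-scale them by subtracting the mean reward found during self play. Shifting in this way can reduce the variance of our estimator.) We define the future reward $R$ for an action $x_t \in X^A$ as follows:

$$R(x_t) = \gamma^{T-t} (r^A - \mu) \tag{13}$$

We then optimise the expected reward of each action $x_t \in X^A$:

$$L^{RL}_\theta = \mathbb{E}_{x_t \sim p_\theta(x_t \mid x_{0..t-1}, g)} [R(x_t)] \tag{14}$$

The gradient of $L^{RL}_\theta$ is calculated as in REINFORCE (Williams, 1992):

$$\nabla_\theta L^{RL}_\theta = \sum_{x_t \in X^A} \mathbb{E}_{x_t} [R(x_t) \nabla_\theta \log p_\theta(x_t \mid x_{0..t-1}, g)] \tag{15}$$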
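The reward shaping described in this section can be worked through with plain arithmetic. The sketch below assumes that each action at position t is weighted by the discount factor raised to the power (dialogue length minus t), with the running mean of past rewards as the baseline. All names are hypothetical, and the log-probabilities would come from the model in a real training loop.

```python
class RunningMean:
    """Incrementally updated average of the rewards seen so far (the baseline)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean


def future_rewards(agent_positions, dialogue_length, score, baseline, gamma=0.95):
    """Discounted, baseline-shifted reward for each token the agent generated."""
    advantage = score - baseline
    return {t: gamma ** (dialogue_length - t) * advantage for t in agent_positions}


def reinforce_loss(log_probs, rewards):
    """Surrogate loss; minimising it follows the REINFORCE gradient estimate."""
    return -sum(rewards[t] * log_probs[t] for t in rewards)


# The agent wrote tokens 0, 1 and 4 of a dialogue whose last position is 4.
rewards = future_rewards([0, 1, 4], 4, score=8, baseline=6, gamma=0.5)
print(rewards, reinforce_loss({0: -1.0, 1: -2.0, 4: -0.5}, rewards))
```

Tokens near the end of the dialogue receive almost the full shifted reward, while early tokens are discounted, which matches the stated role of the discount factor.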


5 Goal-based Decoding

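procedure Rollout(x_{0..i}, g)
   u* ← ∅
   for c ∈ {1..C} do                          ▷ C candidate moves
      j ← i
      do                                      ▷ Rollout to end of turn
         j ← j + 1
         x_j ∼ p_θ(x_j | x_{0..j−1}, g)
      while x_j ∉ {read:, choose:}
      u ← x_{i+1}..x_j                        ▷ u is candidate move
      for s ∈ {1..S} do                       ▷ S samples per move
         k ← j                                ▷ Start rollout from end of u
         while x_k ≠ choose: do               ▷ Rollout to end of dialogue
            k ← k + 1
            x_k ∼ p_θ(x_k | x_{0..k−1}, g)
         o ← argmax_{o′ ∈ O} p(o′ | x_{0..k}, g)   ▷ Calculate rollout output and reward
         R(u) ← R(u) + r(o) p(o | x_{0..k}, g)
      if R(u) > R(u*) then
         u* ← u
   return u*                                  ▷ Return best move
Algorithm 1: Dialogue Rollouts algorithm.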

Likelihood-based decoding (§3.3) may not be optimal. For instance, an agent may be choosing between accepting an offer, or making a counter offer. The former will often have a higher likelihood under our model, as there are fewer ways to agree than to make another offer, but the latter may lead to a better outcome. Goal-based decoding also allows more complex dialogue strategies. For example, a deceptive utterance is likely to have a low model score (as users were generally honest in the supervised data), but may achieve high reward.

We instead explore decoding by maximising expected reward. We achieve this by using $p_\theta$ as a forward model for the complete dialogue, and then deterministically computing the reward. Rewards for an utterance are averaged over samples to calculate expected future reward (Figure 4).

We use a two-stage process: First, we generate $c$ candidate utterances $U = u_{0..c}$, representing possible complete turns that the agent could make, which are generated by sampling from $p_\theta$ until the end-of-turn token is reached. Let $x_{0..n-1}$ be the current dialogue history. We then calculate the expected reward $R(u)$ of candidate utterance $u = x_{n..n+k}$ by repeatedly sampling $x_{n+k+1..T}$ from $p_\theta$, then choosing the best output $o$ using Equation 12, and finally deterministically computing the reward $r(o)$. The reward is scaled by the probability of the output given the dialogue, because if the agents select different outputs then they both receive 0 reward.

$$R(x_{n..n+k}) = \mathbb{E}_{x_{n+k+1..T},\, o \sim p_\theta} [r(o)\, p_\theta(o \mid x_{0..T})] \tag{16}$$

We then return the utterance maximizing $R$.

$$u^* = \operatorname*{argmax}_{u \in U} R(x_{n..n+k}) \tag{17}$$

We use 5 rollouts for each of 10 candidate turns.
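The two-stage procedure can be sketched in a few lines once the model is abstracted behind callables. This is an illustrative sketch, not the released implementation: `sample_turn`, `finish_dialogue` and `reward` are hypothetical stand-ins for the forward model, and the toy rewards are invented values loosely following Figure 4.

```python
import random
from itertools import cycle


def choose_by_rollouts(history, sample_turn, finish_dialogue, reward,
                       num_candidates=10, num_rollouts=5):
    """Return the candidate turn with the highest average simulated reward.

    The three callables are hypothetical stand-ins for the trained model:
      sample_turn(history)     -> list of tokens forming one complete turn
      finish_dialogue(history) -> (output, probability of that output)
      reward(output)           -> deterministic score of the final output
    """
    best_turn, best_value = None, float("-inf")
    for _ in range(num_candidates):
        turn = sample_turn(history)
        total = 0.0
        for _ in range(num_rollouts):
            # Simulate the rest of the conversation after this candidate turn.
            output, probability = finish_dialogue(history + turn)
            # Scale the reward by the probability of the chosen output.
            total += reward(output) * probability
        value = total / num_rollouts
        if value > best_value:
            best_turn, best_value = turn, value
    return best_turn, best_value


# Toy forward model with made-up rewards.
rng = random.Random(0)
toy_turns = cycle([["Great", "deal"], ["I", "need", "all", "3", "hats"]])
toy_rewards = {"1x book": 1, "2x hat": 6, "3x hat": 9}


def toy_sample_turn(history):
    return next(toy_turns)


def toy_finish_dialogue(history):
    if history[-1] == "deal":
        return "1x book", 1.0  # accepting ends the negotiation immediately
    return rng.choice(["2x hat", "3x hat"]), 1.0


best_turn, best_value = choose_by_rollouts(
    ["read:", "You", "get", "one", "book"],
    toy_sample_turn, toy_finish_dialogue, toy_rewards.get,
    num_candidates=2, num_rollouts=5,
)
print(best_turn, best_value)
```

In the toy setting the counter-offer wins even though accepting is the shorter reply, which is the behaviour that likelihood-based decoding tends to miss.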

6 Experiments

vs. likelihood
Model         Score (all)   Score (agreed)   % Agreed   % Pareto Optimal
likelihood    5.4 vs. 5.5   6.2 vs. 6.2      87.9       49.6
rl            7.1 vs. 4.2   7.9 vs. 4.7      89.9       58.6
rollouts      7.3 vs. 5.1   7.9 vs. 5.5      92.9       63.7
rl+rollouts   8.3 vs. 4.2   8.8 vs. 4.5      94.4       74.8

vs. Human
Model         Score (all)   Score (agreed)   % Agreed   % Pareto Optimal
likelihood    4.7 vs. 5.8   6.2 vs. 7.6      76.5       66.2
rl            4.3 vs. 5.0   6.4 vs. 7.5      67.3       69.1
rollouts      5.2 vs. 5.4   7.1 vs. 7.4      72.1       78.3
rl+rollouts   4.6 vs. 4.2   8.0 vs. 7.1      57.2       82.4
Table 1: End task evaluation on heldout scenarios, against the likelihood model and humans from Mechanical Turk. The maximum score is 10. Score (all) gives 0 points when agents failed to agree.
Metric Dataset
Number of Dialogues 5808
Average Turns per Dialogue 6.6
Average Words per Turn 7.6
% Agreed 80.1
Average Score (/10) 6.0
% Pareto Optimal 76.9
Table 2: Statistics on our dataset of crowd-sourced dialogues between humans.

6.1 Training Details

We implement our models using PyTorch. All hyper-parameters were chosen on a development dataset. The input tokens are embedded into a 64-dimensional space, while the dialogue tokens are embedded with 256-dimensional embeddings (with no pre-training). The input $\text{GRU}_g$ has a hidden layer of size 64 and the dialogue $\text{GRU}_w$ is of size 128. The output $\text{GRU}_{\overrightarrow{o}}$ and $\text{GRU}_{\overleftarrow{o}}$ both have a hidden state of size 256, and the size of $h^s$ is 256 as well. During supervised training, we optimise using stochastic gradient descent with a minibatch size of 16, an initial learning rate of 1.0, Nesterov momentum with $\mu$=0.1 (Nesterov, 1983), and clipping gradients whose $L^2$ norm exceeds 0.5. We train the model for 30 epochs and pick the snapshot of the model with the best validation perplexity. We then annealed the learning rate by a factor of 5 each epoch. We weight the terms in the loss function (§3.2) using $\alpha$=0.5. We do not train against output decisions where humans selected different agreements. Tokens occurring fewer than 20 times are replaced with an ‘unknown’ token.

During reinforcement learning, we use a learning rate of 0.1, clip gradients above 1.0, and use a discount factor of $\gamma$=0.95. After every 4 reinforcement learning updates, we make a supervised update with mini-batch size 16 and learning rate 0.5, and we clip gradients at 1.0. We used 4086 simulated conversations.
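The interleaving of the two kinds of update can be pictured as a simple schedule. The generator below is a hypothetical illustration of the ordering only; it performs no training, and the name `update_schedule` is not from the paper.

```python
def update_schedule(num_rl_updates, supervised_every=4):
    """Yield the kind of update to run next: one supervised step after every 4 RL steps."""
    for step in range(1, num_rl_updates + 1):
        yield "rl"
        if step % supervised_every == 0:
            yield "supervised"


print(list(update_schedule(8)))
```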

When sampling words from $p_\theta$, we reduce the variance by doubling the values of logits (i.e. using temperature of 0.5).
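Doubling the logits and dividing them by a temperature of 0.5 are the same operation, which the following sketch makes concrete. The helper names and the example logits are illustrative assumptions.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by the temperature (numerically stabilised)."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    weights = [math.exp(value - peak) for value in scaled]
    total = sum(weights)
    return [w / total for w in weights]


def sample_token(logits, temperature, rng):
    """Draw one token index from the tempered distribution."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.0]
plain = softmax(logits)
tempered = softmax(logits, temperature=0.5)
doubled = softmax([2 * value for value in logits])
print(plain, tempered, doubled)
```

A temperature below 1 sharpens the distribution, so samples concentrate on the most probable words.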

6.2 Comparison Systems

We compare the performance of the following: likelihood uses supervised training and decoding (§3), rl is fine-tuned with goal-based self-play (§4), rollouts uses supervised training combined with goal-based decoding using rollouts (§5), and rl+rollouts uses rollouts with a base model trained with reinforcement learning.

6.3 Intrinsic Evaluation

For development, we measured the perplexity of user generated utterances, conditioned on the input and previous dialogue.

Model Valid PPL Test PPL Test Avg. Rank
likelihood 5.62 5.47 521.8
rl 6.03 5.86 517.6
rollouts - - 844.1
rl+rollouts - - 859.8
Table 3: Intrinsic evaluation showing the average perplexity of tokens and rank of complete turns (out of 2083 unique human messages from the test set). Lower is more human-like for both.

Results are shown in Table 3, and show that the simple likelihood model produces the most human-like responses, and the alternative training and decoding strategies cause a divergence from human language. Note however, that this divergence may not necessarily correspond to lower quality language—it may also indicate different strategic decisions about what to say. Results in §6.4 show all models could converse with humans.
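For reference, perplexity is the exponential of the average negative log probability per token. The sketch below shows that computation on made-up log-probabilities; it assumes the standard definition, as the exact evaluation code is not given in the text.

```python
import math


def perplexity(token_log_probs):
    """Exponential of the mean negative log probability (natural log) per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# A model that assigns probability 0.25 to every token has perplexity 4.
print(perplexity([math.log(0.25)] * 4))
```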

6.4 End-to-End Evaluation

We measure end-to-end performance in dialogues both with the likelihood-based agent and with humans on Mechanical Turk, on held out scenarios.

Humans were told that they were interacting with other humans, as they had been during the collection of our dataset (and few appeared to realize they were in conversation with machines).

We measure the following statistics:
Score: The average score for each agent (which could be a human or model), out of 10.
Agreement: The percentage of dialogues where both agents agreed on the same decision.
Pareto Optimality: The percentage of Pareto optimal solutions for agreed deals (a solution is Pareto optimal if neither agent’s score can be improved without lowering the other’s score). Lower scores indicate inefficient negotiations.
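Pareto optimality of an agreed deal can be checked by enumerating every alternative split of the pool. The sketch below does this for the scenario of Figure 5; `is_pareto_optimal` is a hypothetical helper and assumes integer item counts and additive values.

```python
from itertools import product


def score(values, allocation):
    """Total value of the items an agent receives."""
    return sum(v * n for v, n in zip(values, allocation))


def is_pareto_optimal(counts, values_a, values_b, alloc_a):
    """True if no other split is at least as good for both agents and better for one."""

    def scores(given_to_a):
        given_to_b = [c - n for c, n in zip(counts, given_to_a)]
        return score(values_a, given_to_a), score(values_b, given_to_b)

    score_a, score_b = scores(alloc_a)
    for other in product(*(range(c + 1) for c in counts)):
        other_a, other_b = scores(other)
        if (other_a >= score_a and other_b >= score_b
                and (other_a, other_b) != (score_a, score_b)):
            return False
    return True


# Figure 5: counts are (book, hat, ball); agent A is rl+rollouts, agent B is the human.
counts, model_values, human_values = (1, 1, 3), (6, 4, 0), (3, 1, 2)
print(is_pareto_optimal(counts, model_values, human_values, (1, 1, 0)))
```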

Results are shown in Table 1. Firstly, we see that the rl and rollouts models achieve significantly better results when negotiating with the likelihood model, particularly the rl+rollouts model. The percentage of Pareto optimal solutions also increases, showing a better exploration of the solution space. Compared to human-human negotiations (Table 2), the best models achieve a higher agreement rate, better scores, and similar Pareto efficiency. This result confirms that attempting to maximise reward can outperform simply imitating humans.

Similar trends hold in dialogues with humans, with goal-based reasoning outperforming imitation learning. The rollouts model achieves comparable scores to its human partners, and the rl+rollouts model actually achieves higher scores. However, we also find significantly more cases of the goal-based models failing to agree a deal with humans, largely a consequence of their more aggressive negotiation tactics (see §7).

7 Analysis

Table 1 shows large gains from goal-based methods. In this section, we explore the strengths and weaknesses of our models.

Goal-based models negotiate harder.

The rl+rollouts model has much longer dialogues with humans than likelihood (7.2 turns vs. 5.3 on average), indicating that the model is accepting deals less quickly, and negotiating harder.

A negative consequence of this more aggressive negotiation strategy is that humans were more likely to walk away with no deal, which is reflected in the lower agreement rates. Even though failing to agree was worth 0 points, people often preferred this course over capitulating to an uncompromising opponent—a factor not well captured by the simulated partner in reinforcement learning training or rollouts (as reflected by the larger gains from goal-based models in dialogues with the likelihood model). In particular, the goal-based models are prone to simply rephrasing the same demand each turn, which is a more effective strategy against the likelihood model than humans. Future work should address this issue.

Figure 5 shows an example of our goal-based model stubbornly negotiating until it achieves a good outcome.

Models learn to be deceptive.

Deception can be an effective negotiation tactic. We found numerous cases of our models initially feigning interest in a valueless item, only to later ‘compromise’ by conceding it. Figure 7 shows an example.

Input
Human: 1xbook value=3 1xhat value=1 3xball value=2
rl+rollouts: 1xbook value=6 1xhat value=4 3xball value=0
Dialogue
rl+rollouts: You can have the balls if i can have the hat and book
Human: Nope, give me the books and you can have everything else
rl+rollouts: Ok. I will take the book and one hat
Human: I get all the balls?
rl+rollouts: Yes. I can do that.
Human: Ok, deal
Output (Reward)
rl+rollouts: 1xbook 1xhat (10/10)
Human: 3xball (6/10)
Figure 5: Example of a model stubbornly holding out for maximum points in a dialogue.
Input
Human: 3xbook value=2 3xhat value=1 1xball value=1
rl+rollouts: 3xbook value=1 3xhat value=2 1xball value=1
Dialogue
Human: I’d like the books and the hats.
rl+rollouts: I will take the books and hats
Human: I really need the books and at least one other object.
rl+rollouts: Ok, you can have one book and one ball
Human: If I can have all the books, I can leave you the rest.
rl+rollouts: Deal
Output (Reward)
rl+rollouts: 3xhat 1xball (7/10)
Human: 3xbook (6/10)
Figure 6: Example of model compromising.

Models produce meaningful novel sentences.

One interesting question is whether our models are capable of generating novel sentences in the new circumstances they find themselves in, or if they simply repeat messages from the training data verbatim. We find that 76% of messages produced by the likelihood model in self-play were found in the training data. We manually examined the novel utterances produced by our model, and found that the overwhelming majority were fluent English sentences in isolation—showing that the model has learnt a good language model for the domain (in addition to results that show it uses language effectively to achieve its goals). These results suggest that although neural models are prone to the safer option of repeating sentences from training data, they are capable of generalising when necessary. Future work should choose domains that force a higher degree of diversity in utterances.
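The overlap statistic quoted above can be computed with a set lookup. The sketch below assumes exact string matching between generated and training messages, which is a guess at how matching was done, and uses invented messages.

```python
def fraction_seen(generated, training_messages):
    """Share of generated messages that appear verbatim in the training data."""
    seen = set(training_messages)
    return sum(message in seen for message in generated) / len(generated)


toy_generated = ["ok deal", "i need the hats", "ok deal", "deal"]
toy_training = ["ok deal", "deal", "you can have the ball"]
print(fraction_seen(toy_generated, toy_training))
```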

Input
rl+rollouts: 4xbook value=0 2xhat value=5 1xball value=0
Human: 4xbook value=1 2xhat value=2 1xball value=2
Dialogue
rl+rollouts: You can have the ball if I can have everything else.
Human: No can’t do that the ball is not going to give me anything
rl+rollouts: I need the hats and you can have the rest
Human: Deal
rl+rollouts: Great! Thank you!
Output (Reward)
rl+rollouts: 2xhat (10/10)
Human: 4xbook 1xball (6/10)
Figure 7: Dialogue in which the model’s initial interest in the valueless books allows it to compromise while achieving a maximum score.

Maintaining multi-sentence coherence is challenging.

One common linguistic error we see rl+rollouts make is to start a message by indicating agreement (e.g. I agree or Deal), but then going on to propose a counter offer—a behaviour that human partners found frustrating. One explanation is that the model has learnt that in the supervised data, messages beginning with I agree are often at the end of the dialogue, and partners rarely reply with further negotiation—so the models using rollouts and reinforcement learning believe this tactic will help their offer to be accepted.

8 Related Work

Most work on goal orientated dialogue systems has assumed that state representations are annotated in the training data (Williams and Young, 2007; Henderson et al., 2014; Wen et al., 2016). The use of state annotations allows a cleaner separation of the reasoning and natural language aspects of dialogues, but our end-to-end approach makes data collection cheaper and allows tasks where it is unclear how to annotate state. Bordes and Weston (2016) explore end-to-end goal orientated dialogue with a supervised model—we show improvements over supervised learning with goal-based training and decoding. Recently, He et al. (2017) use task-specific rules to combine the task input and dialogue history into a more structured state representation than ours.

Reinforcement learning (RL) has been applied in many dialogue settings. RL has been widely used to improve dialogue managers, which manage transitions between dialogue states (Singh et al., 2002; Pietquin et al., 2011; Rieser and Lemon, 2011; Gašic et al., 2013; Fatemi et al., 2016). In contrast, our end-to-end approach has no explicit dialogue manager. Li et al. (2016) improve metrics such as diversity for non-goal-orientated dialogue using RL, which would make an interesting extension to our work. Das et al. (2017) use reinforcement learning to improve cooperative bot-bot dialogues. RL has also been used to allow agents to invent new languages (Das et al., 2017; Mordatch and Abbeel, 2017). To our knowledge, our model is the first to use RL to improve the performance of an end-to-end goal orientated dialogue system in dialogues with humans.

Work on learning end-to-end dialogues has concentrated on ‘chat’ settings, without explicit goals (Ritter et al., 2011; Vinyals and Le, 2015; Li et al., 2015). These dialogues contain a much greater diversity of vocabulary than our domain, but do not have the challenging adversarial elements. Such models are notoriously hard to evaluate (Liu et al., 2016), because of the huge diversity of reasonable responses, whereas our task has a clear objective. Our end-to-end approach would also be much more straightforward to integrate into a general-purpose dialogue agent than one that relied on annotated dialogue states (Dodge et al., 2016).

There is a substantial literature on multi-agent bargaining in game theory, e.g. Nash Jr (1950). There has also been computational work on modelling negotiations (Baarslag et al., 2013). Our work differs in that agents communicate in unrestricted natural language, rather than pre-specified symbolic actions, and in our focus on improving performance relative to humans rather than other automated systems. Our task is based on that of DeVault et al. (2015), who study natural language negotiations for pedagogical purposes; their version includes speech rather than textual dialogue, and embodied agents, which would make interesting extensions to our work. The only automated natural language negotiation systems we are aware of have first mapped language to domain-specific logical forms, and then focused on choosing the next dialogue act (Rosenfeld et al., 2014; Cuayáhuitl et al., 2015; Keizer et al., 2017). Our end-to-end approach is the first to learn comprehension, reasoning and generation skills in a domain-independent data driven way.

Our use of a combination of supervised and reinforcement learning for training, and stochastic rollouts for decoding, builds on strategies used in game playing agents such as AlphaGo (Silver et al., 2016). Our work is a step towards real-world applications for these techniques. Our use of rollouts could be extended by choosing the other agent’s responses based on sampling, using Monte Carlo Tree Search (MCTS) (Kocsis and Szepesvári, 2006). However, our setting has a higher branching factor than in domains where MCTS has been successfully applied, such as Go (Silver et al., 2016)—future work should explore scaling tree search to dialogue modelling.

9 Conclusion

We have introduced end-to-end learning of natural language negotiations as a task for AI, arguing that it challenges both linguistic and reasoning skills while having robust evaluation metrics. We gathered a large dataset of human-human negotiations, which contain a variety of interesting tactics. We have shown that it is possible to train dialogue agents end-to-end, but that their ability can be much improved by training and decoding to maximise their goals, rather than likelihood. There remains much potential for future work, particularly in exploring other reasoning strategies, and in improving the diversity of utterances without diverging from human language. We will also explore other negotiation tasks, to investigate whether models can learn to share negotiation strategies across domains.


We would like to thank Luke Zettlemoyer and the anonymous EMNLP reviewers for their insightful comments, and the Mechanical Turk workers who helped us collect data.