Show, Price and Negotiate: A Hierarchical Attention Recurrent Visual Negotiator

by Amin Parvaneh, et al.
The University of Adelaide

Negotiation, as a seller or buyer, is an essential and complicated aspect of online shopping. It is challenging for an intelligent agent because it requires (1) extracting and utilising information from multiple sources (e.g. photos, texts, and numerals), (2) predicting a suitable price for the products to reach the best possible agreement, (3) expressing the intention conditioned on the price in natural language, and (4) consistent pricing. Conventional dialog systems do not address these problems well. For example, we believe that the price should be the driving factor of the negotiation and should be understood by the agent. Conventionally, however, the price was simply treated as a word token, i.e. part of a sentence sharing the same word embedding space with other words. To that end, we propose our Visual Negotiator, an end-to-end deep learning model that anticipates an initial agreement price and updates it while generating compelling supporting dialog. For (1), our visual negotiator utilises an attention mechanism to extract relevant information from the images and textual description, and feeds the price (and the later refined price) as a separate, important input to several stages of the system, instead of treating it as part of a sentence. For (2), we use the attention to learn a price embedding from which an initial value is estimated. Subsequently, for (3), we generate the supporting dialog in an encoder-decoder fashion that utilises the price embedding. Further, we use a hierarchical recurrent model that learns to refine the price at one level while generating the supporting dialog at another; for (4), this hierarchical model provides consistent pricing. Empirically, we show that our model significantly improves negotiation on the CraigslistBargain dataset, in terms of the agreement price, price consistency, and language quality.






1 Introduction

Negotiation is an integral part of human interaction, and a capability that true artificial intelligence must possess. It is a complex task that requires reasoning about a mutual agreement, understanding the counterpart, appealing to sympathy, and uttering convincing arguments. The prevalence of online shopping provides a test bed for an artificial agent that can negotiate on behalf of a human for the most suitable price. Such an agent has to see the photos of the advertised items, understand the textual content, and conduct a dialogue with its counterpart to reach an agreement.

Figure 1: A Visual Negotiator example. Two agents play the roles of seller and buyer in a visually-grounded bargain over an item introduced by an image, a free-text title, and a free-text description. The agents need to understand the real value of the item by assessing its image and free text, and follow a policy that generates human-like sentences with integrated prices.

When negotiation is considered in the context of textual interaction, it is closely related to goal-oriented dialogue systems (visual or otherwise) that aim at achieving a particular objective through a conversation between two agents. There are, however, two major differences: firstly, negotiation is competitive rather than cooperative, as most existing methods are (see e.g. [9, 4, 22, 5, 28]); secondly, negotiation is driven by the price and the agent's anticipation of the true value of a given item, which existing goal-oriented dialogue systems lack.

DealOrNoDeal [15] pioneered negotiation in dialogue systems. DealOrNoDeal simplifies negotiation as a game between two agents that have to agree on splitting a set of 3 items. Recently, [10] collected a dataset of negotiations between humans on Craigslist items and used sequence-to-sequence (Seq-Seq) encoder-decoder models on words and coarse dialogue acts to tackle this problem. They treated prices like words by forcing the model to learn a semantic representation for them during training, which is not effective. In both these methods, price consistency is a major issue, since prices are generated based on their correlation with surrounding words rather than the true underlying value of the item that negotiation requires. For instance, the seller might agree to sell an item for $100 by saying "I can do that", while the buyer offers $120 at the next turn, in conflict with their agreement. Such mistakes, especially when the agent suggests its final offer, can have destructive effects on the outcome of the negotiation. A better model must produce consistent prices and reach human-like agreements.

In this paper, we propose the first visual negotiator in which the price, as an important part of the negotiation, is disentangled from the dialog generation. Unlike prior work, the price is estimated using both the textual and visual content of an item. This is motivated by the fact that the attributes of an item, whether visualised in its photos or mentioned in its free-text description, play an undeniable role in a bargain and help the buyer or seller estimate the item's price. Our approach is a hierarchical end-to-end model that utilises an attention mechanism to first anticipate an initial price and subsequently refine it through each round of negotiation. It predicts the next price based on the initial price estimate, the dialogue history, and the latest prices proposed by the agents, instead of traditionally treating prices like any other words in the vocabulary. The combination of the predicted price and the utterance generated at the other level is presented as the next utterance. In summary, our approach generates dialogues that are linguistically rich and price-wise consistent (see Figure 1).

Since the evaluation of such models is generally challenging, we consider two metrics: (1) the ability of the agent to reach an agreement price as similar as possible to the price that humans agree upon, and (2) the human-likeness of the generated text, measured by its similarity to the ground truth using Intent-BLEU (IBLEU), a novel metric for intent similarity between two dialogues. This leads to an unsupervised evaluation, as opposed to the ones currently used in the literature.

We tested our proposed model on CraigslistBargain [10]. Our experiments show that the language quality of the utterances generated by our approach is better than that of the baseline methods in terms of human intention similarity and lexical and sentence diversity, and that the agreed prices are closer to those reached by humans.

In summary, our main contributions are as follows:

  1. We develop an AI agent that can negotiate to sell or buy an item considering its photo, textual title and description, and listing price.

  2. Our model learns to estimate and refine the price for the advertised item by taking into account its visual and textual information and dialog history. In our model, the agent’s anticipation of the price drives the dialog generation as opposed to the existing methods.

  3. Our model generates consistent prices during the dialog, and the price generation is not tangled with the language model.

  4. We propose several new metrics to evaluate negotiation systems objectively. In addition, we remove human involvement in the evaluation thus it is now fully automatic. This enables scalable and commercially viable applications and reduces human bias and inconsistency.

2 Related Work

2.1 Goal-Oriented Dialogue Systems

Goal-oriented dialogue systems have a long history in the natural language processing (NLP) community. Recently, they have gained strong interest in the computer vision community in the context of visual goal-oriented dialogue systems. For instance, De Vries et al. [5] introduced a novel problem in computer vision and dialogue systems in which a questioner aims to guess the target object in an image by asking short-answer questions of the answerer. Lee et al. [14] developed another vision-based, goal-oriented dialogue system that answers counting questions about the properties of digits shown in an image. However, since the machine plays just one role (either questioner or answerer) in both applications, they are Visual Question Answering problems by nature rather than two-way, interactive dialogue systems [26].

Goal-oriented dialogue systems can be categorised into collaborative and competitive systems. In a collaborative dialogue environment, agents help each other to reach a common goal. Applications include trip and accommodation reservation [8, 24], information seeking [22], mutual friend searching [9], navigation [7, 4], and fashion product recommendation [18]. In contrast, in a competitive dialogue environment, agents must negotiate to achieve an agreement based on their individual goals, which are often opposed: a buyer wishes to pay as little as possible, and a seller wishes to sell for as much as possible. Dialogues generated by players in the 'Settlers of Catan' game, where players negotiate to share their resources, formed the first dataset introduced in this context [3]. Recently, Lewis et al. [15] developed the DealOrNoDeal dataset, in which the agents negotiate to reach a deal dividing a set of objects. Very recently, He et al. [10] introduced a new negotiation dataset by crawling tangible negotiation scenarios from the Craigslist website. In each scenario, a seller advertises an item by providing images, a description, a title, and a listing price. The seller negotiates with a buyer who tries to buy the item at the lowest possible price. Although our work is built on the same dataset, there are significant differences: (1) we propose to use the photos of the item as an important source of knowledge, which was neglected in [10]; (2) we aim to estimate and refine the price in a consistent manner and produce human-like dialogues.

2.2 End-to-End Dialogue Systems

Goal-oriented dialogue systems can be designed in a modular fashion or end-to-end. A modular system typically has three main components: (1) a natural language understanding (NLU) unit that maps an utterance into semantic slots to be understood and processed by the machine, (2) a dialogue manager (DM) that selects the best action according to the output of the NLU, and (3) a natural language generator (NLG) that produces a meaningful response based on the action chosen by the DM, either by looking up a set of possible responses for that action or by using a statistical machine-learning language model [2, 29].

Optimising one component of a modular model at a time can be time-consuming and complicated, as changing one component affects the performance of the others. To overcome or bypass these issues, end-to-end systems have been proposed and have gained much attention in recent years [25, 1, 6, 17, 16, 23, 27, 7].

These systems often use an encoder-decoder architecture, consisting of an encoder that receives the previous utterance(s) and encodes them so that the decoder can predict and generate the next utterance. Each utterance consists of a sequence of words that are mapped to vectors using pre-trained embeddings. In the end-to-end model proposed by He et al. [10], prices are embedded like other words in the utterance. Since the range of prices is broad and no pre-trained embedding exists for them, their embedding is learned during model training. In addition, the generated prices are inconsistent, since they are produced based on correlation with other words rather than the true underlying value of the item. Furthermore, this way of embedding prices adds complexity to the model and leads to a weaker language model. In this paper we show that eliminating prices from the model's vocabulary helps the language model generate better dialogues. We propose to predict the price using a separate neural network that is added to the overall architecture.

3 Visual Negotiator

Figure 2: The IPE produces an initial price for an item by considering its visual and textual features (from the title and the description). Text is processed as an attention source after being mapped to a sequence of pre-trained word embeddings (GloVe). Images are fed into a pre-trained network (ResNet-101) that produces a fixed-size vector representation. The visual and textual representations are concatenated and fed into a linear transformation to estimate the price.

3.1 Problem Definition

The problem we consider is that of two agents, namely a seller and a buyer, negotiating the price of an item identified by an image, a textual title, and a description. The items are classified into various categories, as is common practice on online shopping websites. The seller advertises an item with a listing price and most likely agrees to offers closest to this value. The buyer, on the other hand, has a target price which is lower than the seller's listing. While the buyer knows the listing price, their target price is not revealed to the seller. It should be noted that a negotiation may end without an agreement.

For each advertised item, there are multiple scenarios that humans can negotiate over. These scenarios are designed with various target prices for the buyer, which leads to completely different dialogues and agreed prices.

Each scenario of this negotiation consists of an advertised item providing context information: the visual cue (i.e. the photo) of the item, the category in which the item has been advertised, the title of the advertisement, and the description provided for the item. Additionally, at each dialogue turn, the sequence of utterances from previous turns is available as the dialogue history. Note that each utterance is a sequence of words (tokens), and each word is represented as a fixed-dimensional vector.

At each round of negotiation, the agent generates the next token conditioned on the context information, the dialogue history, and the tokens generated previously in the current turn. The objective is to mimic the behaviour of a human negotiator as closely as possible. Consequently, the prices agreed upon by an agent have to be as similar as possible to those of a human, reached using convincing arguments.

To that end, we propose our Visual Negotiator, consisting of two main components: (1) an initial price estimator (IPE) network that determines a base value for an item considering its visual and textual information, and (2) a hierarchical price-based negotiation model that refines the price and produces the dialogue at each turn using all available information.

3.2 Initial Price Estimator

Since in this paper we disentangle price estimation from language generation, in the first step we estimate the price from the context information of the item. This allows the agent to have an initial guess of how much an item is worth, and at what price an agreement can be made, prior to the negotiation dialogue. We call this component the initial price estimator (IPE). Given the context information of an item, the IPE predicts a scalar value for the agreement price. This initial estimate is based on both the visual features of the item, extracted from its photo, and its textual features, extracted from its category, title and description. The IPE aims to minimise the difference between the predicted price and the ground-truth agreed price, which is calculated as the average of all agreed prices in human-human negotiations over the given item in the dataset. To predict the price, we learn a deep neural network f parameterised by θ, by minimising the squared error between its prediction on the context c and the ground-truth agreed price p̄:

ℓ(θ) = (f_θ(c) − p̄)²     (1)
The architecture of the IPE network is shown in Figure 2. As seen, the IPE has two major streams for extracting visual and textual information from the input. For visual cues, it maps the photo of the item into a fixed-dimensional vector. For this mapping, we utilise an off-the-shelf pre-trained model, ResNet-101 [11], which is then fine-tuned during the minimisation of Eq. 1.

For textual information, we use a pre-trained embedding (e.g. GloVe [21]) to map the words to vectors in a d-dimensional space. Each sentence, in either the title or the description, is then the concatenation of its word embeddings. Subsequently, we apply an attention mechanism on these embeddings to focus on the important words in the title and description based on the category: the attention weights are computed from the word embeddings conditioned on the category embedding, and their weighted sum yields a vector that summarises the title, description and category. The category in particular carries essential information, since a house has a very different price baseline from a piece of furniture.
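The attention step described above can be sketched as a standard additive attention over word embeddings, conditioned on the category embedding. All weight shapes and names below are illustrative, not the paper's:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend_text(word_embs, cat_emb, Wh, Wc, v):
    """Additive attention: score each word embedding against the
    category embedding, then return the attention-weighted sum.

    word_embs : (n_words, d) word embeddings of title + description
    cat_emb   : (d,)         embedding of the item category
    Wh, Wc    : (k, d)       projection matrices (illustrative)
    v         : (k,)         scoring vector (illustrative)
    """
    scores = np.array([v @ np.tanh(Wh @ w + Wc @ cat_emb) for w in word_embs])
    alpha = softmax(scores)            # attention weights, sum to 1
    return alpha @ word_embs, alpha    # (d,) summary vector, weights

rng = np.random.default_rng(0)
d, k, n = 8, 6, 5
summary, alpha = attend_text(rng.normal(size=(n, d)), rng.normal(size=d),
                             rng.normal(size=(k, d)), rng.normal(size=(k, d)),
                             rng.normal(size=k))
```

The resulting summary vector plays the role of the textual representation that is later concatenated with the visual features.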

Finally, the visual and textual representations of the item are concatenated and fed into a two-layer fully connected network for initial price estimation. The vector produced by the second-to-last layer serves as a price embedding that is fed into the hierarchical price-based negotiation model for later price refinement, as discussed in the next section.

3.3 Hierarchical Recurrent Price-Based Negotiation Model

Figure 3: Hierarchical recurrent price-based negotiation model. It is an end-to-end model that encodes each utterance at two levels, conditioned on the initial price estimate. The encodings are fed into two separate decoders responsible for predicting the next utterance (without the price) and the next price, respectively. The outputs of the two decoders are combined by the utterance builder into a human-like and price-wise consistent utterance.

With the initial price (estimated by IPE), our end-to-end dialogue system generates utterances at each turn based on the current estimated price and the history of the dialogue.

One of the problems with conventional end-to-end Seq-Seq models is that they add price values to the vocabulary and treat them like ordinary words in the dialogue. This deters the intelligent agent from understanding the numerical meaning of the prices, and entangles the strategies for generating words and prices. As a result, the prices generated in the dialogue, especially at the final offering turn, are inconsistent in most cases (see Seq-Seq in Figure 1 for an example).

In our visual negotiator we devise a novel hierarchical recurrent price-based negotiator in which the prices in the utterances are replaced with a fixed token (<price>), to be later substituted with generated values. Our model encodes utterances at two levels: a word level and a decision level. At the word level, a recurrent neural network (RNN) is applied to each utterance to find a word-level representation. At the decision level, two RNNs are applied to these word-level representations to determine (1) what price should be offered, and (2) which sequence of words should be generated. The decision-level encodings are conditioned on the initial price estimate.

Our hierarchical recurrent price-based negotiator is comprised of the following components as shown in Figure 3:

(1) Word-level encoder is an RNN that maps the word embeddings of each utterance (a sequence of at most a fixed number of words) into a fixed-dimensional vector as the word-level representation of the utterance.

(2) Language-related encoder is an RNN producing a vector that represents the history and context of the dialogue, used to anticipate the next utterance.

(3) Price-related encoder is an RNN that receives the word-level representations of the previous utterances as input at each turn and maps them to a fixed-dimensional vector. Since this representation should be conditioned on the initial price estimate produced by the IPE, we feed the initial price embedding into this RNN as its initial hidden state.
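The two-level encoding in components (1)-(3) can be sketched with plain tanh RNN cells, a simplified stand-in for the paper's LSTMs; all shapes and initialisations below are illustrative:

```python
import numpy as np

def rnn(inputs, h0, Wx, Wh):
    """Minimal tanh RNN: fold a sequence of vectors into a final state."""
    h = h0
    for x in inputs:
        h = np.tanh(Wx @ x + Wh @ h)
    return h

rng = np.random.default_rng(1)
d_word, d_hid = 8, 6
Wx_w, Wh_w = rng.normal(size=(d_hid, d_word)), rng.normal(size=(d_hid, d_hid))
Wx_l, Wh_l = rng.normal(size=(d_hid, d_hid)), rng.normal(size=(d_hid, d_hid))
Wx_p, Wh_p = rng.normal(size=(d_hid, d_hid)), rng.normal(size=(d_hid, d_hid))

# Word level: fold each utterance in the history into one vector.
utterances = [rng.normal(size=(n, d_word)) for n in (4, 7, 3)]
utt_vecs = [rnn(u, np.zeros(d_hid), Wx_w, Wh_w) for u in utterances]

# Decision level: a language-related encoding (zero-initialised) and a
# price-related encoding whose initial state is the IPE price embedding.
price_emb = rng.normal(size=d_hid)
lang_state = rnn(utt_vecs, np.zeros(d_hid), Wx_l, Wh_l)
price_state = rnn(utt_vecs, price_emb, Wx_p, Wh_p)
```

The two decision-level states then seed the language decoder and the price decoder, respectively.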

(4) Language decoder is an RNN that receives the output of the language-related encoder as its initial hidden state and generates an output conditioned on its previous hidden state and the previous word (starting from a fixed start token). In order to force the output to be conditioned on the most important parts of the available information sources at each time step, a global attention mechanism [19] is applied to the outputs of the language decoder. This helps the system ask or answer questions about different sources, including the title, the description, and the word-level encodings of the previous utterance. At each time step, the attention mechanism learns a weighted sum of the information in each source (each information source being a sequence of tokens); the weight vector for each source has a length equal to that of the corresponding source. Attention is applied to the output of the decoder just before the generator layer. To map the outputs of the end-to-end model to a probability vector over the vocabulary, a linear function (generator layer) followed by a LogSoftmax is applied to the output of the model. With the language decoder, we find the parameters of the RNN that maximise the likelihood of each word.


(5) Price decoder is a multilayer perceptron (MLP) that predicts the next price to propose. This prediction is conditioned on the initial price estimate, the dialogue history, and the prices currently suggested by both the agent and the opponent. Using the initial price embedding as its initial state, the last state of the price-related encoder learns a representation of the item's value and the history of the dialogue. It is therefore fed into the price decoder network as input and mapped to a scalar value. We use this RNN to decide whether the agent should insist on the current price. To optimise its parameters, we use the binary cross-entropy between the ground-truth human decision to insist and the predicted one. The final value proposed by the agent is derived from the current price, the opponent's offer, and the decision to insist; these three values are the inputs to the last layer of the MLP, which predicts the price to offer in the next turn.

(6) Utterance builder receives the outputs of the language and price decoders and replaces the fixed token (<price>) in the generated utterance with the price predicted by the price decoder.
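The price decoder described in component (5) can be sketched as a small two-stage computation: an insist/concede probability from the price encoder state, followed by a linear head over the three price inputs. The layer shapes and names below are our own, not the paper's:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def price_decoder(price_state, own_price, opp_price, w_ins, b_ins, w_out):
    """Sketch of the price decoder.

    price_state : (h,) last state of the price-related encoder
                  (initialised from the IPE price embedding)
    own_price   : price currently proposed by the agent
    opp_price   : price currently proposed by the opponent
    w_ins, b_ins, w_out : illustrative parameters of the two stages
    """
    # Stage 1: binary decision - insist on the current price or concede?
    p_insist = sigmoid(w_ins @ price_state + b_ins)
    # Stage 2: next price from [own price, opponent price, insist decision].
    features = np.array([own_price, opp_price, p_insist])
    return float(w_out @ features), p_insist

rng = np.random.default_rng(2)
h = 6
next_price, p_insist = price_decoder(
    rng.normal(size=h), 100.0, 80.0,
    rng.normal(size=h), 0.0, np.array([0.6, 0.4, 5.0]))
```

In training, stage 1 is supervised with the (weighted) binary cross-entropy against the human decision to insist, as described in the implementation details.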

4 Experiments

4.1 Dataset

All the experiments are performed on the CraigslistBargain dataset collected by He et al. [10]. To the best of our knowledge, it is the only publicly available dataset that contains conversations for selling or buying items with photos and free-text descriptions. We only use the scenarios in the dataset with photos (as we are building a visual negotiator), which results in 4,219 training dialogues, 471 evaluation dialogues, and 500 test dialogues, each split created from a different set of items.

Table 1 shows the mean, standard deviation, and MAD (median absolute deviation) of the real price for each category in the training dataset. Considering the small number of samples in the dataset and the diversity of the prices (reflected in the per-category standard deviations), predicting an initial ideal price for an item from just its image and free-text descriptions is a complicated task.

Category #Samples Mean Std MAD
Car 170 $8,684 $7,597 $4,983
Housing 204 $2,128 $1,054 $602
Phone 74 $193 $159 $129
Bike 178 $588 $767 $543
Electronics 121 $164 $266 $153
Furniture 247 $243 $315 $216
All 994 $2,122 $4,432 $2,386
Table 1: Distribution of agreed prices in the training and evaluation sets
Category #Samples Mean Std IPE Div Final Div
Car 23 $7,612 $5,395 $3,887 $378
Housing 27 $1,977 $596 $458 $119
Phone 10 $158 $189 $125 $5
Bike 25 $546 $612 $422 $24
Electronics 14 $88 $93 $69 $3
Furniture 35 $235 $317 $167 $119
All 134 $1,889 $3,522 $898 $91
Table 2: Agreement price results. IPE Div shows the average divergence of the initial prices (produced by the IPE) from the humans' agreed prices. These gaps drop to lower values when Visual Negotiators converse with each other to reach an agreement (Final Div).

4.2 Implementation Details

In all the experiments, we use 300-dimensional vectors as the embedding for each word, from pre-trained GloVe embeddings [21]. All RNNs used as encoders or decoders are 2-layer LSTMs with a dropout rate of 0.3 and fixed-size hidden states. Parameters of the models are optimised using Adam [13] with a learning rate of 1e-3 over 40 epochs with batches of size 128. After 20 epochs of training, the learning rate is decayed to 1e-4.

To extract features from the images, we utilised the pre-trained ResNet-101 [11], which has shown exceptional performance in various vision problems. We replaced the final fully connected layer of the network with another fully connected layer to produce a fixed-size vector representing the image features. During training, we only fine-tuned the final convolutional layer of the network to obtain better image representations with respect to the agreement price.

In the experiments, the price decoder branch of the hierarchical price-based negotiation model makes a simple binary decision: at each step, it decides whether the currently proposed price should be altered or kept. Specifically, the agent always begins from the listing price if it acts as the seller, and from a fraction of the listing price if it acts as the buyer. Then, at each turn, the price predictor decides whether or not to change the current price. If it decides to alter the price, and the generated utterance contains a price token, it decreases (as the seller) or increases (as the buyer) the current price by a fixed fraction of the gap between the listing price and the target price. Since the ratio of positive (decision to change the price) and negative samples is imbalanced, we implemented a weighted version of the binary cross-entropy (WBCE) loss, which applies a larger weight (three times bigger) to positive samples.
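The weighted binary cross-entropy for the change/keep decision can be written directly; the 3x positive weight follows the text above, while the function itself is a generic sketch:

```python
import math

def weighted_bce(y_true, y_prob, pos_weight=3.0):
    """Binary cross-entropy with a larger weight on positive samples
    (decisions to change the price), averaged over the batch."""
    eps = 1e-12
    total = 0.0
    for y, p in zip(y_true, y_prob):
        total += -(pos_weight * y * math.log(p + eps)
                   + (1 - y) * math.log(1 - p + eps))
    return total / len(y_true)

# A confident wrong answer on a positive sample costs ~3x more than the
# same wrong answer on a negative sample.
loss_pos_miss = weighted_bce([1], [0.1])
loss_neg_miss = weighted_bce([0], [0.9])
```

Up-weighting positives this way pushes the classifier away from the trivial "never change the price" solution encouraged by the class imbalance.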

4.3 New Evaluation Metrics

One of the main obstacles to training goal-oriented dialogue systems is that there is no clear performance metric for generated dialogues, and as such, qualitative evaluations are conducted by humans, which is subjective. We remove human involvement by proposing to run two trained agents against each other, one as a seller and the other as a buyer. For each scenario, the image, title, description, and category of an item are given to both agents. The agents first make an initial estimate of the item's price and then generate utterances in a conversation for selling or buying the item. The negotiation is successful if the agents reach an agreement at the end of the dialogue. Conversely, a dialogue is unsuccessful when the agents do not reach an agreement within the maximum number of turns (24), or when one side decides to quit the negotiation.

Moreover, we define several dialogue evaluation metrics, which fall into two groups: (1) metrics that evaluate the language quality (human-likeness) of the generated dialogue, and (2) metrics that evaluate the pricing strategy of the model.

Model IBLEU Dialogue Length Utterance Length Sentence Diversity Vocabulary Diversity
SL(word)[10] 0.36 7 10 0.03035 0.03846
SL(act)+rule[10] 0.20 18 7 0.4984 0.04667
HRED[15] 0.37 10 9 0.3158 0.03362
Visual Negotiator-M 0.34 12 9 0.4594 0.03519
Visual Negotiator 0.42 11 9 0.4272 0.04351
Table 3: Language evaluation metrics for the models. ↑ indicates higher is better; ↓ indicates lower is better.
Model Agreed Price Divergence Price Inconsistency Offer Inconsistency Price Change F-score

SL(word)[10] 16% 6% 6% -
SL(act)+rule[10] 9% 1% 9% -
HRED[15] 13% 6% 17% -
Visual Negotiator-M 6% 1% 1% 62%
Visual Negotiator 5% 1% 1% 65%
Table 4: Pricing policy evaluation metrics for the models. ↑ indicates higher is better; ↓ indicates lower is better.

4.3.1 Language Metrics. Here we introduce Intent-BLEU (IBLEU), a new metric that measures the similarity of the actions taken by a machine in a machine-machine dialogue to those taken by humans in a human-human dialogue. IBLEU is inspired by BLEU from machine translation evaluation [20]. More specifically, we extract the intent of each turn in a machine-generated dialogue using the information-retrieval approach introduced by He et al. [10] to create a sequence of intents, and then compare this sequence to the sequence of intents extracted from the human-generated dialogue. This is done by calculating the modified n-gram precision (for a maximum order of 4) of the generated sequence. Higher values of this metric indicate a higher level of similarity with human actions.
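A minimal version of the modified n-gram precision underlying IBLEU, applied to intent sequences rather than words, could look like this (a sketch following the BLEU definition [20]; the plain geometric-mean combination and the absence of a brevity penalty are our simplifications):

```python
import math
from collections import Counter

def modified_ngram_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate intent sequence
    against a reference intent sequence, as in BLEU."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    if not cand:
        return 0.0
    clipped = sum(min(c, ref[g]) for g, c in cand.items())
    return clipped / sum(cand.values())

def ibleu(candidate, reference, max_n=4):
    """Geometric mean of modified 1..max_n-gram precisions over intents."""
    precisions = [modified_ngram_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    return math.exp(sum(math.log(p) for p in precisions) / max_n)

machine = ["intro", "greet", "ask", "counter-price", "counter-price",
           "agree", "offer"]
human = ["intro", "greet", "ask", "counter-price", "agree", "offer"]
score = ibleu(machine, human)
```

The intent labels above are hypothetical; in practice they would be the dialogue acts extracted by the retrieval approach of [10].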

One problem with conventional dialogue systems is that the model repeats the same sentence, e.g. "I can do that.". This is artificial and dull, and should be avoided. As another new metric of language quality, we count the number of distinct sentences produced by the model, normalised by the total number of sentences. We also compute the same metric at the word level to measure the model's lexical diversity.
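Both diversity metrics can be computed directly from the generated dialogue; the sketch below uses naive whitespace tokenisation for illustration:

```python
def sentence_diversity(sentences):
    """Ratio of distinct sentences to total sentences."""
    return len(set(sentences)) / len(sentences)

def vocabulary_diversity(sentences):
    """Ratio of distinct words to total words (naive tokenisation)."""
    words = [w for s in sentences for w in s.lower().split()]
    return len(set(words)) / len(words)

dialogue = ["i can do that .", "i can do that .", "how about 90 ?"]
s_div = sentence_diversity(dialogue)
v_div = vocabulary_diversity(dialogue)
```

A model that keeps repeating "I can do that." scores low on both ratios, which is exactly the degenerate behaviour the metrics are meant to expose.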

Apart from IBLEU, we apply various word-level and sentence-level metrics to assess the richness of the generated dialogues in machine-machine negotiations. Dialogue length and utterance length are two such language metrics. However, for a fair assessment they should be considered together with IBLEU and the sentence and lexical diversity metrics, because a dialogue system that generates long repetitive responses is not linguistically acceptable.

Figure 4: An example from the dialogues generated by the visual negotiator.

4.3.2 Pricing Metrics. Two important metrics that measure the pricing mistake ratio are price inconsistency and offer inconsistency. Naturally, during a negotiation a buyer should only increase their proposed buying price, and a seller should only decrease their proposed selling price. When a buyer suggests a price lower than a price they themselves previously suggested, this counts as inconsistent pricing. Similarly, a seller proposing a price higher than their last proposed price counts as inconsistent pricing.
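This inconsistency check can be performed mechanically on the sequence of proposals; the event representation below is our own:

```python
def count_inconsistencies(proposals):
    """Count inconsistent price moves in a list of (role, price) events.

    A buyer should never go below their own previous proposal, and a
    seller should never go above their own previous proposal.
    """
    last = {}           # last price proposed by each role
    mistakes = 0
    for role, price in proposals:
        if role in last:
            if role == "buyer" and price < last[role]:
                mistakes += 1
            elif role == "seller" and price > last[role]:
                mistakes += 1
        last[role] = price
    return mistakes

events = [("seller", 120), ("buyer", 80), ("seller", 110),
          ("buyer", 100), ("buyer", 90)]   # buyer drops 100 -> 90: mistake
n_bad = count_inconsistencies(events)
```

Normalising this count over all price moves in a corpus of dialogues gives the price inconsistency ratio reported in Table 4.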

We also measure the ratio of offering a wrong price at the end of the negotiation. Since a mistake in the offered price can lead the opponent to abandon the negotiation (when the mistake is disadvantageous to them) or cause a loss for the agent (when the mistake is in the opponent's favour), this measure is very important.

The similarity of the prices accepted in machine-machine dialogues to those accepted in human-human dialogues shows how closely the agents' understanding of an item's real price matches human perception. Therefore, we use the divergence of the agreed price from the human-agreed price as another important measure of the model's pricing strategy.

4.4 Baseline and Ablation Methods

To compare our models against other approaches, we trained three state-of-the-art methods that treat prices as words. The first two models match those proposed in [10] on CraigslistBargain; evaluated under supervised learning, these two models generated the most human-like dialogues. The first is a simple sequence-to-sequence model, SL(word), and the second is a modular approach, SL(act)+rule, which applies various hand-crafted rules to repeat utterances produced by humans. Additionally, a Hierarchical Recurrent Encoder-Decoder (HRED), widely used as an end-to-end approach for dialogue systems, was trained as another baseline.

In addition, we trained a variation of our visual negotiator model, which we call Visual Negotiator-M. In Visual Negotiator-M, the high-level language-related and price-related encoders are merged into a single RNN; in other words, this model uses a single representation of the dialogue history for both the price decision and utterance generation.

4.5 Results

Figure 5: Examples from two end-to-end models. The left-hand dialogue is generated by a simple Seq-Seq model, which makes a mistake in its offer (moving away from the price that was agreed upon), while the right-hand dialogue is produced by the Visual Negotiator. As observed, our approach creates a linguistically more diverse and price-wise reasonable dialogue.

4.5.1 Initial Price Estimation Results.

The attention model proposed for initial price estimation can predict a reasonable agreement price for an item by extracting important features from its images and description. Although a glance at the Mean Absolute Error (MAE) of the model in Table 2 might suggest inaccurate price estimation, a closer look reveals large gaps between the MAE values and the Mean Absolute Deviation (MAD) of the prices in each category. This means the model learns to price items significantly better than simply predicting the mean value of each category. Nevertheless, access to more samples, or pre-training the model on a large dataset, would help it better predict the value of each item.
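The MAE-versus-MAD comparison rests on a simple observation: the MAD of a category is exactly the error a trivial model would incur by always predicting that category's mean price. A minimal sketch of both quantities (illustrative, not the evaluation code):

```python
def mae(preds, targets):
    """Mean absolute error of the model's price predictions."""
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(targets)

def mad(targets):
    """Mean absolute deviation of prices from their category mean:
    the error of a baseline that always predicts the mean."""
    mean = sum(targets) / len(targets)
    return sum(abs(t - mean) for t in targets) / len(targets)
```

A model whose MAE falls well below the per-category MAD is therefore using the item's images and description, not just memorising category averages.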

4.5.2 Language Evaluation Results. Table 3 shows that eliminating prices from the language vocabulary improves language quality in end-to-end approaches. In particular, dialogues generated by the Visual Negotiator models enjoy remarkably greater language diversity at both the word level and the sentence level: the ratio of repetitive sentences, a common problem in dialogue generation, has decreased significantly in both variations of the proposed framework compared to the SL(word) and HRED models. Additionally, the dialogue and utterance lengths of the dialogues generated by these models are large enough to indicate their richness. Although the results might suggest that SL(act)+rule generates linguistically better dialogues, since its sentence and vocabulary diversity is larger than that of the proposed model, this diversity is due to heuristic rules that select templates from the dataset that are different from the previously selected ones.

Furthermore, a brief look at the IBLEU scores demonstrates the superior performance of the Visual Negotiator model compared to all other models, meaning that it acts most similarly to humans across different situations.

4.5.3 Pricing Evaluation Results. Table 4 presents the experimental results for the pricing metrics. Notably, both versions of the visual negotiator learn to propose consistent prices while maintaining language quality. Moreover, these models almost never offer prices that conflict with those discussed and agreed upon during the dialogue.

More importantly, the proposed visual negotiator model precisely understands the suitable agreement price for an item. Table 4 shows a remarkable decrease in agreed-price divergence (the difference between the prices agreed by the model and those agreed by humans) for the visual negotiator compared to the other models. In other words, the proposed model first learns the value of the item through an initial price estimate based on its photo and description, which may not be fully accurate. It then reaches agreement on prices very close to those agreed by humans by taking human-like actions, both in generating utterances and in proposing prices.

Last but not least, the price change prediction accuracy of the visual negotiator model is better than that of its variation, visual negotiator-M. Predicting whether or not to change the price at each turn is an essential decision on the way to reaching an agreement. Since the visual negotiator benefits from separate price-related and language-related encoder parameters, it makes better decisions on whether the price should change. Table 4 shows a 3% increase in F-score compared to visual negotiator-M. Samples from the generated negotiations are provided in Figures 4 and 5.
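The F-score here is the standard F1 for the binary per-turn decision "change the price or not". A short sketch of that definition (standard formula, included only to make the reported gap concrete):

```python
def f_score(preds, labels):
    """F1 score for binary predictions: harmonic mean of precision
    and recall over the positive ("change the price") class."""
    tp = sum(1 for p, l in zip(preds, labels) if p and l)
    fp = sum(1 for p, l in zip(preds, labels) if p and not l)
    fn = sum(1 for p, l in zip(preds, labels) if not p and l)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```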

5 Conclusion and Future Works

In this paper, we proposed a hierarchical attention model for buyer-seller negotiation. Our model, Visual Negotiator, consists of (1) an attention-based approach that initially estimates the value of the item conditioned on its visual and textual features, and (2) a hierarchical end-to-end dialogue model that generates an utterance and proposes a price based on the initial pricing. Experiments on the CraigslistBargain dataset show the superior performance of the proposed model both linguistically and price-wise.

Although the proposed models generate dialogues akin to humans', we believe there is a long way to go to build a system that can compete with humans in understanding, planning, and following a strong strategy towards its goal. In future work, we consider improving the current approach by: (1) expanding the dataset to encompass more samples, which would improve the initial price estimator module; (2) applying reinforcement learning to both the language generation and price estimation components of the system; and (3) applying pre-trained transformer-based language models, such as BERT [12], which may improve understanding and generation performance.


  • [1] Antoine Bordes, Y-Lan Boureau, and Jason Weston. Learning end-to-end goal-oriented dialog. arXiv:1605.07683v4, 2017.
  • [2] Hongshen Chen, Xiaorui Liu, Dawei Yin, and Jiliang Tang. A survey on dialogue systems: Recent advances and new frontiers. ACM SIGKDD Explorations Newsletter, 19(2):25–35, 2017.
  • [3] Heriberto Cuayáhuitl, Simon Keizer, and Oliver Lemon. Strategic dialogue management via deep reinforcement learning. arXiv:1511.08099v1, 2015.
  • [4] Harm de Vries, Kurt Shuster, Dhruv Batra, Devi Parikh, Jason Weston, and Douwe Kiela. Talk the walk: Navigating new york city through grounded dialogue. arXiv:1807.03367v3, 2018.
  • [5] Harm de Vries, Florian Strub, A. P. Sarath Chandar, Olivier Pietquin, Hugo Larochelle, and Aaron C. Courville. Guesswhat?! visual object discovery through multi-modal dialogue. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4466–4475, 2017.
  • [6] Bhuwan Dhingra, Lihong Li, Xiujun Li, Jianfeng Gao, Yun-Nung Chen, Faisal Ahmed, and Li Deng. Towards end-to-end reinforcement learning of dialogue agents for information access. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 484–495, 2017.
  • [7] Ondřej Dušek and Filip Jurcicek. A context-aware natural language generator for dialogue systems. In Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 185–190, 2016.
  • [8] Layla El Asri, Hannes Schulz, Shikhar Sharma, Jeremie Zumer, Justin Harris, Emery Fine, Rahul Mehrotra, and Kaheer Suleman. Frames: a corpus for adding memory to goal-oriented dialogue systems. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 207–219, 2017.
  • [9] He He, Anusha Balakrishnan, Mihail Eric, and Percy Liang. Learning symmetric collaborative dialogue agents with dynamic knowledge graph embeddings. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1766–1776, 2017.
  • [10] He He, Derek Chen, Anusha Balakrishnan, and Percy Liang. Decoupling strategy and generation in negotiation dialogues. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2333–2343, 2018.
  • [11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
  • [12] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805v1, 2018.
  • [13] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv:1412.6980v9, 2017.
  • [14] Sang-Woo Lee, Youngjoo Heo, and Byoung-Tak Zhang. Answerer in questioner’s mind: Information theoretic approach to goal-oriented visual dialog. In NeurIPS, 2018.
  • [15] Mike Lewis, Denis Yarats, Yann Dauphin, Devi Parikh, and Dhruv Batra. Deal or no deal? end-to-end learning of negotiation dialogues. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2443–2453, 2017.
  • [16] Jiwei Li, Will Monroe, Tianlin Shi, Sébastien Jean, Alan Ritter, and Dan Jurafsky. Adversarial learning for neural dialogue generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2157–2169, 2017.
  • [17] Xiujun Li, Yun-Nung Chen, Lihong Li, Jianfeng Gao, and Asli Celikyilmaz. End-to-end task-completion neural dialogue systems. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 733–743, 2017.
  • [18] Lizi Liao, Yunshan Ma, Xiangnan He, Richang Hong, and Tat-Seng Chua. Knowledge-aware multimodal dialogue systems. In Proceedings of the 26th ACM international conference on Multimedia, pages 801–809, 2018.
  • [19] Thang Luong, Hieu Pham, and Christopher D. Manning. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421, 2015.
  • [20] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA, July 2002.
  • [21] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. Glove: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
  • [22] Siva Reddy, Danqi Chen, and Christopher D. Manning. Coqa: A conversational question answering challenge. arXiv:1808.07042v1, 2018.
  • [23] Alessandro Sordoni, Yoshua Bengio, Hossein Vahabi, Christina Lioma, Jakob Grue Simonsen, and Jian-Yun Nie. A hierarchical recurrent encoder-decoder for generative context-aware query suggestion. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, 2015.
  • [24] Wei Wei, Quoc Le, Andrew Dai, and Jia Li. Airdialogue: An environment for goal-oriented dialogue research. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3844–3854, 2018.
  • [25] Tsung-Hsien Wen, David Vandyke, Nikola Mrksic, Milica Gasic, Lina M. Rojas-Barahona, Pei-Hao Su, Stefan Ultes, and Steve Young. A network-based end-to-end trainable task-oriented dialogue system. arXiv:1604.04562v3, 2017.
  • [26] Qi Wu, Damien Teney, Peng Wang, Chunhua Shen, Anthony R. Dick, and Anton van den Hengel. Visual question answering: A survey of methods and datasets. Computer Vision and Image Understanding, 163:21–40, 2017.
  • [27] Chen Xing, Yu Ping Wu, Wei Chung Wu, Yalou Huang, and Ming Zhou. Hierarchical recurrent attention network for response generation. In AAAI, 2018.
  • [28] Tiancheng Zhao and Maxine Eskenazi. Towards end-to-end learning for dialog state tracking and management using deep reinforcement learning. In Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 1–10, 2016.
  • [29] Victor W. Zue and James R. Glass. Conversational interfaces: advances and challenges. In Proceedings of the IEEE, pages 1166–1180, Beijing, China, 2000.