Millions of shoppers contact Amazon’s customer service department every year, where customers may choose between telephone, online chat, or email channels. Most customers will contact Amazon via telephone, which is an especially labor-intensive form of communication. The need for agent labor is highly seasonal, and hiring more agents requires significant ramp-up time for training. Furthermore, Amazon’s order volume increases significantly year-over-year, which makes scaling customer service sub-linearly with order volume especially crucial.
Machine learning and dialogue generation provide an opportunity to make existing agents more efficient, and may allow for the total automation of issue resolution (at least for a select subset of issues). To that end, we present the first steps towards a practical dialogue system for the customer service domain at Amazon. In this work, we focus solely on the response generation module of such a system: given a customer inquiry, generate the text of the response that most likely answers the question asked. An effective dialogue system would automate the handling of a large percentage of customer interactions, potentially generating significant savings in labor costs, reducing the perceived response times for customers, and allowing customer service to scale better with increasing demand.
For well-established domains like customer service, large human-generated corpuses already exist. Indeed, Amazon’s internal online chat corpus is a rich source of data for building a response generation model, independently of transcribed speech from past phone calls. Amazon’s chat corpus also contains customer-selected issue labels, order-related entities, etc. We will use this corpus for model training and experiments.
While open domain dialogue generation remains a topic of ongoing research, we hypothesize that in closed domains like our own, a finite set of response templates would cover the vast majority of interactions. In fact, customer service agents across Amazon already use a collection of ‘blurbs’ which they copy and paste as replies. However, these blurbs are not centrally managed, can be particular to each agent, are unannotated, and number in the thousands. In practice, due to the overhead of searching for the right blurb, each agent only uses a handful of blurbs regularly.
Therefore, we approach response generation as a two-fold problem: determining the templates that should be created based on past agent replies to customer questions, and choosing the correct template as the response to an inquiry.
A template-based approach addresses some issues that affect existing dialogue generation systems, namely relevance, text quality, diversity, and (for goal-oriented systems) the need for annotated customer intents. The templates that we extract can be filtered for high relevance and specificity, corrected for consistency of tone, and enriched with the addition of slots for customer profile metadata and other forms of context. Furthermore, the use of fixed templates allows us to better tune text-to-speech systems to produce more natural-sounding speech.
We built a prototype response generation model based on online chats between customers and agents, and evaluated it against a random sample of past chat conversations. We showed that the selected templates cover a large portion of past customer inquiries, and that human evaluators (who are customer service agents) preferred the model-selected templates to the templates retrieved by a tf-idf baseline. The system is not yet customer-facing, and we conclude by discussing some of the work remaining.
2 Related Work
Recently, deep learning-based systems for question answering and dialogue have been the focus of both academic and industrial research. In dialogue systems, Vinyals et al. [1] and Serban et al. [2] demonstrated that encoder-decoder networks with LSTM units can generate dialogue based on IT help desk and movie script corpuses. For question answering problems, Sukhbaatar et al. [3] are able to achieve competitive performance on the so-called bAbI tasks [4] with memory networks and limited supervision [5]. Last year, Google launched Smart Reply, an email response recommendation system that recommends short replies for 10% of Gmail volume in their Inbox mobile application [6]. Google’s new messaging app Allo also uses the same technology to recommend responses for mobile chats.
In this paper, we applied a Siamese-like network [7] with two encoders to build a response generation system for a subset of customer service chats related to item delivery problems. (In particular, we selected chat contacts where the customer indicated a “Where’s My Stuff?” issue.) In the context of information retrieval, Lowe et al. [8] also used a similar network to retrieve the next reply from a corpus of Ubuntu technical help IRC chats.
Our approach is most similar to that described in Smart Reply, but with certain differences. Firstly, while Smart Reply and our system both generate replies from a fixed set of templates, we do not need to perform beam search or generate the text directly. This simplifies the engineering effort required for deployment and speeds up response generation, since our responses can be longer than the 10 token limit that the Inbox UI suggests. Secondly, while we both use clustering techniques and manual inspection to extract an initial set of templates, we perform the clustering in a fully automatic fashion, without the need for intent clusters initialized with human expertise. Clustering around pre-specified intents is important for an open domain corpus like emails, since there would be a huge number of topic clusters in the dataset, whereas in closed domains this is less important.
The paper is organized as follows: Section 3.2 provides details about the dual encoder network model. Section 3.3 discusses how we create the pool of template answers that our dual encoder network model selects from. Section 4 presents evaluations of our system and Section 5 discusses future improvements.
3 Models for Response Generation
We used one year of Amazon customer service chat transcripts on item delivery issues from 09/2015 - 09/2016 for creating training data. The raw text is split into agent and customer turns, tokenized, filtered for sensitive customer information (e.g. names, credit card numbers, etc.), and converted to lowercase.
The data needed for training the dual encoder network are pairs of customer questions and agent responses with binary labels of whether or not they are a match.
To extract meaningful question-answer pairs, we select every customer turn in the conversation that ends with a question mark; the agent turn after it is considered the correct reply. These matching pairs constitute our positive samples.
To create non-matching pairs (i.e. negative samples), we use the same set of customer questions, but for each question, we randomly select an agent turn that follows some other customer question in the corpus. We created 3.3 million training samples with a positive to negative ratio of 1:2 in our training dataset. This dataset was extracted from a small fraction of the total contact volume we handled that year. Table 1 shows some positive and negative examples from our training data.
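The pair construction described above can be sketched as follows. This is a minimal illustration rather than our production pipeline: the `(speaker, text)` turn representation, the function name `build_training_pairs`, and the choice of drawing exactly two negatives per positive (to match the 1:2 ratio) are assumptions made for the example.

```python
import random

def build_training_pairs(conversations, negatives_per_positive=2, seed=0):
    """Build (question, answer, label) samples from chat transcripts.

    conversations: list of conversations, each a list of (speaker, text)
    turns, where speaker is "customer" or "agent" (an assumed format).
    """
    rng = random.Random(seed)

    # Positive pairs: a customer turn ending in "?" and the agent turn after it.
    positives = []
    for turns in conversations:
        for (speaker, text), (next_speaker, next_text) in zip(turns, turns[1:]):
            is_question = speaker == "customer" and text.rstrip().endswith("?")
            if is_question and next_speaker == "agent":
                positives.append((text, next_text))

    samples = [(question, answer, 1) for question, answer in positives]
    if len(positives) < 2:
        return samples  # no other question to borrow a negative reply from

    # Negative pairs: same question, reply taken from a different question.
    for idx, (question, _) in enumerate(positives):
        for _ in range(negatives_per_positive):
            other = rng.randrange(len(positives) - 1)
            if other >= idx:
                other += 1  # skip the question's own reply
            samples.append((question, positives[other][1], 0))
    return samples
```

Note that a randomly drawn reply can occasionally be a valid answer to the question (e.g. a generic "yes"), so negatives built this way are noisy labels.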
|Customer Inquiry||Agent Response||Label|
|and will i be sent an email ?||yes , NAME .||1|
|can the ship speed be changed ?||yes , i ’ve already upgraded .||1|
|ok so what i have to do now ?||it ’s a good company to work for||0|
|can I ask for a resend ?||both the orders will be delivered to you today .||0|
3.2 Dual Encoder Network
Figure 1 shows the schematic for the network.
Our dual encoder network takes a customer question (e.g. “Will I receive a new tracking number?”) and an answer (e.g. “Yes we’ll have it emailed to you.”) as input. The question and answer are fed into two separate LSTM [9] encoders. The encoders generate low dimensional embeddings for the question and the answer. The embeddings are then concatenated and passed to a multi-layer perceptron (MLP) which outputs the probability that the question and answer match.
LSTM networks have been widely used for encoding sentences into low dimensional embeddings for various NLP-related tasks. Graves [10] showed that LSTMs achieved state-of-the-art performance on various sequential classification tasks. Presently, LSTM-based classifiers are standard baselines for text classification tasks. Sutskever et al. [11] applied LSTMs to create sentence embeddings for machine translation. Kiros et al. [12] and Wieting et al. [13] showed that LSTM-based embeddings can be used for transfer learning across diverse tasks, including semantic relatedness, paraphrase extraction, and information retrieval. In this work, we used LSTMs for encoding question and answer sentences, as shown in Figure 1. At each time step, the current word is mapped to an embedding and fed into the LSTM, updating its hidden state. The hidden state of the LSTM at the last time step is used as the embedding of the entire sentence.
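To make the data flow concrete, the forward pass of the dual encoder can be sketched in plain Python. This is an illustrative toy, not our implementation: the class names `TinyLSTMEncoder` and `DualEncoder` are hypothetical, the weights are random and untrained, bias terms are omitted, the dimensions are tiny, and the MLP has a single hidden layer rather than the configuration we actually used.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

class TinyLSTMEncoder:
    """Bias-free LSTM whose last hidden state is the sentence embedding."""

    def __init__(self, vocab, embed_dim, hidden_dim, rng):
        self.hidden_dim = hidden_dim
        self.embed = {w: [rng.uniform(-0.1, 0.1) for _ in range(embed_dim)]
                      for w in sorted(set(vocab))}
        self.unknown = [0.0] * embed_dim  # embedding for out-of-vocabulary words
        # One weight matrix per gate, applied to [word embedding; previous hidden state].
        self.w = {gate: random_matrix(hidden_dim, embed_dim + hidden_dim, rng)
                  for gate in ("input", "forget", "output", "candidate")}

    def encode(self, tokens):
        h = [0.0] * self.hidden_dim  # hidden state
        c = [0.0] * self.hidden_dim  # cell state
        for token in tokens:
            z = self.embed.get(token, self.unknown) + h
            i = [sigmoid(v) for v in matvec(self.w["input"], z)]
            f = [sigmoid(v) for v in matvec(self.w["forget"], z)]
            o = [sigmoid(v) for v in matvec(self.w["output"], z)]
            g = [math.tanh(v) for v in matvec(self.w["candidate"], z)]
            c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
            h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
        return h  # hidden state at the last time step

class DualEncoder:
    """Two separate encoders followed by an MLP over the concatenated embeddings."""

    def __init__(self, vocab, embed_dim=8, hidden_dim=8, mlp_dim=4, seed=0):
        rng = random.Random(seed)
        self.question_encoder = TinyLSTMEncoder(vocab, embed_dim, hidden_dim, rng)
        self.answer_encoder = TinyLSTMEncoder(vocab, embed_dim, hidden_dim, rng)
        self.mlp_hidden = random_matrix(mlp_dim, 2 * hidden_dim, rng)
        self.mlp_out = [rng.uniform(-0.1, 0.1) for _ in range(mlp_dim)]

    def match_probability(self, question_tokens, answer_tokens):
        joint = (self.question_encoder.encode(question_tokens)
                 + self.answer_encoder.encode(answer_tokens))  # list concatenation
        hidden = [math.tanh(v) for v in matvec(self.mlp_hidden, joint)]
        return sigmoid(sum(w * v for w, v in zip(self.mlp_out, hidden)))
```

Because the two encoders do not share weights, the answer encoder can be run on its own, which is what makes it possible to embed agent answers independently of any particular question.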
Among the hyperparameter combinations we tried, the lowest error on the development set was obtained with the hyperparameters listed in Table 2. We used Adam [15] to perform the stochastic optimization of the network parameters. The network has a total of 5 million parameters.
|Word embedding dim||512|
|LSTM output layer dim||512|
|# of MLP layers||3|
|MLP hidden dim.||3|
# of epochs
This model achieves accuracy on the development set, where the positive to negative ratio is also 1:2.
3.3 Response Template Extraction and Prediction
The other key idea of the system is the pool of pre-constructed answer templates. An ideal pool would contain all of the common agent responses on item delivery issues; if the appropriate answers are not in the pool, then the system cannot recommend a reasonable answer. On the other hand, the pool size can’t be too large due to the computational cost. While a pool of 10k randomly sampled agent answers will cover almost all common questions on item delivery issues at prediction time, the dual encoder network would have to score 10k question-answer pairs for each input customer question.
As the first step, we randomly sampled 400k agent answers from historical item delivery-related chats, and generated embeddings for them by using the trained answer encoder (Figure 1). Our analysis shows the embeddings are able to capture semantic similarity beyond simple vocabulary overlap (Table 3). The answer templates are selected with the help of k-means clustering. We applied mini-batch k-means with k-means++ initialization [16, 17] to cluster the 400k answer embeddings into 500 clusters. To form the template for each cluster, we take the text of the agent answer whose embedding is closest to the cluster center. Finally, we created a pool of 200 answer templates by human review.
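The extraction step (cluster the answer embeddings, then keep the answer nearest each cluster center) can be sketched as below. For brevity this sketch substitutes plain Lloyd's k-means with random initialization for the mini-batch k-means with k-means++ seeding that we actually used; the function names and the toy scale are assumptions, and the final human review step is not modeled.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_center(point, centers):
    return min(range(len(centers)), key=lambda j: squared_distance(point, centers[j]))

def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's k-means (a stand-in for mini-batch k-means)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        assignment = [nearest_center(p, centers) for p in points]
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    # Final assignment against the final centers.
    return centers, [nearest_center(p, centers) for p in points]

def extract_templates(answers, embeddings, k, seed=0):
    """Return, for each non-empty cluster, the answer closest to its center."""
    centers, assignment = kmeans(embeddings, k, seed=seed)
    templates = []
    for j, center in enumerate(centers):
        members = [i for i, a in enumerate(assignment) if a == j]
        if members:
            best = min(members, key=lambda i: squared_distance(embeddings[i], center))
            templates.append(answers[best])
    return templates
```

Choosing an actual agent answer (rather than decoding text from the centroid) guarantees that every candidate template is a fluent, human-written reply.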
At prediction time, the system will pair the customer question with every pre-constructed answer template, and use the trained dual encoder network to produce a measure of how well each answer matches the question. The system will then recommend the top-k answers to the agents ranked by this probability. Note that the answer embeddings for the full set of the answer templates are precomputed and stored for computational efficiency.
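A minimal sketch of this prediction step is shown below. The function `recommend_templates` and the pool layout are hypothetical, and `toy_match_probability` is only a stand-in for the trained MLP (a sigmoid of a dot product) so that the example runs on its own.

```python
import math

def toy_match_probability(question_embedding, answer_embedding):
    """Stand-in scorer (NOT the trained MLP): sigmoid of the dot product."""
    dot = sum(q * a for q, a in zip(question_embedding, answer_embedding))
    return 1.0 / (1.0 + math.exp(-dot))

def recommend_templates(question_embedding, template_pool, match_probability, top_k=3):
    """Score every template against the question and return the top-k.

    template_pool: list of (template_text, precomputed_answer_embedding);
    the answer embeddings are computed once, offline, and reused per query.
    """
    scored = [(match_probability(question_embedding, embedding), text)
              for text, embedding in template_pool]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

With precomputed answer embeddings, each incoming question requires one pass through the question encoder plus one MLP evaluation per template, so the per-query cost grows linearly with the pool size.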
A more straightforward approach to performing dialogue response generation would essentially be a supervised text classification task. Based on the customer intent predicted by the model, the system can present the customer with some pre-determined response. A common example of this would be pre-written dialogue combined with state tracking, which is used in IVR systems in travel and restaurant reservation applications [18]. However, even for something as simple as item delivery issues, the total number of possible types of customer questions can be close to 100-200, and extending such an approach to all of the domains in customer service (e.g. Kindle content, Amazon Instant Video issues, Prime subscription issues, etc.) would be impractical. In contrast, our approach has the benefit of not needing annotated data. The only data needed to train our model are customer questions and the agent answers after them, which already exist in our historical chat transcripts.
4.1 Selected Examples
The dual encoder produces two sets of embeddings: one for customer questions and another for agent answers. Table 3 shows the 5 nearest neighbors for a few selected questions in this embedding space. We also present the 5 nearest neighbors for some selected answers. In contrast to the nearest neighbors found with tf-idf vectorization, LSTM embeddings seem to capture more semantic similarity, since tf-idf is essentially based on search term overlap. For example, LSTM embeddings find various kinds of responses to customer greetings even when the search terms do not overlap very much (e.g. the model finds both “A pleasure to meet you too!” and “Glad to hear that!”), while tf-idf only finds ones that share tokens in common.
|Question or Answer||Encoder nearest neighbors||tf-idf nearest neighbors|
|ok you start the refund ?||(1) is the refund done ? (2) is the full refund done already ? (3) just to be clear , you have issued a refund for the original order ? (4) did the refund already go through ? (5) but you ’ve already done a refund ?||(1) should i start a new chat ? (2) do you want me to start then ? (3) i will try . how to i start ? (4) how would you like to start ? (5) and start the new return ?|
|a pleasure to meet you too !||(1) glad to hear that ! (2) it is our pleasure to assist our customers (3) great ! ! (4) my pleasure : ) (5) fantastic !||(1) nice to meet you ! (2) nice to meet you : ) (3) nice to meet you . (4) it ’s a pleasure . (5) it was a pleasure to meet and assist you :|
4.2 Answer Ranking
We compare the dual encoder network with the tf-idf baseline on an answer ranking task. For this task, we paired 10k randomly sampled customer questions with the correct answer and 9 randomly sampled incorrect answers. In this task, the “correct answer” is simply the agent response, which is not templatized. For each question, the 10 answers were ranked based on the probabilistic output of the dual encoder network. This ranking was compared to the one produced by the sum of the tf-idf term weights. We compare the mean reciprocal rank and precision@3 for both algorithms in Table 5. On all metrics, the dual encoder network significantly outperforms the tf-idf baseline. We also present examples of customer questions and the matching answers in Table 4.
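The two ranking metrics can be computed as in the sketch below. Since each question has exactly one correct answer among its 10 candidates, we assume here that precision@3 is counted as the fraction of questions whose correct answer appears in the top 3; that reading, the function names, and the convention that the correct answer sits at index 0 are all assumptions made for the example.

```python
def rank_of_correct(scores, correct_index=0):
    """1-based rank of the correct answer; ties are resolved in its favor."""
    return 1 + sum(1 for s in scores if s > scores[correct_index])

def mean_reciprocal_rank(all_scores, correct_index=0):
    ranks = [rank_of_correct(scores, correct_index) for scores in all_scores]
    return sum(1.0 / r for r in ranks) / len(ranks)

def hit_rate_at_k(all_scores, k=3, correct_index=0):
    """Fraction of questions whose correct answer is ranked in the top k."""
    hits = sum(1 for scores in all_scores
               if rank_of_correct(scores, correct_index) <= k)
    return hits / len(all_scores)
```

The same functions apply to both systems: `all_scores` holds, per question, either the dual encoder's match probabilities or the tf-idf scores for the candidate answers.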
|Question||Top 3 recommended answers|
|when will i receive my shoes ?||(1) it will be delivered DATE (2) you will get the items on DATE (3) you ’ll receive the package within 24 hours .|
|how can i use the gift card balance ?||(1) you can use it on your next purchase . (2) you can use after 2 hours . because it will take only 1-2 hours to credit in your account . (3) the refund will be reflected in your gift card balance in the next 1-3 hour|
|hi are you there ?||(1) yes I ’m here . (2) yes , i ’m checking it . (3) sorry for the delay in responding|
|can i cancel the order ?||(1) i can cancel it for you . (2) i ’ve cancelled it . (3) which items you need to cancel ?|
|why it has n’t been shipped yet ?||(1) i am glad to check the status of your order . (2) your order is already entered to the shipping process . (3) it is out of stock.|
The tf-idf baseline does not perform well on this task because even when the vocabulary overlap contains signal for retrieving the answer given the question (e.g. both the question and answer contain “gift card”), there are many cases where the answer to a question will not have any overlap at all (e.g. yes or no-type questions).
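The failure mode is easy to reproduce with a toy overlap scorer. The sketch below is one possible reading of "sum of the tf-idf term weights" (summing idf-weighted counts of the answer terms that also occur in the question); the exact weighting used by our baseline is not specified here, so the function names and formula should be treated as assumptions.

```python
import math
from collections import Counter

def inverse_document_frequency(documents):
    """idf(term) = log(N / df(term)) over whitespace-tokenized documents."""
    n = len(documents)
    df = Counter(term for doc in documents for term in set(doc.split()))
    return {term: math.log(n / count) for term, count in df.items()}

def tfidf_overlap_score(question, answer, idf):
    """Sum tf-idf weights of the answer terms that also appear in the question."""
    question_terms = set(question.split())
    answer_counts = Counter(answer.split())
    return sum(count * idf.get(term, 0.0)
               for term, count in answer_counts.items()
               if term in question_terms)
```

A correct reply such as "yes , NAME ." shares no tokens with "and will i be sent an email ?", so it scores exactly zero and cannot be ranked above unrelated candidates.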
4.3 End-to-end Human Evaluation
We recruited a rotating pool of agents to evaluate how well the system works end-to-end. We randomly selected 100 questions, and used our system and the tf-idf baseline to each recommend 3 answers (i.e. the top 3 most probable answers from the 200 answer templates). Each agent is asked to go through the question-answer pairs and assign a relevance score from 1 to 3 to each answer, with 3 being very relevant, 2 being somewhat relevant, and 1 being irrelevant. The evaluations are done on the same 100 questions for both algorithms.
Note that given a question there may be more than one appropriate answer. For example, answers like “I’m sorry you can’t” and “Yes, you can cancel it from your order page” are both very relevant answers to the question “Can I cancel the order since it’s late?”. Table 4 shows sample model-based answers to questions, and Figure 2 shows the human evaluation relevance score distribution for both our system and tf-idf. In general, our system produces more high-relevance recommendations than the tf-idf baseline: 70% of the model-selected templates are relevant to the question being asked.
Another metric we examined is how often at least one of the top three answers is “very relevant”. Among the 100 randomly sampled questions, our system is able to recommend at least one very relevant answer (score = 3) among the top 3 for 48 questions, while the tf-idf baseline does so for only 31 questions.
The average relevance score for the tf-idf baseline is 1.66 (± 0.10, 95% CI), whereas the average relevance score for the dual encoder model is 2.08 (± 0.09, 95% CI).
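For readers who wish to reproduce summaries of this kind from raw ratings, the sketch below computes a mean with a 95% interval and the number of questions with at least one very relevant answer. It assumes a normal approximation (1.96 standard errors) for the interval, which is an assumption for illustration and not necessarily the procedure used for the figures above; the function names are likewise hypothetical.

```python
import math
import statistics

def mean_with_ci95(scores):
    """Mean and half-width of a normal-approximation 95% confidence interval."""
    mean = statistics.mean(scores)
    half_width = 1.96 * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, half_width

def questions_with_very_relevant(scores_per_question, very_relevant=3):
    """Count questions where at least one recommended answer got the top score."""
    return sum(1 for scores in scores_per_question if very_relevant in scores)
```

Here `scores_per_question` would hold the three relevance ratings assigned to the recommendations for each question, and `mean_with_ci95` would be applied to the flattened list of all ratings for one system.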
5 Conclusion and Future Work
We have shown that a template-based approach to dialogue response generation works well in the customer service domain. We demonstrate that even in the absence of a fully automated dialogue system, it is nonetheless possible to select highly relevant answers to customer questions, which can translate into a reduction in the time spent per customer contact. Though we are currently testing this system for online chats, we believe that the template-based approach would extend naturally to a speech-driven system for telephone conversations.
There are a number of future directions we will pursue to make the system more complete. We would like to determine the correct polarity for a given template based on the state of a customer’s orders. For example, if a customer is inquiring about a shipment that has not yet arrived, we can reply either that the shipment is expected to be on-time or late from internal shipment data. We would also like to rerank the list of suggested templates using customer context, and expand the set of slots in our templates that can be filled automatically by internal systems.
The authors would like to thank Kevin Small for the helpful discussions and assistance in reviewing the drafts. We would also like to thank the customer service associates who helped us evaluate our system and provided domain-specific feedback on its design.
[1] O. Vinyals and Q. Le, “A neural conversational model,” ICML Deep Learning Workshop, 2015.
[2] I. Serban, A. Sordoni, Y. Bengio, A. Courville, and J. Pineau, “Hierarchical neural network generative models for movie dialogues,” arXiv, 2015.
[3] S. Sukhbaatar, A. Szlam, J. Weston, and R. Fergus, “Weakly supervised memory networks,” arXiv, 2015.
[4] J. Weston, A. Bordes, S. Chopra, A. M. Rush, B. van Merriënboer, A. Joulin, and T. Mikolov, “Towards AI-complete question answering: A set of prerequisite toy tasks,” arXiv, 2015.
[5] J. Weston, S. Chopra, and A. Bordes, “Memory networks,” arXiv, 2014.
[6] A. Kannan, K. Kurach, S. Ravi, T. Kaufmann, A. Tomkins, B. Miklos, G. Corrado, L. Lukács, M. Ganea, P. Young et al., “Smart reply: Automated response suggestion for email,” KDD, 2016.
[7] J. Bromley, J. W. Bentz, L. Bottou, I. Guyon, Y. LeCun, C. Moore, E. Säckinger, and R. Shah, “Signature verification using a ‘siamese’ time delay neural network,” IJPRAI, 1993.
[8] R. Lowe, N. Pow, I. Serban, and J. Pineau, “The Ubuntu dialogue corpus: A large dataset for research in unstructured multi-turn dialogue systems,” SIGDial, 2015.
[9] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, 1997.
[10] A. Graves, Supervised Sequence Labelling with Recurrent Neural Networks, ser. Studies in Computational Intelligence. Springer, 2012.
[11] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” NIPS, 2014.
[12] R. Kiros, Y. Zhu, R. R. Salakhutdinov, R. Zemel, R. Urtasun, A. Torralba, and S. Fidler, “Skip-thought vectors,” NIPS, 2015.
[13] J. Wieting, M. Bansal, K. Gimpel, and K. Livescu, “Towards universal paraphrastic sentence embeddings,” ICLR, 2015.
[14] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, and Y. Bengio, “Theano: A CPU and GPU math compiler in Python,” Proc. 9th Python in Science Conf., 2010.
[15] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” ICLR, 2015.
[16] D. Sculley, “Web-scale k-means clustering,” WWW, 2010.
[17] D. Arthur and S. Vassilvitskii, “k-means++: The advantages of careful seeding,” Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, 2007.
[18] J. Williams, A. Raux, D. Ramachandran, and A. Black, “The dialog state tracking challenge,” SIGDIAL, 2013.