The focus of the work presented in this paper is to develop models that can help a person reply to an email query. This is particularly relevant in customer care, where agents frequently have to reply to similar queries from different customers. Similar queries elicit similar replies, which share topic structures and vocabulary. Hence, providing suggestions to the agent with respect to the topic structure as well as to the content, in an interactive manner, can help in the effective composition of the email reply.
Typically, in customer care centres, in order to help the agents send replies to similar queries, the agents have access to a repository of canned responses. In response to a query from the customer, the agent searches the canned responses for an appropriate one, makes the necessary modifications to the text, fills in information and then sends the reply. This process is both inflexible and time-consuming, especially in cases where the customer query is slightly different from one of the expected queries.
In what we are proposing here, the goal is to provide topic and content suggestions to the agent in a non-intrusive manner. This means that the agent can ignore any suggestion that is irrelevant to him/her during the composition of the message.
There are two types of suggestions that we are targeting:
Topic prediction of the entire email response that needs to be composed.
This can be useful for automatically suggesting an appropriate canned response to the agent. If such a document is available, then this can help the agent in planning the response.
Topic prediction of the next sentence in the reply.
This can be useful for interactively presenting the topics for the next sentence (and the corresponding representative sentence or phrases), which the agent can choose or ignore while composing the reply.
We show that, with the methods described in this paper, topic prediction of the entire email response as well as of the next sentence can be made with reasonably high accuracy, making these predictions potentially useful in a scenario of interactive composition. We note, however, that these methods cannot lead to fully automatic composition because, like all techniques that rely only on such textual training data, they do not have access to the knowledge bases or similar external resources that the agent needs to consult in order to provide detailed and specific answers.
We evaluated our methods on a set of email exchanges in the Telecom domain in a customer care scenario. Three kinds of experiments were conducted:
Investigating the influence of the customer’s email on the word-level perplexity of the agent’s response, to validate and quantify our basic assumption that the context given by the customer’s query strongly conditions the agent’s response. The details are presented in Section 5.
Predicting the topics in the agent’s response given the customer’s query. See Section 6.1 for details.
Predicting the topic(s) in the agent’s next sentence given the context of his/her previous sentences (in addition to the customer’s query). See Section 6.2 for details.
2 Related Work
Most works addressing “email text analytics” in the past few years have been on classification and summarization. Email classification has proven to be useful in many standard applications such as spam detection and filtering high-priority messages. Research themes such as summarization and question answering came into focus because of the need to better interpret the overwhelming amount of emails/messages generated with the advent of email groups and discussion forums. One of the earlier contributions to email summarization was the work by [Muresan et al.2001], whereas [Rambow et al.2004] extended it to email threads; [Scheffer2004], on the other hand, proposed semi-supervised classification techniques for question answering in the context of such threads.
Some of the recent classification and summarization techniques have been based on “speech acts” or “dialog acts” such as proposing a meeting or requesting information [Searle1976, Bunt2011]. Several email studies including summarization of email threads [Oya and Carenini2014] or classification of emails [Cohen et al.2004] involve dialog-act based analysis. There has been very little work so far on customer-agent related email threads. Some of these works include identification of emotional emails related to customer dissatisfaction/frustration [Gupta et al.2013], as well as learning possible patterns/phrases for textual re-use in email responses [Lamontagne and Lapalme2004]. [Chen and Rudnicky2014] is a recent work that attempts to generate emails based on a two-stage process where a structural template is first produced and then a topic-specific language model is used for producing textual realizations of the different slots in the template (see also [Oh and Rudnicky2002] for an earlier work using a similar language model based approach).
There have been a number of works that have addressed the problem of discovering the latent structure of topics in the related area of spoken and chat conversations. Recently [Zhai and Williams2014] have addressed this problem using HMM-based methods handling dialog state transitions and topic content simultaneously; this work differs from ours in several respects. First, the nature of the data is not the same, short alternating conversational utterances in their case, large single email responses in ours. Second, the focus of [Zhai and Williams2014] is on the discovery of latent topics (and conversational states) based on existing dialog texts (speech transcripts or chats), using HMM-based techniques different from our LDA approach. Finally, in our case, in addition to discovering the latent structure of existing emails, we also actually predict which topics will likely be employed in the forthcoming agent’s response, which is not attempted in their paper.
The dataset that we used for our study is a collection of emails from the technical support team of a major telecom company in the UK. The dataset contains 54.7k email threads collected from the UK region between January 2013 and May 2014. Usually, an email thread is started by a customer reporting a problem or seeking information, followed by an agent’s response suggesting fixes or asking for more details to fix the reported problem. These threads continue until the problem is solved or the customer is satisfied with the agent’s response. An example email conversation between a customer and an agent is given in the left column of Table 3. Usually, customer emails are in free form while agent replies have a moderately systematic structure. On average, there are 8 emails in a thread.
For our study, we just considered the first two emails in a thread, namely the original customer’s query and the corresponding agent’s reply. We have limited our experiments to emails which have at least 10 words for customer emails and 20 words for agent replies. This resulted in 48.3k email threads out of which we used 80% for training and 20% for testing. The statistics concerning the number of documents, as well as their average length in tokens (words) and sentences are given in Table 1.
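The selection and split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: whitespace tokenization, a random shuffle before the 80/20 split, and the function name `filter_and_split` are all hypothetical, since the text does not say how words were counted or how the split was drawn.

```python
import random

def filter_and_split(threads, min_customer_words=10, min_agent_words=20,
                     train_fraction=0.8, seed=0):
    """Keep (customer_email, agent_reply) pairs that meet the length
    thresholds, then split them into training and test sets.

    Assumption: words are counted by whitespace splitting and the split
    is a seeded random shuffle; neither detail is given in the text.
    """
    kept = [
        (customer, agent)
        for customer, agent in threads
        if len(customer.split()) >= min_customer_words
        and len(agent.split()) >= min_agent_words
    ]
    rng = random.Random(seed)
    rng.shuffle(kept)
    cut = int(len(kept) * train_fraction)
    return kept[:cut], kept[cut:]
```

Fixing the seed keeps the split reproducible, which matters here because the silver-standard annotations are derived from models trained on the training portion only.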
4 Extracting Topics and Building a “Silver Standard” based on LDA
As we have explained, we focus on two main tasks:
Task T1: predict the likely overall topics of the whole agent’s reply based on the knowledge of the customer’s email ;
Task T2: when an agent is writing an email, predict the likely topics of the next sentence, based on the initial query and the additional knowledge of the previous sentences.
However, our training data are not annotated at the level of topics. In order to synthesize such an annotation, we use a popular unsupervised technique, Latent Dirichlet Allocation [Blei et al.2003], for modeling the topic space of various views of the collection. There are potentially three ways for extracting topics in our set of conversations. In the first setting, we keep customer and agent emails separated, identifying distinct topic models for each collection: a document is a customer email in the first collection, and an agent email in the second collection. We denote by $\mathcal{M}^{C}_{N}$ and $\mathcal{M}^{A}_{M}$ the topic models trained on customer emails (using $N$ topics) and agent emails (using $M$ topics) respectively. In the second setting, we concatenate customer and agent emails and identify a unique, common set of topics: so, here, we are considering a collection of documents, where each document is the concatenation of the customer’s email and of the agent’s reply. The resulting model will be noted as $\mathcal{M}^{CA}_{N}$, where $N$ is the number of topics in the model. In the third setting, instead of considering the whole email as a document, we take each sentence of the email as a separate document when building the topic models. The model is called $\mathcal{M}^{S}_{N}$. In our case, we built this model from the agent reply messages only, in order to specialize the sentence-level topic model on the agent “style” and vocabulary.
As the outcome of the topic extraction process on the training sets, we have both the topic distribution over the training documents and the word distribution for each topic. Once the model is trained, from these word distributions and the model priors, we can infer a topic distribution for each word of a given test document (this document being an entire email or a single sentence), and by aggregating these individual distributions, a global topic distribution can be derived for the whole test document. We consider these assignments of topic distributions as providing a silver-standard annotation of the documents (a proxy to a supervised “gold standard”). These ascribed annotations will subsequently be used for training our prediction models (using the silver-standard annotations of the training set) and evaluating them (using the silver-standard annotations of the test set). Additionally, these topic assignments will not only be used as labels to be predicted, but also as additional features to represent and summarize at the semantic level the content of the customer’s query (for Task T1) and the content of the previous agent sentences (for Task T2; see details in Section 6).
We denote a pair of emails of the form customer-query / agent-reply by ($C_i$, $A_i$) and the application of a topic model to $C_i$ (resp. $A_i$) by $\tau^{m}_{N}(C_i)$ (resp. $\tau^{m}_{N}(A_i)$), with $N$ the number of topics and $m$ the type of model ($m \in \{C, A, CA, S\}$), as described above. The quantity $\tau^{m}_{N}(\cdot)$ is a probability distribution over the $N$ topics, and we define the dominant topic of a test sample as the topic with the highest probability in this distribution.
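As a small illustration of the two operations just described, the sketch below aggregates per-word topic distributions into a document-level distribution and extracts the dominant topic. The aggregation rule is an assumption: the text only says the per-word distributions are "aggregated", and a plain average is used here as the simplest choice. Function names are hypothetical.

```python
def aggregate_topic_distribution(word_topic_dists):
    """Average per-word topic distributions into one document-level
    distribution. Averaging is an assumed aggregation rule."""
    n_topics = len(word_topic_dists[0])
    totals = [0.0] * n_topics
    for dist in word_topic_dists:
        for topic, prob in enumerate(dist):
            totals[topic] += prob
    return [total / len(word_topic_dists) for total in totals]

def dominant_topic(dist):
    """Index of the topic with the highest probability."""
    return max(range(len(dist)), key=lambda topic: dist[topic])
```

Because each per-word distribution sums to one, their average also sums to one, so the result can be used directly as a silver-standard label.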
In Table 2, we present a sample of topics learned using the Sentences model ($\mathcal{M}^{S}_{N}$). In the table, we describe each topic with its top ranked words and phrases.
| Topic Label | Top words | Top phrases |
|---|---|---|
| Contact | support, technical, agent, team, write, contact | write team, support agent |
| Feedback | contact, enquiry, leave, close, enquiry | answer query, follow link, close enquiry, leave feedback |
| Reset | reset, factory, datum, setting, tap, back, erase, storage | master reset, perform factory reset |
| Repair | repair, device, book, send, centre, email, back, warranty | repair center, book device repair |
| USB | usb, connect, cable, pc, charger, device | disk drive, default connection type, sync manager |
| Cache/App | clear, application, cache, app, datum, setting, delete | clear cache, manage application, cache partition |
| OS/Installation | update, software, system, setting, message, operating | system software update, installation error |
| SD Card | card, sd, account, save, tap, sim, import, people, application | sd card, google account, transfer contact, export sim card |
| Liability | return, charge, liable, un-repaired, device, quote, dispose | hold liable, free charge, choose pay, return handset |
| User Account | tap, account, enter, password, email, setting, step, set, require | username password, email account, secure credentials |
| Damage | repair, charge, brand, return, economic, unrepaired | brand charge, liquid-damaged accessory, return immediately |
| SIM/SD Card | card, data, sim, sd, phone, store, device, service, online | test sim card, data remove, insert sim card |
| Settings | tap, scroll, screen, setting, home, icon, notification | home screen, screen tap, notification bar |
| Email | Topic 1 | Topic 2 | Topic 3 |
|---|---|---|---|
| Customer Query: My mobile x fell out of my pocket and the screen cracked completely, I was wondering whether I am eligible for repair as it is still under the 24 month warranty? | screen crack, hardware operation, smash screen, display, htc, lcd | month ago, phone month, buy htc, year ago, phone warranty, contract, contact | phone work, month ago, time, problem, htc, issue, week, day, back |
| Agent Response: Thank you for contacting HTC regarding your HTC One X. My name is John and I am a Technical Support Agent for the HTC Email Team. I’m sorry to hear that you are experiencing difficulties with your device. I understand that the screen is broken. Unfortunately this is not covered by warranty, so if you wish to have it repaired, you will have to pay a quote. The quotation will be made by the repair centre, and it is based on an examination of the handset that is done when it arrives in the repair centre. This is why we are unable to provide you with the amount it would cost to have the display replaced. I hope that I have given you enough information to solve your query. If this is not the case, please do not hesitate to contact us again. If this answer has solved your query, and you have no further questions, you can close this ticket by clicking on the link shown below. On closing the ticket, you will receive an invitation to participate in our Customer Satisfaction Survey. This will only take 1 minute of your time. I wish you a pleasant day. | contact htc, write team, support agent, htc regard, technical support, contact htc regard | repair centre, vary depend exchange rate, physical damage, minor liquid, cover warranty | leave feedback, close ticket, contact quickly, receive feedback |
In Table 3, we present an example of a query/reply email pair, as well as the corresponding Top-3 highest probability topics and their most representative words/phrases, both for the customer and the agent parts ( model, with ).
Finally, it should be noted that the Sentences model ($\mathcal{M}^{S}_{N}$) gives rise to more peaked topic distributions compared to the other models. We consider that a distribution is peaked when the probability of its dominant topic is more than 0.5 (that is, more than the aggregated probability of all competing topics) and at least twice as large as that of the second topic. Over the test set, more than 90% of sentences exhibit a peaked topic distribution (see Figure 1). This motivates the use of the dominant topic instead of the whole distribution when solving Task T2, as explained in Section 6.2.
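The peakedness criterion stated above translates directly into a small check. The sketch below is illustrative only; the function names are hypothetical, and it assumes each distribution has at least two topics.

```python
def is_peaked(dist):
    """True when the dominant topic has probability above 0.5 and at
    least twice the probability of the second most likely topic."""
    ordered = sorted(dist, reverse=True)
    top, second = ordered[0], ordered[1]
    return top > 0.5 and top >= 2 * second

def peaked_fraction(dists):
    """Share of distributions in a collection that are peaked."""
    return sum(is_peaked(dist) for dist in dists) / len(dists)
```

Both conditions are needed: a distribution such as (0.55, 0.40, 0.05) passes the 0.5 threshold but fails the factor-of-two test, so it is not counted as peaked.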
5 Influence of the Customer Query on the word-level perplexity
We examine the influence of knowing the context of the customer’s query on the content of the agent’s email. In order to do that, we consider the set of test agent emails and compare the perplexity of the language model based on $\mathcal{M}^{A}_{N}$ versus the one identified with $\mathcal{M}^{CA}_{N}$. Recall that the model $\mathcal{M}^{A}_{N}$ infers the probability distribution ($\tau^{A}_{N}$) over topics by exploiting only the agent emails from the training set, while the model $\mathcal{M}^{CA}_{N}$ infers a probability distribution ($\tau^{CA}_{N}$) over topics by exploiting both the customer queries and the agent replies in the training set.
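The perplexity scores are computed using the following formulas:
$$\mathrm{Perplexity}(D^{A} \mid \mathcal{M}^{A}_{N}) = \exp\left(-\frac{\sum_{i=1}^{T} \log p(A_i \mid \mathcal{M}^{A}_{N})}{W^{A}}\right), \qquad \mathrm{Perplexity}(D^{A} \mid D^{C}, \mathcal{M}^{CA}_{N}) = \exp\left(-\frac{\sum_{i=1}^{T} \log \frac{p(C_i, A_i \mid \mathcal{M}^{CA}_{N})}{p(C_i \mid \mathcal{M}^{CA}_{N})}}{W^{A}}\right) \tag{1}$$
In these equations, the test set is $D = \{(C_i, A_i)\}_{i=1}^{T}$, with $D^{C} = \{C_i\}_{i=1}^{T}$ and $D^{A} = \{A_i\}_{i=1}^{T}$, $T$ is the number of agent emails in the test set and $W^{A}$ is the total number of words in $D^{A}$. The term $p(A_i \mid \mathcal{M}^{A}_{N})$ (resp. $p(C_i, A_i \mid \mathcal{M}^{CA}_{N})$, $p(C_i \mid \mathcal{M}^{CA}_{N})$) is the likelihood of the sequence of words in $A_i$ (resp. in the concatenation of $C_i$ and $A_i$, in $C_i$), as given by the LDA model $\mathcal{M}^{A}_{N}$ (resp. $\mathcal{M}^{CA}_{N}$, $\mathcal{M}^{CA}_{N}$).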
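For readers who want to reproduce this kind of comparison, the standard definition of perplexity (the exponential of the negative average per-word log-likelihood) can be sketched as follows. The per-document log-likelihoods are assumed to come from an already trained topic model and are passed in as plain numbers. The helper `conditional_log_likelihood` reflects one assumed way of factoring in the customer context, namely subtracting the log-likelihood of the customer's words from that of the concatenated pair; it is an illustration, not a statement of the exact procedure used in the experiments.

```python
import math

def perplexity(log_likelihoods, total_words):
    """exp(-(sum of per-document log-likelihoods) / total word count)."""
    return math.exp(-sum(log_likelihoods) / total_words)

def conditional_log_likelihood(joint_ll, context_ll):
    """log p(A | C) = log p(C, A) - log p(C).

    Assumption: the context is handled by this ratio of likelihoods
    computed under the same model.
    """
    return joint_ll - context_ll
```

A quick sanity check of the definition: a model that assigns every word a probability of 1/4 has a perplexity of exactly 4, whatever the length of the test set.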
In Figure 2, we present the perplexity scores of the two models. We see that the model which uses the customer’s email as context has lower perplexity scores, as could be expected intuitively. This indicates that a generative LDA model has the potential to use the context to directly improve the prediction of the words in the agent’s response. However, in this paper, instead of directly trying to predict words (which is strongly connected with the design of the user interface, for instance in the form of semi-automatic word completion), we will focus on the different, but related, problem of predicting the most relevant topics in a given context. As topics could be rather easily associated with canned responses (sentences, paragraphs or whole emails), predicting the most relevant topics amounts to recommending the most adequate responses.
6 Predicting Relevant Topics of the Agent Response
6.1 Topic prediction for the overall agent’s email
In this section, we focus on Task T1, namely predicting the topic distribution $\tau^{CA}_{N}(A_i)$ of the agent response using only the contextual information: the customer query $C_i$ and its topic distribution $\tau^{CA}_{N}(C_i)$. The choice of the model $\mathcal{M}^{CA}_{N}$ rather than $\mathcal{M}^{A}_{N}$ is motivated by the considerations described in the previous section (Section 5). Note that the $\mathcal{M}^{CA}_{N}$ model is used both to compute synthetic semantic features ($\tau^{CA}_{N}(C_i)$) and to provide a silver standard for the topic prediction ($\tau^{CA}_{N}(A_i)$). The predictor can be written as:
$$\hat{\tau}^{CA}_{N}(A_i) = f\left(\omega(C_i), \tau^{CA}_{N}(C_i)\right) \tag{2}$$
where $\omega(C_i)$ represents the bag-of-words of the customer query.
Learning the mapping $f$ shown in Equation 2 could be considered as a structured output learning problem. For solving it, we use an extension of logistic regression that can be trained with soft labels (the silver-standard annotations given by $\tau^{CA}_{N}(A_i)$), adopting the Kullback-Leibler divergence between $\tau^{CA}_{N}(A_i)$ and the predicted distribution $\hat{\tau}^{CA}_{N}(A_i)$ as loss function and using a simple stochastic gradient descent algorithm to optimize this loss function. Recall that the silver-standard labels are used both for building the predictor (from the training set) and for assessing the quality of the predicted topic distribution on the test set.
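To make the training procedure concrete, here is a minimal pure-Python sketch of a softmax (multinomial logistic) regression trained on soft labels with the Kullback-Leibler divergence as loss and plain stochastic gradient descent. Everything beyond that description is an assumption made for illustration: the function names, the learning rate, the number of epochs, the absence of regularization, and the dense list-based feature vectors (a real bag-of-words model would use sparse vectors). The gradient of the KL loss with respect to the score of topic k is simply the predicted probability minus the target probability.

```python
import math
import random

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted), skipping zero-probability target entries."""
    return sum(t * math.log((t + eps) / (p + eps))
               for t, p in zip(target, predicted) if t > 0)

def predict(weights, x):
    """weights[k] is the weight vector of topic k; x is a feature vector."""
    return softmax([sum(w * xi for w, xi in zip(wk, x)) for wk in weights])

def train_soft_label_regression(xs, targets, n_topics,
                                lr=0.5, epochs=200, seed=0):
    """SGD on the KL loss between soft targets and softmax outputs."""
    rng = random.Random(seed)
    n_features = len(xs[0])
    weights = [[0.0] * n_features for _ in range(n_topics)]
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            probs = predict(weights, xs[i])
            for k in range(n_topics):
                grad = probs[k] - targets[i][k]  # d KL / d score_k
                for f in range(n_features):
                    weights[k][f] -= lr * grad * xs[i][f]
    return weights
```

In this setting the input vector would be the concatenation of the bag-of-words features and the topic distribution of the customer query, and the target would be the silver-standard topic distribution of the agent reply.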
6.2 Topic prediction for each sentence of the agent’s email
To solve our second task (Task T2), namely predicting the topic distribution of the next sentence of an agent’s response, we use the words of the customer query $\omega(C_i)$, its topic distribution $\tau(C_i)$, the words of the current sentence $\omega(A_i^{j})$ and the topic distribution of the current sentence $\tau^{S}_{N}(A_i^{j})$. Note that we are making some kind of Markovian assumption for the agent-side content: we consider that the current sentence and its topic distribution are sufficient to predict the topic of the next sentence, given the customer query context. Noting $A_i^{j}$ the $j$-th sentence in the agent email $A_i$, we then build the predictor as:
$$\hat{\tau}^{S}_{N}(A_i^{j+1}) = f\left(\omega(C_i), \tau(C_i), \omega(A_i^{j}), \tau^{S}_{N}(A_i^{j}), j\right) \tag{3}$$
where $j$ is the sentence position (index), and where $\omega(C_i)$ and $\omega(A_i^{j})$ represent the bag-of-words of the customer email and of the current agent sentence, respectively.
In practice, as we mentioned in Section 4, the topic distributions for sentences using the $\mathcal{M}^{S}_{N}$ models are highly peaked at the “dominant” topic. So, it makes sense to use $d(A_i^{j})$, the dominant topic of the distribution $\tau^{S}_{N}(A_i^{j})$, instead of the whole distribution. In the same vein, instead of trying to predict the whole topic distribution of the next sentence, it is reasonable to predict only what will be its dominant topic. So, a variant of Equation 3 is:
$$\hat{d}(A_i^{j+1}) = g\left(\omega(C_i), \tau(C_i), \omega(A_i^{j}), d(A_i^{j}), j\right) \tag{4}$$
We use the standard multiclass logistic regression for modeling the function shown in Equation 4. To be more precise, we build $N$ different predictors, one for each possible value of the dominant topic of the current sentence $d(A_i^{j})$. Moreover, for $j$=0, i.e. for the first sentence, we build a family of simpler “degenerated” models, in the following form:
$$\hat{d}(A_i^{1}) = g_{0}\left(\omega(C_i), \tau(C_i)\right) \tag{5}$$
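The routing idea described above (one predictor per value of the current dominant topic, plus a separate one for the first sentence) can be illustrated with a deliberately simplified stand-in. The sketch below replaces each logistic regression with a frequency table of next dominant topics, so it ignores the bag-of-words and customer-query features entirely; it only shows how predictions are dispatched on the current dominant topic under the Markovian assumption. The names `FIRST`, `train_routed_predictors` and `predict_next` are hypothetical.

```python
from collections import Counter, defaultdict

FIRST = None  # hypothetical marker: no current sentence yet (first-sentence case)

def train_routed_predictors(emails):
    """emails: list of dominant-topic sequences, one per agent email.

    Builds one frequency table per current dominant topic. This is a
    count-based stand-in for the per-topic logistic regressions, not
    the model used in the experiments.
    """
    counts = defaultdict(Counter)
    for topics in emails:
        previous = FIRST
        for topic in topics:
            counts[previous][topic] += 1
            previous = topic
    return counts

def predict_next(counts, current_topic, k=1):
    """Return the k most frequent next topics after current_topic."""
    return [topic for topic, _ in counts[current_topic].most_common(k)]
```

Swapping each frequency table for a classifier that also sees the word and topic features of the context recovers the structure of the approach described in the text.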
In this section, we present experimental results, showing how our proposed methods perform in predicting topics of the agent’s response. For learning the LDA topic models (as described in Section 4), we have used the MALLET toolkit [McCallum2002], with the standard (default) setting.
We evaluate our methods using three metrics:
Bhattacharya coefficient [Bhattacharya1943]
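Here, we evaluate how close the predicted topic distribution is to the silver-standard topic distribution. For Task T1, we compare the predicted distribution $\hat{\tau}^{CA}_{N}(A_i)$ with the silver-standard distribution $\tau^{CA}_{N}(A_i)$ for each agent email of the test set. For Task T2, we compare the predicted distribution $\hat{\tau}^{S}_{N}(A_i^{j+1})$ with the silver-standard distribution $\tau^{S}_{N}(A_i^{j+1})$ for each sentence of the agent emails of the test set. We have also computed more commonly used measures such as KL divergence, and they correlate strongly with our Bhattacharya coefficient scores.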
Text ranking measure (for Task T1)
Instead of directly comparing the probability distributions, we also try to measure how useful the predicted probability distribution is in discriminating the correct agent’s response from a set of $R-1$ randomly introduced responses from the training set, giving $R$ candidate responses in total. The possible answers are ranked according to the Bhattacharya coefficient between their silver-standard topic distribution and the one predicted from the customer’s query email following Equation 2. We consider here the average Recall@1 measure, i.e. the proportion of test cases where the correct response is ranked first.
Dominant topic prediction accuracy (for Task T2)
Here, we examine whether the dominant topic of the silver-standard annotation for the next sentence belongs to the top-K predicted topics. A high accuracy is necessary for ensuring effective topical suggestions for interactive response composition, which is the primary motivation of our work.
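The first of these metrics is short enough to state in code. For two discrete distributions p and q over the same topics, the coefficient (usually spelled Bhattacharyya in the literature) is the sum over topics of the square root of the product of the two probabilities. The function name below is an illustrative choice.

```python
import math

def bhattacharyya_coefficient(p, q):
    """Sum over topics of sqrt(p_k * q_k); lies between 0 and 1 for
    probability distributions, and equals 1 when p and q are identical."""
    return sum(math.sqrt(pk * qk) for pk, qk in zip(p, q))
```

The coefficient is symmetric in its arguments and is 0 when the two distributions put their mass on disjoint sets of topics.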
7.1 Topic prediction of agent email
In Figure 3, we present the average Bhattacharya coefficient over all test emails for different numbers of topics ($N$) using the $\mathcal{M}^{CA}_{N}$ model. (For our purposes, the Bhattacharya coefficient is preferable to other measures of distribution “distance” such as KL-divergence, because it is upper-bounded by 1 and allows easier comparison across distributions of varying dimensions.) This figure illustrates the trade-off between the difficulty of the task and the usefulness of the model: a higher number of topics corresponds to a more fine-grained analysis of the content, with potentially better predictive usefulness, but at the same time it is harder to reach a given level of performance (as measured by the Bhattacharya coefficient) than with a small number of topics. For comparison, we also show a baseline where the prediction of the agent’s email topic distribution is simply a copy of the customer’s email topic distribution, with a much lower performance.
Figure 4 gives the evolution of the average Recall@1 measure for $R=5$ candidate responses when the number of topics is changed. In this case, the baseline (a simple random guess) would have given an average Recall@1 equal to 20%. The best performance (Recall@1 = 52.5%) is reached when $N$=50.
We have also assessed the average Recall@1 score with a varying number of candidate responses $R$ and with $N$ fixed to 50: see Figure 5. We see that the text ranking based on the topic distribution prediction is always much higher than the baseline scores.
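The text ranking measure used in this section can be sketched as follows: each candidate response is scored by the Bhattacharya coefficient between its silver-standard topic distribution and the predicted one, and a test case counts as a hit when the correct response scores strictly higher than every distractor. Treating ties as misses is an assumption of this sketch, as are the function names; the random draw of distractors is left to the caller.

```python
import math

def bc(p, q):
    """Bhattacharya coefficient between two discrete distributions."""
    return sum(math.sqrt(pk * qk) for pk, qk in zip(p, q))

def correct_is_ranked_first(predicted, correct, distractors):
    """True when the correct response outscores every distractor.
    Ties are counted as misses (an assumption)."""
    correct_score = bc(predicted, correct)
    return all(correct_score > bc(predicted, d) for d in distractors)

def recall_at_1(examples):
    """examples: list of (predicted, correct, distractors) triples."""
    hits = sum(correct_is_ranked_first(p, c, ds) for p, c, ds in examples)
    return hits / len(examples)
```

With one correct response and four distractors per test case, a random ranking would place the correct response first one time in five, which matches the 20% baseline quoted above.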
7.2 Topic Prediction of Next Sentence
We compare the dominant topic prediction accuracy of the proposed approach with other baseline approaches. The baseline approaches that we examined are:
Uniform: we assign a uniform distribution of topics to every sentence in the test set and compare it with the silver standard. In other words, we perform a completely random ranking of the topics.
Average: we assign the same topic distribution to every sentence of the test set; this topic distribution is the global average topic distribution and can be directly derived from the hyper-parameter $\alpha$ in LDA models by “normalizing” the values of the $\alpha$ vector (note that, in our case, this hyper-parameter is learned from the training data).
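Both baselines are simple enough to write down directly. The sketch below assumes the learned Dirichlet hyper-parameter is available as a list of positive numbers, one per topic; the function names and the example values are illustrative only.

```python
def uniform_baseline(n_topics):
    """Same probability for every topic."""
    return [1.0 / n_topics] * n_topics

def average_baseline(alpha):
    """Normalize the Dirichlet hyper-parameter vector so it sums to 1.
    This is the mean of a Dirichlet distribution with parameter alpha."""
    total = sum(alpha)
    return [a / total for a in alpha]
```

For instance, a hyper-parameter vector of (2, 1, 1) yields the average distribution (0.5, 0.25, 0.25), which is then used unchanged as the prediction for every test sentence.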
Table 4 gives the dominant topic prediction accuracy for the “next topic prediction” task, with $K$=1 (i.e. the relative number of times where the predicted dominant topic corresponds to the dominant topic given by the silver-standard annotation). These values are averaged over all sentences of the test set, irrespective of their position. For this, we have used the predictors given by Equation 4 and fixed the number of topics to 50. Note that the standard multi-class logistic regression outputs a probability distribution over the topics, so that we can compute the Bhattacharya coefficient between this predicted distribution and the silver-standard distribution as well: this information is also given in Table 4. To see the relative impact of each type of input features, the table gives the performance for specific subsets of features: the notation $\omega(e)$ represents the bag-of-words feature vector of text entity $e$, while $\tau(e)$ represents the topic distribution vector of text entity $e$.
We see that the topic prediction accuracy for the next sentence is 0.471, which is much higher than both baselines (Uniform and Average). We can also see that the topic distribution vectors of the current sentence and the customer’s query give a higher prediction accuracy (0.450) than the bag-of-words features of the context (0.416). When the position of the next sentence is used as a feature, the results improve, indicating that certain topics are more likely to occur at particular positions in the email than at others. The same trend is also seen when we compute the Bhattacharya coefficient (BC), where the predicted topic distribution has a much higher BC than both the Uniform and the Average distributions.
These results (presented in Table 4) illustrate that about half of the predicted topics match the actual topic (based on silver-standard annotations), which is a significant accuracy in an interactive composition scenario. In Table 5, we show the dominant topic prediction accuracies in the top-$K$ predicted topics, with different values of $K$. It can be seen that, in an interactive composition scenario where the agent is presented with 5 recommended topics, the agent will be able to recognize the relevant topic in more than 80% of the cases. When the agent is presented with 2 recommended topics, the agent can choose the right topic in 62.5% of the cases. These results are obtained through a combination of all the features, with the topic distribution vectors of the context (current sentence and customer’s query) playing an important role.
|Features||Dominant Topic in top_K predictions|
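The top-K accuracy reported here can be computed as in the following sketch: for each test sentence, the K most probable topics of the predicted distribution are retrieved, and the prediction counts as correct when the silver-standard dominant topic is among them. Function names are illustrative, and ties between equally probable topics are broken by topic index, which is an arbitrary choice.

```python
def top_k_topics(dist, k):
    """Indices of the k most probable topics, most probable first."""
    ranked = sorted(range(len(dist)), key=lambda topic: dist[topic], reverse=True)
    return ranked[:k]

def top_k_accuracy(predicted_dists, gold_dominant_topics, k):
    """Share of cases where the gold dominant topic is in the top k."""
    hits = sum(gold in top_k_topics(dist, k)
               for dist, gold in zip(predicted_dists, gold_dominant_topics))
    return hits / len(gold_dominant_topics)
```

By construction the accuracy can only grow with K, which is why presenting a handful of suggestions to the agent is more forgiving than presenting a single one.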
8 Conclusion and Future Work
We have presented new unsupervised models for discovering discourse structures of email replies in customer-agent oriented email systems, and have evaluated their predictive ability on a real-world Contact-Center email dataset, at the global and local levels. Our experiments indicate the potential of these techniques for an interactive scenario where the agent is guided in the selection of whole emails or individual sentences based on predicted topics.
Still, numerous interesting avenues could be investigated further. One natural extension of our work would be to consider Multi-dimensional LDA models in the sense of [Paul and Dredze2013], which are able to detect topics along different semantic aspects, which would be very useful for disentangling several dimensions that we currently do not distinguish, such as: which issue is being talked about, what is the device concerned, at what stage of a conversation we are, and so on.
Another extension would be to examine prediction models that go beyond the Markovian assumption, by exploiting topic dependencies at a longer distance than one sentence.
The intended application of this work requires developing a user interface, which has implications on the models (level of granularity, number of topics, …), as well as going beyond the identification of topics towards the interactive generation of actual texts. Ultimately, we want to be able to automatically recognize what is the most relevant response in a given context, not only because it was often given by agents in similar contexts, but because this response will lead to the fastest and most efficient way of solving the customer’s problem: we should then couple and solve jointly the topic modelling/prediction task with a sequential optimization problem.
- [Bhattacharya1943] Anil Bhattacharya. 1943. On a measure of divergence between two statistical populations defined by their probability distributions. Bulletin of the Calcutta Mathematical Society, 35:99–109.
- [Blei et al.2003] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.
- [Bunt2011] Harry Bunt. 2011. The semantics of dialogue acts. In Proceedings of the Ninth International Conference on Computational Semantics, pages 1–13. Association for Computational Linguistics.
- [Chen and Rudnicky2014] Yun-Nung Chen and Alexander Rudnicky. 2014. Two-stage stochastic natural language generation for email synthesis by modeling sender style and topic structure. In Proceedings of the 8th International Natural Language Generation Conference (INLG), pages 152–156, Philadelphia, Pennsylvania, U.S.A., June. Association for Computational Linguistics.
- [Cohen et al.2004] William W. Cohen, Vitor R. Carvalho, and Tom M. Mitchell. 2004. Learning to classify email into “Speech Acts”. In EMNLP, pages 309–316.
- [Gupta et al.2013] Narendra Gupta, Mazin Gilbert, and Giuseppe Di Fabbrizio. 2013. Emotion detection in email customer care. Computational Intelligence, 29(3):489–505.
- [Lamontagne and Lapalme2004] Luc Lamontagne and Guy Lapalme. 2004. Textual reuse for email response. In Advances in Case-Based Reasoning, pages 242–256. Springer.
- [McCallum2002] Andrew Kachites McCallum. 2002. Mallet: A machine learning for language toolkit. http://mallet.cs.umass.edu.
- [Muresan et al.2001] Smaranda Muresan, Evelyne Tzoukermann, and Judith L Klavans. 2001. Combining linguistic and machine learning techniques for email summarization. In Proceedings of the 2001 workshop on Computational Natural Language Learning-Volume 7, page 19. Association for Computational Linguistics.
- [Oh and Rudnicky2002] Alice H. Oh and Alexander I. Rudnicky. 2002. Stochastic language generation for spoken dialogue systems. Computer Speech and Language, 16:387–407.
- [Oya and Carenini2014] Tatsuro Oya and Giuseppe Carenini. 2014. Extractive summarization and dialogue act modeling on email threads: An integrated probabilistic approach. In Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), pages 133–140, Philadelphia, PA, U.S.A., June. Association for Computational Linguistics.
- [Paul and Dredze2013] Michael J. Paul and Mark Dredze. 2013. Drug extraction from the web: Summarizing drug experiences with multi-dimensional topic models. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 168–178. Association for Computational Linguistics.
- [Rambow et al.2004] Owen Rambow, Lokesh Shrestha, John Chen, and Chirsty Lauridsen. 2004. Summarizing email threads. In Proceedings of HLT-NAACL 2004: Short Papers, pages 105–108. Association for Computational Linguistics.
- [Scheffer2004] Tobias Scheffer. 2004. Email answering assistance by semi-supervised text classification. Intelligent Data Analysis, 8(5):481–493.
- [Searle1976] John R Searle. 1976. A classification of illocutionary acts. Language in society, 5(01):1–23.
- [Zhai and Williams2014] Ke Zhai and Jason D. Williams. 2014. Discovering latent structure in task-oriented dialogues. In ACL (1), pages 36–46.