Multiple topic identification in telephone conversations

12/21/2018 · Xavier Bost et al., Université d'Avignon et des Pays de Vaucluse

This paper deals with the automatic analysis of conversations between a customer and an agent in the call centre of a customer care service. The purpose of the analysis is to hypothesize themes about the problems and complaints discussed in a conversation. Themes are defined by the topics of the application documentation. A conversation may contain mentions that are irrelevant for the application purpose, as well as multiple themes whose mentions may be interleaved in portions of the conversation that cannot be sharply delimited. Two methods are proposed for multiple theme hypothesization. One of them is based on a cosine similarity measure using a bag of features extracted from the entire conversation. The other method introduces the concept of thematic density distributed around specific word positions in a conversation. In addition to automatically selected words, word bigrams with possible gaps between successive words are also considered and selected. Experimental results show that the proposed methods outperform support vector machines on the same data. Furthermore, using the theme skeleton of a conversation, from which thematic densities are derived, it will be possible to extract components of an automatic conversation report to be used for improving the service performance.

Index Terms: multi-topic audio document classification, human/human conversation analysis, speech analytics, distance bigrams




1 Introduction

There has been a growing interest in recent years in speech technology capabilities for monitoring telephone services. In order to provide a high level of efficiency and user satisfaction, there is a consensus on the importance of improving systems for analysing human/human conversations in order to obtain reports on customer problems and the way an agent has solved them. With these reports, useful statistics can be obtained on problem types and facts, the efficiency of problem solutions, and user attitude and satisfaction.

The application considered in this paper deals with the automatic analysis of dialogues between a call centre agent who can solve problems defined by the application domain documentation and a customer whose behaviour is unpredictable. The customer is expected to seek information and/or formulate complaints about the Paris transportation system and services.

The application documentation is a concise description of problem themes and the basic facts of each theme, considered as a conversation topic. The most important speech analytics for the application are the themes, even if facts and related complementary information are also useful.

A conversation may contain more than one semantically related theme. Depending on the type of relation, some themes discussed in a conversation may be irrelevant for the application task. For example, a customer may inquire about an object lost on a means of transportation that was late. In such a case, the loss is a much more relevant theme than the traffic state. A component of the automatic analysis is an Automatic Speech Recognition (asr) system that makes recognition errors. In the above example, failure to detect the mention of the traffic state can be tolerated if the mention of the loss fact is correctly hypothesized, but in general all the mentioned themes must be taken into account.

This paper focuses on the detection of relevant themes in a service dialogue. Conversations to be analysed may be about one or more domain themes. Detecting the possibility of having multiple themes and identifying them is important for estimating the proportions of customer problems involving related topics. Different themes can be mentioned in disjoint discourse segments. In some cases, mentions of different themes may coexist in short segments or even in a single sentence. Even when different themes are mentioned in non-overlapping discourse segments, segment boundaries may be difficult to estimate because of the errors introduced by the asr system and the imprecise knowledge about the language structures used by casual users in the considered real-world situation. In spite of these difficulties, it is worth considering the possibility of extracting suitable features for theme detection, given that a conversation is likely to contain several mentions of a theme.

The paper is structured as follows. Section 2 discusses related work. Section 3 introduces the application domain and the features used for theme hypothesization. The first of two approaches to multiple theme identification is introduced in Section 4; it is based on the automatic estimation of decision parameters applied to global cosine similarity measures. The second approach, described in Section 5, introduces the concept of a dialogue skeleton, based on which thematic densities of features can be computed and used as soft detections of possibly overlapping locations in a dialogue where one or more themes are mentioned. Decision making with this approach is also described. Experimental results are presented in Section 6.

2 Related work

Human/human spoken conversation analyses have been recently reviewed in [1]. Methods for topic identification in audio documents are reviewed in [2]. Solutions for the detection of conversation segments expressing different topics have been proposed in many publications, recently reviewed in [3]. Interesting solutions have been proposed for linear models of non-hierarchical segmentation. Some approaches propose inference methods for selecting segmentation points at the local maxima of cohesion functions. Some functions use features extracted in each conversation sentence or in a window including a few sentences. Some search methods detect cohesion in a conversation using language models and some others consider hidden topics shared across documents.

A critical review of lexical cohesion can be found in [4], whose authors propose an unsupervised approach for hypothesizing segmentation points using cue phrases automatically extracted from unlabelled data. An evaluation of coarse-grain segmentation can be found in [5].

Multi-label classification is discussed in [6] and [7], mostly for large collections of text documents. Particularly interesting is a technique called creation, consisting of creating a new composite label for each association of multiple labels assigned to an instance.

This paper extends concepts found in the recent literature by introducing a version of word bigrams that may have a gap of one word between the two words. These features are used in decision strategies based on a constrained application of the cosine similarity and on a new definition of soft theme density location in a conversation.

3 Application task and features

The application task concerns a customer care service (ccs). The purpose of the application is to monitor the effectiveness of a call centre by evaluating the proportions of problem items and solutions. Application-relevant information is described in the application requirements, which focus on an informal description of useful speech analytics that are not completely defined. The requirements and other application documentation essentially describe the types of problems a ccs agent is entitled to solve.

This paper proposes a new approach for automatically annotating dialogues between an agent and a customer with one or more application theme labels belonging to the set defined as follows:

{ itinerary, lost and found, time schedules, transportation card, traffic state, fine, special offers }.

Given a pair (d, c_k), where d, a spoken dialogue in the corpus D, is described by a vector V_d of features, and c_k is a class corresponding to a theme t_k described by a vector V_k of features of the same type as V_d, two classification methods, indicated in the following as cos and dens, are proposed for multiple theme classification.

With the purpose of increasing the performance of automatic multiple theme hypothesization, bigrams were added to the lexicon of 7217 words, with the possibility of having also distance bigrams made of pairs of words separated by a distance of at most two words. With this addition, the feature set grows to 160433 features. In order to avoid data sparseness effects, a reduced feature set was obtained by selecting features based on their purity and coverage. The purity of a feature f is defined with the Gini criterion as follows:

G(f) = Σ_k ( N_k(f) / N(f) )²

where N(f) is the number of dialogues of the train set containing term f and N_k(f) is the number of dialogues of the train set containing term f and annotated with theme t_k.

A score s_k(f) is introduced for feature f in the collection of train-set dialogues discussing theme t_k. It is computed as follows:

s_k(f) = ( N_k(f) / N(f) ) · G(f) · idf(f)

where idf(f) is the inverse document frequency of feature f.
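As an illustration, the feature extraction and selection steps just described can be sketched as follows. The function names and the exact way purity, theme-conditional frequency and idf are combined in the final score are assumptions of this sketch, since the section does not spell them out.

```python
from math import log

def extract_features(words, max_gap=1):
    """Unigrams plus bigrams whose two words are separated by at most
    `max_gap` intervening words (gap 0 = ordinary bigram, so a maximum
    distance of two words corresponds to max_gap=1)."""
    feats = list(words)
    for i, w in enumerate(words):
        for gap in range(0, max_gap + 1):
            j = i + 1 + gap
            if j < len(words):
                feats.append((w, words[j]))
    return feats

def gini_purity(n_theme, n_total):
    """Gini purity of a feature: sum over themes of the squared fraction
    of the feature's dialogues carrying each theme label."""
    return sum((n / n_total) ** 2 for n in n_theme.values())

def score(feature, theme, doc_freq, theme_freq, n_dialogues):
    """Hypothetical theme score: theme-conditional frequency weighted by
    purity and inverse document frequency (the paper's exact combination
    is not reproduced here)."""
    n_total = doc_freq[feature]              # dialogues containing the feature
    idf = log(n_dialogues / n_total)         # inverse document frequency
    purity = gini_purity(theme_freq[feature], n_total)
    return (theme_freq[feature][theme] / n_total) * purity * idf
```

A feature that occurs almost exclusively in dialogues of one theme has purity close to 1, so rare, theme-specific unigrams and distance bigrams are favoured by this kind of selection.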

4 Using a global cosine similarity measure

The classical cosine measure of similarity between the two vectors V_d and V_k is defined as:

S(d, c_k) = ( Σ_f x_{d,f} · x_{k,f} ) / ( ||V_d|| · ||V_k|| )

where x_{d,f} is the score of feature f in dialogue d and x_{k,f} the corresponding score in the vector describing theme class c_k.

Let Γ(d) be the set of themes discussed in dialogue d. A first decision rule for automatically annotating dialogue d with a theme class label c_k is:

c_k ∈ Γ(d)  if  S(d, c_k) ≥ α · S*(d)

where S*(d) = max_j S(d, c_j), and α ≤ 1 is an empirical parameter whose value is estimated by experiments on the development set.

If the score S*(d) of the best hypothesis is too low, then the application of the above rule is not reliable. To overcome this problem, the following additional rule is introduced:

Γ(d) = ∅  if  S*(d) < β

where β is another parameter whose value is estimated by experiments on the development set.
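The two decision rules can be sketched as follows; the default values of `alpha` and `beta` are placeholders, not the values estimated in the paper.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two sparse feature-score vectors (dicts
    mapping feature -> score)."""
    dot = sum(u[f] * v[f] for f in u if f in v)
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def hypothesize_themes(dialogue_vec, theme_vecs, alpha=0.8, beta=0.1):
    """Rule 1: keep every theme whose similarity is within a proportion
    `alpha` of the best one. Rule 2: if the best similarity itself is
    below `beta`, hypothesize no theme at all."""
    sims = {t: cosine(dialogue_vec, v) for t, v in theme_vecs.items()}
    best = max(sims.values())
    if best < beta:                       # rule 2: unreliable decision
        return set()
    return {t for t, s in sims.items() if s >= alpha * best}
```

With `alpha = 1` the rule degenerates to monolabel classification; smaller values let secondary themes through.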

4.1 Parameter estimation

The values of parameters α and β have been estimated using 20 subsets s_i of 98 dialogues each, belonging to the development set and selected with the same proportion of single and multiple theme dialogues as in the development set. In order to estimate the optimal values α* and β* of these two parameters, the following decision rule has been applied:

(α*, β*) = argmax_{α, β} (1/20) Σ_i F_i(α, β)

where F_i(α, β) is the F-score (defined in the following subsection) computed using model cos with the estimated values of α and β on the subset s_i.

In this way, the optimal value of α, the proportion of the highest score required for assigning additional themes to a dialogue, and the optimal value of β, the minimum score required for a reliable decision, have been estimated.
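A minimal sketch of this estimation, assuming it is carried out as a grid search over candidate values and a scoring callback `f_score(subset, alpha, beta)` that runs the cosine-based classifier on one development subset (both assumptions of this sketch):

```python
def tune(subsets, f_score, alphas, betas):
    """Return the (alpha, beta) pair maximizing the mean F-score over
    the development subsets."""
    def mean_f(a, b):
        return sum(f_score(s, a, b) for s in subsets) / len(subsets)
    return max(((a, b) for a in alphas for b in betas),
               key=lambda ab: mean_f(*ab))
```

Averaging over 20 balanced subsets rather than tuning on the whole development set makes the chosen pair less sensitive to a few atypical dialogues.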

4.2 Performance measures

The proposed approaches have been evaluated following procedures discussed in [7], with measures used in Information Retrieval (ir) and with accuracy, as defined in the following for a corpus D and a decision strategy Γ.

Of particular importance is the F-score: based on this measure, and as mentioned in subsection 6.3, it is possible to find the best trade-off between precision and recall by rejecting some of the dialogues.

Recall R:

R(Γ) = (1/|D|) Σ_{d ∈ D} |Y_d ∩ Γ(d)| / |Y_d|

where Y_d indicates the set of theme labels annotated for conversation d.

Precision P:

P(Γ) = (1/|D|) Σ_{d ∈ D} |Y_d ∩ Γ(d)| / |Γ(d)|

F-score F:

F(Γ) = (1/|D|) Σ_{d ∈ D} 2 · |Y_d ∩ Γ(d)| / ( |Y_d| + |Γ(d)| )

Accuracy A:

A(Γ) = (1/|D|) Σ_{d ∈ D} |Y_d ∩ Γ(d)| / |Y_d ∪ Γ(d)|

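The four example-based measures of [7] can be sketched as follows; in this sketch a dialogue with an empty hypothesis set simply contributes zero to the precision sum.

```python
def multilabel_scores(gold, pred):
    """Example-based multi-label measures: `gold` and `pred` are parallel
    lists where each element is the set of theme labels of one dialogue
    (annotated and hypothesized, respectively)."""
    n = len(gold)
    p = sum(len(g & z) / len(z) for g, z in zip(gold, pred) if z) / n
    r = sum(len(g & z) / len(g) for g, z in zip(gold, pred)) / n
    f = sum(2 * len(g & z) / (len(g) + len(z)) for g, z in zip(gold, pred)) / n
    acc = sum(len(g & z) / len(g | z) for g, z in zip(gold, pred)) / n
    return {"precision": p, "recall": r, "f_score": f, "accuracy": acc}
```

Note that accuracy, computed on the union of annotated and hypothesized labels, is the strictest of the four measures, which is consistent with it being the lowest row in Tables 1 and 2.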
5 Automatic annotation based on thematic densities

5.1 Thematic density

The contribution C_k(i) to theme t_k of the features at the i-th location in a dialogue is:

C_k(i) = Σ_{f ∈ F_i} s_k(f)

where F_i is the set made of the i-th word in a conversation and the bigrams associated with it.

A thematic density D_k(i) of theme t_k is associated with position i and is defined as follows:

D_k(i) = Σ_j C_k(j) · λ^(−|i−j|)

where λ is a parameter of sensitivity to proximity whose value is estimated by experiments on the development set, and λ ≥ 1.
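A minimal sketch of this density computation, assuming the per-position contributions have already been computed; `lam` plays the role of the proximity-sensitivity parameter, and any normalization the paper may apply is omitted here.

```python
def thematic_density(contribs, lam=1.05):
    """contribs[t][j] = contribution of theme t's features at word
    position j. Returns densities[t][i], where every position's
    contribution is discounted by lam ** (-|i - j|): with lam = 1 all
    positions count equally (the global limit), while a large lam keeps
    only the local context."""
    densities = {}
    for theme, c in contribs.items():
        n = len(c)
        densities[theme] = [
            sum(c[j] * lam ** (-abs(i - j)) for j in range(n))
            for i in range(n)
        ]
    return densities
```

With `lam = 1` the density is constant over the dialogue, which corresponds to the horizontal lines discussed for fig. 1 and to the global features of Section 4.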

5.2 Dialogue skeleton

The theme density at a specific dialogue location makes it possible to derive a thematic skeleton of a dialogue.

Figure 1 shows the skeleton of a dialogue obtained from an automatic transcription for two values of the proximity parameter λ (fig. 1-a and fig. 1-b). The figures plot the thematic density as a function of the dialogue location, measured in number of words preceding the location. The conversation is about a request of bus schedules (indicated as horr) and a fare (indicated as tarf). Three functions are plotted: two for these themes and a third for the theme itinerary (indicated as itnr).

Figure 1: Thematic densities as a function of location in a dialogue skeleton. Densities are plotted for two values of λ (fig. 1-a and fig. 1-b) and three themes: schedule (indicated as horr), fare (indicated as tarf) and itinerary (indicated as itnr).

As an example, an excerpt from the dialogue obtained from manual transcriptions is reported in the following. The dialogue positions corresponding to each turn are in brackets.

  • Customer [82–100]: I would like to know if buses start running in the early morning.

  • Agent [101–118]: First start is at 5:45 at Opera square.

  • (…)

  • Agent [304–314]: It will be there at 8:48.

  • (…)

  • Agent [442–470]: eh no…. There are specific fares for the airport. I'll give you the amount…. Fare is 9 euros 10.

  • Customer [471–496]: 9 euros 10, should we pay 9 euros 10 in cash?

With a large value of λ (fig. 1-b), distant contexts tend to be neglected. In this case the decision may be adversely affected by isolated features that are not relevant for theme hypothesization. For example, the expression there ("là-bas" in French) in turn [304–314] tends to show more evidence for itinerary, with a peak of density at location 300, while it is used here in a request of schedule.

Local context is more appropriately taken into account with an intermediate value of λ (fig. 1-a), with the result of reducing the relevance of the itinerary hypothesis supported just by the word there. When λ = 1 (horizontal lines in fig. 1-a), close contexts tend to be neglected, giving more importance to the global features used in the approach introduced in Section 4.

In conclusion, with a well-suited value of λ (here 1.05), thematic coherence tends to make decisions more accurate.

5.3 Decision making

A theme is considered as discussed in a dialogue if it has the dominant density at some location of the dialogue (rule (13)) and if the sum of its densities at the positions where it is dominant exceeds an empirically determined threshold (rule (14)):

t*(i) = argmax_k D_k(i)    (13)

t_k ∈ Γ(d)  if  Σ_{i : t*(i) = t_k} D_k(i) > γ    (14)

where γ is a parameter whose value is estimated by experiments on the development set, and t*(i) is the theme of dominant density at the i-th position in the dialogue.
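The two rules can be sketched as follows; `gamma` is a placeholder for the empirically determined threshold, not a value from the paper.

```python
def decide(densities, gamma=0.5):
    """densities[t][i] = thematic density of theme t at word position i.
    Rule (13): find the theme of dominant density at every position.
    Rule (14): keep the themes whose summed density over the positions
    they dominate exceeds gamma."""
    n = len(next(iter(densities.values())))
    dominant = [max(densities, key=lambda t: densities[t][i]) for i in range(n)]
    themes = set()
    for t in densities:
        mass = sum(densities[t][i] for i in range(n) if dominant[i] == t)
        if mass > gamma:
            themes.add(t)
    return themes
```

A theme mentioned only in a short, isolated segment dominates few positions and accumulates little mass, so the threshold filters out incidental mentions such as the there example above.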

6 Experiments

6.1 Experimental framework

Experiments have been performed using an asr system described in [8]. It is based on triphone acoustic hidden Markov models (hmm) with mixtures of Gaussians belonging to a set of 230000 distributions. Model parameters were estimated with maximum a posteriori probability (map) adaptation using 150 hours of telephone-bandwidth speech from the train set. A corpus of 1658 telephone conversations was collected at the call centre of the public transportation service in Paris. The corpus is split into a train, a development and a test set containing respectively 884, 196 and 578 conversations. A 3-gram language model (lm) was obtained by adapting a basic lm with the transcriptions of the train set. An initial set of experiments performed with this system resulted in an overall word error rate (wer) on the test set of 57% (52% for agents and 62% for users). These high error rates are mainly due to speech disfluencies and to adverse acoustic environments for some dialogues, for example when users call from train stations or noisy streets with mobile phones. Furthermore, the signal of some sentences is saturated or of low intensity due to the distance between speakers and phones.

The annotation with possible multiple themes of the development and test corpora has been performed in a batch process by maximizing the agreement between three annotators.

It is important to notice that the training corpus dialogues have been labelled on the fly by the agents, with the constraint of choosing one and only one theme corresponding to the main customer concern. When this was not clear, the annotation was based on the problem expressed at the beginning of the conversation. With such a procedure it is not possible to create bi-labels as described in [6] and [7] and mentioned in Section 2.

6.2 Evaluation

For the sake of comparison, the results obtained with the proposed classification approaches have been compared with the results obtained with a support vector machine (svm) using the same features (unigrams and bigrams with a possible gap of one word) and a linear kernel. For every theme t_k, a binary classifier is defined, and every pair (d, t_k) is associated with the score computed by this classifier. The candidate theme hypotheses for a conversation are those whose score is in an interval corresponding to an empirically determined proportion of the highest one. In addition to that, the hypothesis with the highest score must be above a threshold empirically determined for this purpose.
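The decision layer applied on top of the per-theme svm scores can be sketched as follows; the binary classifiers themselves are not reproduced, and `rel` and `thresh` are placeholders for the empirically determined values.

```python
def svm_multilabel_decision(scores, rel=0.9, thresh=0.0):
    """Turn per-theme binary-classifier scores into a multi-label
    decision: keep the themes whose score lies within a proportion `rel`
    of the best one, provided the best score itself exceeds `thresh`."""
    best = max(scores.values())
    if best <= thresh:          # no theme scored confidently enough
        return set()
    return {t for t, s in scores.items() if s >= rel * best}
```

This mirrors the two-threshold scheme used for the cosine measure, which keeps the comparison between the svm baseline and the proposed methods fair.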

Results obtained with the cosine measure and with the theme density are reported in Table 1 for the development set and in Table 2 for the test set. man and asr respectively indicate manual transcriptions and automatic transcriptions obtained with the most likely sequence of word hypotheses generated by the asr system.

dev          |        man        |        asr
             | svm   cos.  dens. | svm   cos.  dens.
Accuracy     | 0.77  0.85  0.85  | 0.70  0.80  0.81
Precision    | 0.88  0.92  0.94  | 0.79  0.87  0.90
Recall       | 0.85  0.92  0.88  | 0.81  0.89  0.86
F-score      | 0.86  0.92  0.91  | 0.80  0.88  0.88

Table 1: Results obtained with the svm, the cosine measure and the theme density for the development set.

test         |        man        |        asr
             | svm   cos.  dens. | svm   cos.  dens.
Accuracy     | 0.74  0.78  0.78  | 0.64  0.71  0.71
Precision    | 0.85  0.86  0.86  | 0.72  0.79  0.79
Recall       | 0.83  0.87  0.85  | 0.77  0.83  0.80
F-score      | 0.84  0.87  0.85  | 0.75  0.81  0.80

Table 2: Results obtained with the svm, the cosine measure and the theme density for the test set. For the cosine-based method, the estimated confidence interval is 2.74% for manual transcriptions and 3% for the asr output.

6.3 Results analysis

The development set was collected in the same time period (fall) as the train set, while the test set was collected in the summer. The difference in the results obtained with the test and the development sets can be explained in part by considering the frequency of different events in the two time periods (e.g. strikes in the fall and specific maintenance works in the summer).

For all the methods used, the strategies for hypothesizing themes in addition to the dominant one give better results than monolabel categorization, in which only the dominant theme is hypothesized. For example, using the relative value of the cosine measure on the dev set, the F-score improves from 0.88 to 0.92 with the manual transcriptions and from 0.85 to 0.88 with the automatic transcriptions. The same improvements are observed with the test set.

Using the development set to infer a rejection rule based on close scores between the first two hypotheses, an F-score of 0.83 is obtained on the automatic transcriptions of the test set, with a rejection rate close to 10%, the rate of disagreement between human annotators.
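A minimal sketch of such a rejection rule, assuming per-theme scores are available for a dialogue; `margin` is a placeholder for the value inferred from the development set.

```python
def reject(scores, margin=0.05):
    """Reject a dialogue (leave it unclassified) when its two best theme
    scores are too close to call."""
    top, second = sorted(scores.values(), reverse=True)[:2]
    return top - second < margin
```

Rejected dialogues could then be routed to a human reviewer, trading a small loss of coverage for the improved F-score reported above.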

7 Conclusion and future work

Features have been proposed for the hypothesization of one or more themes mentioned in a conversation between a call centre agent and a calling customer. They are sets of words, bigrams and distant bigrams automatically selected in the application domain.

Two approaches have been proposed for multiple theme hypothesization. The first is based on a cosine similarity measure applied to the features extracted from a conversation; the other is based on a new definition of theme density obtained by considering a conversation skeleton. The approaches have been evaluated and shown to outperform an svm classifier using the same features and data.

Future research will include the search for confidence measures suitable for the multi-topic task and the use of conversation skeletons for extracting short reports on agent/customer dialogues. From these reports, proportions of speech analytics will be extracted and used for monitoring frequencies, importance and solution rates for different types of facts and problems.

8 Acknowledgements

This work is supported by the French National Research Agency (anr), Project decoda, and the French business clusters Cap Digital and scs. The corpus has been provided by the ratp (Paris public transport company).


  • [1] Tur, G. and Hakkani-Tur D., “Human/human conversation understanding”, ch. 9 of (G. Tur and R. De Mori), pp. 228–255, J. Wiley, 2011.
  • [2] Hazen, T. J., “Topic identification”, ch. 12 of (G. Tur and R. De Mori), pp. 319–356, J. Wiley, 2011.
  • [3] Purver M., “Topic segmentation”, ch. 11 of (G. Tur and R. De Mori), pp. 290–317, J. Wiley, 2011.
  • [4] Eisenstein, J. and Barzilay, R., “Bayesian Unsupervised Topic Segmentation”, EMNLP, 2008, pp. 334–343.
  • [5] Niekrasz, J. and Moore, J. D., “Unbiased discourse segmentation evaluation”, 2010 ieee Spoken Language Technology workshop, Berkeley, CA, Dec 2010, pp. 43–48.
  • [6] Carvalho, A. and Freitas, A., “A tutorial on multi-label classification techniques”, Foundations of Computational Intelligence, Vol. 5 of Studies in Computational Intelligence 205, pp. 177–195, Springer, September 2009.
  • [7] Tsoumakas, G. and Katakis, I., “Multi-label classification: an overview”, International Journal of Data Warehousing and Mining, 3(3): pp. 1–13, 2007.
  • [8] Linares, G., Nocera, P., Massonie, D. and Matrouf, D., “The lia speech recognition system: from 10xRT to 1xRT”, International Conference on Text, Speech and Dialogue, Pilsen, Czech Republic, 2007, Lecture Notes in Computer Science, volume 4629/2007, pp. 302–308.
  • [9] Maza, B., El-Beze, M., Linares, G. and De Mori, R., “On the use of linguistic features in an automatic system for speech analytics of telephone conversations”, Interspeech 2011, pp. 2049–2052.
  • [10] Koco, S., Capponi, C. and Bechet, F., “Applying multiview learning algorithms to human-human conversation classification”, Interspeech 2012, Portland.
  • [11] Tur, G. and De Mori, R. Eds, “Spoken Language Understanding: Systems for Extracting Semantic Information from Speech”, J. Wiley, 2011.