Combining Textual Content and Structure to Improve Dialog Similarity

02/20/2018 ∙ by Ana Paula Appel, et al. ∙ IBM

Chatbots, taking advantage of the success of messaging apps and recent advances in Artificial Intelligence, have become very popular, from helping businesses improve customer services to chatting with users for the sake of conversation and engagement (celebrity or personal bots). However, developing and improving a chatbot requires understanding the data generated by its users. Dialog data differs in nature from simple question-and-answer interactions: context and temporal properties (turn order) create a different understanding of such data. In this paper, we propose a novel metric to compute dialog similarity based not only on the textual content but also on information related to the dialog structure. Our experimental results on the Switchboard dataset show that using evidence from both textual content and dialog structure leads to more accurate results than using either measure in isolation.




1. Introduction

Chatbots have become more and more popular over the past years, owing for instance to the success of text messaging apps, advances in natural language processing techniques, and advances in scalability. Solutions based on chatbots allow industry services to be available to their customers 24 hours a day, cutting expenses and automating many of their business processes (forbes, ). Chatbots can be task-oriented, designed for a particular task and gathering information from the user to help complete it (e.g., booking airline flights), or they can be systems designed for extended conversations, mimicking human-human interaction (e.g., psychological counseling) (chatbot_book, ). A by-product of this is the increasing volume of data, in particular textual dialogs, generated by systems making use of chatbots. There is demand for approaches to better understand such data, either to gather insights from the users or to improve the chatbots, since data analytics for chatbots is still an area that needs more investigation.

In this sense, one kind of analysis that could help in understanding chatbot data is finding similarity between dialogs. Grouping similar dialogs can not only give insights about the customers that interact with the chatbot but also help in the creation of corpus-based chatbots, which are built by mining human-human or human-machine conversations (serban-2015-survey, ).

Finding similar objects is a traditional task in several areas, such as content-based image retrieval (liu2007survey, ), text similarity identification (hotho2005brief, ), code plagiarism detection (maurer2006plagiarism, ), song identification (typke2005survey, ), and so on. For dialogs, such similarity-based retrieval could be useful to recover dialogs that are similar either in content or in structure, and the identification of similar sets of dialogs can be valuable to better understand the data and further improve the content or structure of the chatbots.

Most of the work on dialog aims at studying specific details of the dialog or the users with the goal of, to cite a few, identifying genders (kose2007mining, ), classifying speech acts (carpenter2011role, ; ferschke2012behind, ; schabus2016data, ), identifying socio-cultural phenomena (strzalkowski2010modeling, ), understanding user interaction (khan2002mining, ), studying linguistic coordination (danescu2011chameleons, ), or predicting emotions (litman2004predicting, ; maeireizo2004co, ). Another front can be observed in the efforts to build corpora for dialog systems (serban2015survey, ), dialog generation (walker2012annotated, ; serban2017multiresolution, ; serban2017hierarchical, ), dialog generation evaluation (liu2016not, ), and dialog control (williams2016end, ), and to build specific corpora such as Ubuntu (lowe2015ubuntu, ) or to model Twitter as a dialog conversation (ritter2010unsupervised, ). However, none of the above-cited works addresses dialog similarity. Much of the work on dialog focuses on spoken dialog, to fill gaps (mesnil2017, ), control the dialog (williams2017hybrid, ; bordes2016learning, ; wen2016network, ), and measure success (su2015learning, ).

One way to compare two dialogs is to consider solely the textual content of the dialog and then use some standard text similarity technique. Text similarity is a well-known task in mining large volumes of text data, and several techniques have been developed (metzler2007similarity, ; gomaa2013survey, ). Nevertheless, data from chatbots or dialogs also present a structure that goes beyond traditional text data. Chatbot data presents specific characteristics such as the order and time interval of the utterances, the user interactions, speech acts, turns, and so on. For that reason, computing dialog similarity becomes a more complex task than traditional text similarity.

In this paper, we propose a similarity metric based on both textual content and structural properties of dialogs. To evaluate our proposed metric, we perform experiments on the well-known Switchboard dataset (SWITCHBOARD92, ). To the best of our knowledge, we are the first to address the problem of dialog similarity; thus, in this paper, we present:

  • A new metric for detecting similarity between dialogs;

  • New metrics that combine metrics from dialog structure and metrics from the text to better represent dialogs;

  • Experiments showing that these new metrics work better for dialogs than text alone;

  • A discussion on the challenges of this new area and the next steps to be taken.

The remainder of this paper is organized as follows. The next section presents our proposed similarity metric. Section 3 presents our evaluation methodology and reports the results of our experimental evaluation. Finally, Section 4 concludes the paper and discusses future work.

2. Dialog Similarity

Two dialogs can be similar considering different aspects. For instance, they can be similar in terms of textual content, that is, they address the same topics and/or the same entities, and so forth. On the other hand, dialogs can also be similar in terms of structural features, such as the length of the dialog or the presence/absence of some identified pattern of interaction.

Our method combines evidence from both textual content and structural features to compute the similarity between two dialogs. Considering these types of evidence as complementary, we compute the similarity of content and structure individually and combine their similarity results with the Borda Count method at the ranking level, so that we avoid the burden of dealing with different scales of the features or distances if they were combined differently.

2.1. Textual-based Similarity

For computing the features related to similarity based on textual content, we simply convert the dialog to a free-form text, where each dialog utterance becomes a new sentence. Then, we apply standard text mining techniques, such as stop word removal and the computation of the Term Frequency-Inverse Document Frequency (TF-IDF) measure (Weiss2012, ), defined as:

tfidf(w_i, d) = tf(w_i, d) × log(N / n_i)

where w_i is the i-th word in the dictionary, tf(w_i, d) is its frequency in document d, N is the number of documents in the dataset, and n_i is the number of documents in which the word appears.

Once those features are computed, the cosine similarity between the TF-IDF vectors u and v of two dialogs is computed as:

cos(u, v) = (u · v) / (||u|| ||v||)

where cos(u, v) ∈ [0, 1], since TF-IDF weights are non-negative.
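As an illustration, the TF-IDF and cosine computations above can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation; it assumes each dialog is already tokenized into lowercase words with stop words removed, and the function names are ours.

```python
import math
from collections import Counter

def tfidf_vectors(dialogs):
    """Build a sparse TF-IDF vector (dict: word -> weight) per dialog.

    `dialogs` is a list of token lists; idf uses the plain log(N / n_i)
    form given in the text.
    """
    n = len(dialogs)
    df = Counter()                      # n_i: number of dialogs containing each word
    for words in dialogs:
        df.update(set(words))
    vectors = []
    for words in dialogs:
        tf = Counter(words)             # tf(w_i, d): frequency in this dialog
        vectors.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return vectors

def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors; 0.0 for empty vectors."""
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Because the weights are non-negative, this similarity falls in [0, 1]; a corresponding distance can be taken as one minus this value.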

2.2. Dialog Structural Similarity

T1 - A: I would like to go from New York to Boston
T2 - B: Where do you want to go from?
T3 - A: I want to go from New York to Boston
T4 - B: What is your origin and destination of your trip?
T5 - A: I want to travel from New York to Boston
T6 - B: Please, give your origin and destination.
Table 1. Example of a cycle in a dialog between two participants A and B.

For the similarity based on structural features of the dialog, we consider information about the turns, where each turn is the contribution of one dialog participant. We also compute metrics related to conversation cycles. A cycle occurs when one participant of the conversation has to rephrase his/her idea until it is clear to the other participant or he/she gives up on that topic. For example, consider the conversation to book a plane ticket shown in Table 1.

The conversation above repeats the same intent, which is related to booking a ticket. The participant has to rephrase the request three times; therefore, there is a cycle of size four in the dialog. Our approach estimates the size of a cycle by computing the similarity of two turns; in particular, we use cosine distance over the words in each turn.

Using the measures described above, we compute the following metrics for each dialog:

  • Total number of turns;

  • Average number of words per turn;

  • Total number of cycles;

  • Average number of turns in a cycle.
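The four structural metrics, including the cycle heuristic based on turn-to-turn cosine similarity, can be sketched as follows. This is our own illustrative reading of the method: we treat a cycle as a maximal run of consecutive turns whose pairwise cosine similarity over word counts exceeds a threshold, and both the threshold value and the function names are assumptions, not taken from the paper.

```python
import math
from collections import Counter

def turn_similarity(a, b):
    """Cosine similarity over raw word counts of two turns (token lists)."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def structural_metrics(turns, threshold=0.5):
    """Compute the four structural metrics for one dialog.

    `turns` is a list of token lists, one per turn; `threshold` is an
    assumed cut-off on turn-to-turn similarity for cycle detection.
    """
    cycles = []               # lengths of detected cycles
    run = 1                   # length of the current run of similar turns
    for prev, cur in zip(turns, turns[1:]):
        if turn_similarity(prev, cur) >= threshold:
            run += 1
        else:
            if run > 1:
                cycles.append(run)
            run = 1
    if run > 1:
        cycles.append(run)
    return {
        "turns": len(turns),
        "avg_words_per_turn": sum(len(t) for t in turns) / len(turns),
        "cycles": len(cycles),
        "avg_turns_per_cycle": sum(cycles) / len(cycles) if cycles else 0.0,
    }
```

Applied to a tokenized dialog, this returns the four per-dialog metrics listed above as a dictionary.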

T1 - A: You know right away what you want.
T2 - B: I know right away what we, what we want.
T3 - B: I keep hearing about it.
T4 - B: I keep hearing the advertisements of it.
T5 - A: You can’t find them.
T6 - A: You can’t find them.
T7 - B: and, i thought i would really miss that.
T8 - A: I would, too,
Table 2. Dialog extract between participants A and B.
Total number of turns: 8
Average number of words per turn: 5.9
Total number of cycles: 3
Average number of turns in cycle: 2
Table 3. Metrics calculated from dialog of Table 2.

Table 2 shows a dialog piece extracted from the Switchboard dataset (SWITCHBOARD92, ) while Table 3 summarizes our metrics for that example.

2.3. Combining Textual and Structural Metrics

The combination of both types of metrics (i.e., textual content and dialog structure) is done using the Borda Count method, by considering the ranking of similarity for a given dialog. In greater detail, for each of the metrics, we first compute the distance matrix D, an n × n matrix, where n is the total number of dialogs and position D(i, j) contains the distance between dialogs i and j. This matrix is computed both for the textual content metrics (D_T) and for the structural metrics (D_S). Then, for each row in D, the ranking matrix R, also n × n, is computed. Each cell R(i, j) contains the relative ranking of distance D(i, j) compared against all distances D(i, k), for 1 ≤ k ≤ n. Tables 4 and 5 illustrate this process. Considering that in the first row D(1, 1) < D(1, 3) < D(1, 2), we assign cells R(1, 1), R(1, 3), and R(1, 2) the values 1, 2, and 3, respectively. The same process is repeated for all the other rows.

0.0 0.2 0.1
0.2 0.0 0.3
0.1 0.3 0.0
Table 4. An example of a distance matrix D.
1 3 2
2 1 3
2 3 1
Table 5. An example of a ranking matrix R, computed from the distance matrix in Table 4.

That said, to combine the results of two distance matrices, denoted D_T and D_S, with corresponding ranking matrices R_T and R_S, we simply compute a third distance matrix, denoted D_B, which corresponds to the sum of matrices R_T and R_S, and then normally compute its ranking matrix R_B.
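The ranking and Borda Count combination can be sketched as follows (the function names are ours, and distance matrices are represented as plain nested lists):

```python
def ranking_matrix(dist):
    """Row-wise ranking: ranks[i][j] is the rank (1 = smallest) of
    dist[i][j] among all entries of row i, as in Tables 4 and 5."""
    n = len(dist)
    ranks = [[0] * n for _ in range(n)]
    for i in range(n):
        order = sorted(range(n), key=lambda j: dist[i][j])
        for r, j in enumerate(order, start=1):
            ranks[i][j] = r
    return ranks

def borda_combine(rank_a, rank_b):
    """Borda Count combination: sum the two ranking matrices (D_B)
    and re-rank the summed scores row by row (R_B)."""
    n = len(rank_a)
    summed = [[rank_a[i][j] + rank_b[i][j] for j in range(n)]
              for i in range(n)]
    return ranking_matrix(summed)
```

Applied to the distance matrix of Table 4, ranking_matrix reproduces the ranking matrix shown in Table 5; ties, if any, are broken by column index.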

3. Evaluation

In this section, we describe the dataset used for our experiments and present the results of our experimental evaluation, comparing against the two single-source similarity metrics (i.e., only dialog textual content or only dialog structural features).

3.1. Dataset

We use the Switchboard dataset (SWITCHBOARD92, ) to evaluate our similarity metric. This dataset is composed of 1,154 dialogs, ranging from a minimum of 38 to a maximum of 509 turns. It is one of the most influential spoken corpora, having been applied to several different tasks. In this work, we consider the transcripts converted from the spoken dialogs to sequences of text turns.

3.2. Experiments

As described in the previous section, our approach to computing the similarity between two dialogs combines both textual content features and dialog structural features. Since there is no labeled dataset for similar dialogs, we consider an unsupervised evaluation of the metrics.

Our evaluation works as follows. Considering the textual content features with corresponding ranking matrix R_T, the structural features with corresponding ranking matrix R_S, and the matrix R_B, which is the combination of the other two, we compute the mean squared error (MSE) between each pair of matrices with the purpose of understanding how these matrices compare. These results are presented in Table 6. We can observe that the MSE between R_T and R_S is much higher than that between R_T and R_B or between R_S and R_B. This suggests that our combination method is able to incorporate into R_B the information in both R_T and R_S.

T 0.0 221591 69178
S 221591 0.0 68772
B 69178 68772 0.0
Table 6. Mean squared error between R_T (T), R_S (S), and R_B (B).
R (1) (2) (3)
Ordered matrix 242819 218111 232473
Table 7. The MSE between the computed ranking matrices and a randomly defined ranking.

To provide a better idea of the meaning of the computed MSE values, we computed a random matrix, applied different degrees of perturbation to its ranking, and computed the MSE between the original and perturbed matrices. The perturbation consists of randomly picking a pair of elements and swapping their values; the degree of perturbation corresponds to the number of swaps performed. Figure 1 plots the resulting MSEs for increasing numbers of swapped pairs. Given the values presented in Table 6, we may say that the rankings in R_T and R_S are likely to be very distinct, since the MSE between them is above that of swapping 800 pairs. On the other hand, the MSEs of R_T and R_B, and of R_S and R_B, are between the swapping of 100 and 200 pairs (about 1/5 to 1/3 of the list). Thus, R_B tends to be more similar to R_T and R_S than R_T is to R_S.

Figure 1. Mean Squared Error of different levels of perturbation of a randomly-defined matrix.
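The perturbation baseline above can be reproduced with a short sketch; the swap procedure and helper names are our assumptions, consistent with the description in the text.

```python
import random

def mse(a, b):
    """Mean squared error between two equally sized square matrices."""
    n = len(a)
    return sum((a[i][j] - b[i][j]) ** 2
               for i in range(n) for j in range(n)) / (n * n)

def perturb(matrix, swaps, seed=0):
    """Copy `matrix` and randomly swap `swaps` pairs of entries."""
    rng = random.Random(seed)
    n = len(matrix)
    out = [row[:] for row in matrix]
    for _ in range(swaps):
        i1, j1 = rng.randrange(n), rng.randrange(n)
        i2, j2 = rng.randrange(n), rng.randrange(n)
        out[i1][j1], out[i2][j2] = out[i2][j2], out[i1][j1]
    return out
```

Plotting mse(original, perturb(original, k)) for increasing k yields a curve like the one in Figure 1, against which observed MSE values can be compared.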

For a better understanding of the results, we conducted a qualitative analysis of one dialog. We selected dialog #90 (referred to as Original) and computed the most similar dialog in terms of Content, which is dialog #940; in terms of Structure, which is dialog #354; and with the combination Content+Structure, resulting in dialog #85. Owing to space constraints, we do not present the contents of these dialogs, but we present an analysis in terms of metrics for both content and structure.

In Table 8 we list the intersection of the most frequent terms in the dialogs considered the most similar ones, in terms of Content, Structure, and Content+Structure, in our case study. We can observe that the set of terms that intersects the original dialog and the most similar one in terms of content, and the set that intersects the original dialog and the most similar one in terms of structure, are quite different. Nevertheless, the terms of the most similar dialog with Content+Structure cover most of the terms in the other two sets, showing that our combination method can be effective in combining both types of metrics. To complement this analysis, we also show that the intersection of terms among all dialogs is quite small, containing only the terms 'people' and 'think', which also reinforces the potential of our approach.

Dialogs Intersection Terms
Original and Content people, course, computer, long, research, problem, think
Original and Structure people, recycling, paper, cans, bottles, plastic, program, location, collected, bins
Original and Content+Structure people, recycling, paper, guy, bins, landfill, computer, plastic, collected, trash, separate, bottles, throw, school, cans
All people, think
Table 8. Intersection of terms between the most similar dialogs

In Table 9 we present the values found for our structural features for each of the dialogs. Not surprisingly, we observe that the Original dialog and the most similar one for Structure present very close values. On the other hand, the most similar dialog for Content presents the most different values. The most similar dialog in Content+Structure does not present values as close to the Original dialog as those of Structure, but its values are still quite close compared with those of the Content dialog.

Dialogs turns avg words per turn Cycle avg turns per cycle
Original 61 6.5 6 2.5
Structure 65 6.5 5 2.6
Content 108 4.2 9 6.9
Structure+Content 75 5.7 6 2.8
Table 9. Structural metrics calculated from the dialogs used as example: the Original dialog and the most similar ones based on Structure, Content, and Content+Structure, respectively.

4. Conclusion and Future Work

In this work, we tackled the problem of dialog similarity. To the best of our knowledge, we are the first to address this problem under the premise that using only the text is not enough to represent similarity between dialogs. We presented a new metric for dialog similarity based on evidence from both textual content and dialog structure, evaluated on the Switchboard dataset. We found that considering structural properties of the dialog, such as the number of turns and the presence of cycles, can improve similarity detection accuracy when combined with textual content features (e.g., TF-IDF).

As future work, we intend to analyze other temporal properties of the dialog, such as precedence of actions; other structural features, such as speech acts and cycle size; and the use of synonyms (e.g., word vectors) to capture meaning in the similarity. Finally, we would like to analyze the correlation between our similarity metric and human judgments.