TSCAN: Dialog Structure Discovery Using SCAN

07/13/2021 ∙ by Apurba Nath, et al. ∙ 0

Can we discover dialog structure by dividing utterances into labelled clusters? Can these labels be generated from the data? Typically, dialog structure discovery requires an ontology; by using unsupervised classification and self-labelling, however, we are able to infer this structure without any labels or ontology. In this paper we apply SCAN (Semantic Clustering by Adopting Nearest neighbors) to dialog data. We use BERT for the pretext task and an adaptation of SCAN for clustering and self-labelling. These clusters are used to identify transition probabilities and create the dialog structure. The self-labelling method used in SCAN makes these structures interpretable, as every cluster has a label. As the approach is unsupervised, choosing evaluation metrics is a challenge; we use statistical measures as proxies for structure quality.




1 Introduction and prior work

Dialog structure discovery is an important problem given the increased effort to automate text and voice response systems. Unlike simulated dialogs or human-bot interactions, human-human interactions are richer (larger vocabulary and more variation) and have more turns. For example, typical dialog datasets created from SimDial have a vocabulary of fewer than 1.5k words, with 20 n-grams covering the majority of generated responses. In comparison, our internal human-human task-oriented dialog data includes a common closed vocabulary (proper nouns excluded) of over 5k words, with the 20 most common n-grams failing to cover even 1% of utterances.

Many approaches rely on generative models trained on the dialog data, e.g. the VRNN approach (Shi, Zhao) [2] or DVAE-GNN (Xu, Che) [3]. There are also approaches using transformer models such as BERT. In our experience, BERT trained on large in-domain data captures rich semantic information; we can cluster on these embeddings, but the balance and interpretability of the resulting clusters remain a challenge.

SCAN [1] has achieved strong results on image classification without labels. The approach consists of obtaining semantically meaningful features, learning a clustering model, and then self-labelling for interpretable clusters. It uses image transforms and nearest neighbors. The objective function used to train the clustering model includes both confidence and consistency terms, which creates balanced clusters. It is also not negatively impacted by overclustering; in fact, in our experiments we depend on overclustering.

2 Our approach

Each dialog is made of $T$ turns, where $a_t$ is the agent utterance at the $t$-th dialog turn and $u_t$ the user utterance. Unlike other approaches, we do not classify the user utterance as a response: in our datasets we have seen multiple cases where the user utterance is a query and the agent utterance is the response. These dialogs are task oriented and may have multiple exchanges (multiple tasks) in the same dialog. Our goal is to eventually find any correlation between $a_t$ and $u_t$, and between $u_t$ and $a_{t+1}$. We first reduce the size of the space (which is large because of vocabulary variety) by assigning the utterances to clusters. With 20 cluster states, this becomes a problem of matching the clusters among each other. For example, assuming $a_t$ belongs to agent cluster $A_i$ and the user utterance $u_t$ belongs to user cluster $U_j$, we can group them with any other turn that similarly has $A_i$ and $U_j$. We create transition probabilities between the cluster combinations $(A_i, U_j)$; these transition probabilities are then used for the dialog states.

A simplified version of these steps is:

  • Use an in-domain trained BERT for semantic embeddings.

  • Train the SCAN model with nearest neighbors on 10k agent utterances and user utterances.

  • Create clusters using this model and apply self-labels to them.

  • For each dialog turn, assign the agent cluster and the customer cluster.

  • Create a transition map from each agent turn to the customer turn, and from each customer turn to the next agent turn.

  • Create dialog flows with these transition states; each cluster is represented by its equivalent label.
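The transition-map step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the input layout (each dialog as a list of `(agent_cluster, user_cluster)` pairs) and the cluster labels in the toy data are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def normalise(counts):
    """Turn nested transition counts into per-source probabilities."""
    probs = {}
    for src, counter in counts.items():
        total = sum(counter.values())
        probs[src] = {dst: n / total for dst, n in counter.items()}
    return probs


def transition_probabilities(dialogs):
    """Build agent->user and user->next-agent transition maps.

    dialogs: list of dialogs, each a list of (agent_cluster, user_cluster)
    pairs, one pair per turn (hypothetical layout for this sketch).
    """
    agent_to_user = defaultdict(Counter)
    user_to_agent = defaultdict(Counter)
    for turns in dialogs:
        for t, (agent, user) in enumerate(turns):
            # Transition within a turn: agent cluster -> user cluster.
            agent_to_user[agent][user] += 1
            # Transition across turns: user cluster -> next agent cluster.
            if t + 1 < len(turns):
                user_to_agent[user][turns[t + 1][0]] += 1
    return normalise(agent_to_user), normalise(user_to_agent)


# Toy data with made-up cluster labels.
dialogs = [
    [("A_greet", "U_query"), ("A_answer", "U_thanks")],
    [("A_greet", "U_query"), ("A_clarify", "U_detail")],
    [("A_greet", "U_smalltalk"), ("A_answer", "U_thanks")],
]
a2u, u2a = transition_probabilities(dialogs)
```

Keeping the two maps separate mirrors the two kinds of transition listed in the steps: agent turn to customer turn, and customer turn to the next agent turn.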

2.1 Models:

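The BERT model needs to satisfy equation (1) of the SCAN paper, replicated here for convenience:

$$\min_{\theta} \; d\big(\Phi_{\theta}(X_i), \Phi_{\theta}(T[X_i])\big)$$

where $\Phi_{\theta}$ is the embedding model with weights $\theta$, $X_i$ is a sample, $T$ is a transformation and $d$ is a distance. Fine-tuning or training on a large volume of in-domain data helps us create such a model. Any MLM evaluation task can be used to check the semantic quality of the model.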

The SCAN model needs to satisfy equation (2) of the SCAN paper; a simplified form of that equation is

$$\mathcal{L} = \mathcal{L}_{\text{consistency}} - \lambda \, \mathcal{L}_{\text{entropy}}$$

The consistency loss is the BCE between anchors and neighbors, while the entropy loss is the entropy of the mean anchor probabilities. The entropy term helps balance the distribution across clusters.
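To make the two loss terms concrete, here is a small pure-Python sketch. It assumes the consistency term is a binary cross-entropy with target 1 on the dot product of anchor and neighbor cluster probabilities, and that the entropy term is the entropy of the mean anchor probabilities. The function name, the `entropy_weight` value and the `eps` guard are illustrative choices, not values from the paper.

```python
import math


def scan_loss(anchor_probs, neighbor_probs, entropy_weight=2.0, eps=1e-8):
    """Return (total, consistency, entropy) for paired cluster probabilities.

    anchor_probs / neighbor_probs: lists of probability vectors, where
    neighbor_probs[i] belongs to a nearest neighbor of anchor i.
    """
    n = len(anchor_probs)
    k = len(anchor_probs[0])

    # Consistency: BCE with target 1 on the anchor/neighbor dot product.
    consistency = 0.0
    for p, q in zip(anchor_probs, neighbor_probs):
        similarity = sum(pi * qi for pi, qi in zip(p, q))
        consistency += -math.log(max(similarity, eps))
    consistency /= n

    # Entropy of the mean anchor distribution; high when clusters are balanced.
    mean_probs = [sum(p[c] for p in anchor_probs) / n for c in range(k)]
    entropy = -sum(m * math.log(max(m, eps)) for m in mean_probs)

    # Maximising entropy (hence the minus sign) discourages cluster collapse.
    total = consistency - entropy_weight * entropy
    return total, consistency, entropy


balanced = [[1.0, 0.0], [0.0, 1.0]]
collapsed = [[1.0, 0.0], [1.0, 0.0]]
balanced_total, _, balanced_entropy = scan_loss(balanced, balanced)
collapsed_total, _, collapsed_entropy = scan_loss(collapsed, collapsed)
```

In this toy case both assignments are perfectly consistent, so only the entropy term separates them: the balanced assignment gets the lower total loss.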

2.2 Our experiences

Unlike the original SCAN implementation, we do not use transformations or augmentation; instead we rely on the variety of the data to provide relevant neighbors. We also do not build a pretext model but use a BERT model trained on in-domain data instead. Our experiments show that, in spite of these deviations from the SCAN approach, we are able to create an interpretable dialog structure from the balanced, well-defined clusters created by SCAN. We use two statistical measures as evaluation metrics for cluster quality, and our experiments show that TSCAN (text SCAN) does better than K-means on both.
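Mining neighbors directly from the data, without augmentation, could look like the sketch below. It assumes cosine similarity over utterance embeddings and a brute-force search; the paper does not state which similarity or index was used, and the function name and toy vectors are hypothetical.

```python
import math


def nearest_neighbors(embeddings, k):
    """Return, for each embedding, the indices of its k most similar others.

    Brute-force cosine similarity; assumes every vector is non-zero.
    """
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)

    neighbors = []
    for i, anchor in enumerate(embeddings):
        scored = [
            (cosine(anchor, other), j)
            for j, other in enumerate(embeddings)
            if j != i  # an utterance is not its own neighbor
        ]
        scored.sort(reverse=True)
        neighbors.append([j for _, j in scored[:k]])
    return neighbors


# Toy 2-d "embeddings": two near-duplicate pairs.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_neighbors = nearest_neighbors(toy_embeddings, k=1)
```

For 10k utterances a brute-force search is quadratic in the number of utterances, so an approximate nearest-neighbor index would be a more practical choice.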

2.3 Evaluation Metrics

The goal of clustering is to balance the utterances evenly between the clusters, so that no cluster is too big and each cluster has nearly the same number of members. To compute the distribution score, we use

$$S = \sum_{i} x_i \ln x_i$$

where $x_i$ is the ratio of the members of cluster $i$ to the total number of elements. Though this number is not comparable across cluster sizes (numbers of clusters), within a cluster size it is a good indicator of the distribution. For comparison across cluster sizes we can use the deviation from the ideal distribution, for which every $x_i = 1/K$ and the score is $-\ln K$ for $K$ clusters.
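A short sketch of the balance measure, assuming the score is the sum of $x \ln x$ over clusters (negative entropy). This assumption reproduces the ideal value of about -2.995 reported for 20 clusters in the results. The function names are illustrative.

```python
import math
from collections import Counter


def distribution_score(assignments):
    """Sum of x * ln(x) over clusters, x being each cluster's share of items."""
    counts = Counter(assignments)
    total = len(assignments)
    return sum((n / total) * math.log(n / total) for n in counts.values())


def ideal_score(num_clusters):
    """Score of a perfectly uniform assignment over num_clusters clusters."""
    return -math.log(num_clusters)


# 20 clusters with 5 items each: perfectly balanced.
balanced_assignments = [c for c in range(20) for _ in range(5)]
# 20 clusters, but cluster 0 holds most of the items.
skewed_assignments = [0] * 81 + list(range(1, 20))

balanced_score = distribution_score(balanced_assignments)
skewed_score = distribution_score(skewed_assignments)
```

Under this definition a lower (more negative) score means a more even spread, and the gap between a score and `ideal_score(K)` gives a deviation that can be compared across different numbers of clusters.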


We expect similar utterances to end up in the same cluster. As we already have some pre-trained intent models, we can check that utterances with the same intent end up together. We want the number of clusters an intent is spread across to be as low as possible. A good measure of togetherness is the mean and standard deviation of cluster membership. For example, if we have a greeting intent, we would want all the greeting utterances to end up in the same cluster. A scenario where they are spread between 3 different clusters out of 20 is better than one where they are spread between 8. The mean and standard deviation of these two scenarios give a good indication of the distribution.
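The togetherness check can be sketched as follows for a single intent. The sketch counts how many utterances of the intent land in each cluster and summarises those counts. The function name is hypothetical, and only clusters that actually receive an utterance are counted, which is an assumption about how the statistic was computed.

```python
from collections import Counter
from statistics import mean, variance


def intent_spread(cluster_ids):
    """Summarise how one intent's utterances are spread over clusters.

    cluster_ids: the cluster assigned to each utterance of a single intent.
    """
    counts = sorted(Counter(cluster_ids).values())
    return {
        "nobs": len(counts),               # clusters the intent appears in
        "minmax": (counts[0], counts[-1]),
        "mean": mean(counts),              # utterances per occupied cluster
        "variance": variance(counts) if len(counts) > 1 else 0.0,
    }


# Ten utterances of one intent under two hypothetical clusterings.
concentrated = intent_spread([3] * 8 + [7, 11])
scattered = intent_spread(list(range(8)) + [0, 1])
```

Fewer occupied clusters (`nobs`) together with a higher mean per cluster indicates that the intent stays together.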

3 Results

For our internal dataset, we were able to arrive at interpretable dialog states.

For the clustering step, we compared SCAN with K-means using 20 clusters. K-means shows a distribution score of -2.64 and SCAN achieves -2.77, while the ideal distribution scores -2.995.

Similarly, for the two intents greeting and payment_inquiry, the results were:

intent: payment_inquiry
  K-means: nobs=8, minmax=(6, 43), mean=12.12, variance=162.98, skewness=2.07, kurtosis=2.63
  SCAN:    nobs=6, minmax=(6, 60), mean=16.16, variance=468.97, skewness=1.73, kurtosis=1.07

intent: greeting
  K-means: nobs=6, minmax=(0, 91), mean=16.17, variance=1347.77, skewness=1.78, kurtosis=1.18
  SCAN:    nobs=5, minmax=(0, 95), mean=19.4, variance=1786.30, skewness=1.50, kurtosis=0.25

These clusters were then used to map transition probabilities; all transitions with probability less than 0.1 were ignored. In spite of this, 80% of agent cluster groups transitioned to fewer than 3 user cluster groups. Similarly, each user cluster group mapped to two agent cluster groups. The three user cluster groups for an agent cluster are expected, as there is a rich variety of answers and most of the questions are open questions rather than closed yes/no questions.
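The pruning and fan-out count described above can be illustrated with a small sketch. The 0.1 threshold comes from the text; the function names, the nested-dictionary layout and the toy probabilities are assumptions made for the example.

```python
def prune_transitions(transitions, threshold=0.1):
    """Drop transitions whose probability is below the threshold."""
    return {
        src: {dst: p for dst, p in dsts.items() if p >= threshold}
        for src, dsts in transitions.items()
    }


def share_with_fanout_below(transitions, limit=3):
    """Fraction of source clusters that lead to fewer than `limit` targets."""
    if not transitions:
        return 0.0
    small = sum(1 for dsts in transitions.values() if len(dsts) < limit)
    return small / len(transitions)


# Made-up agent -> user transition probabilities.
toy_transitions = {
    "A0": {"U0": 0.60, "U1": 0.35, "U2": 0.05},
    "A1": {"U0": 0.30, "U1": 0.30, "U2": 0.30, "U3": 0.10},
}
pruned = prune_transitions(toy_transitions)
share = share_with_fanout_below(pruned)
```

In the toy data, "A0" keeps two targets after pruning and "A1" keeps four, so half of the agent clusters have a fan-out below three.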