Acquiring an accurately labeled corpus is necessary for training machine learning (ML) models in various classification applications. Labeling is an expensive and labor-intensive activity that requires annotators to understand the domain well and to label instances one at a time. In this work, we explore the task of labeling multiple intents visually with the help of a semi-supervised clustering algorithm. The clustering algorithm helps learn an embedding representation of the training data that is well-suited for downstream labeling. To label, we further reduce the high-dimensional representation using UMAP McInnes et al. (2018). Since utterances are short, uncovering their semantic meaning to group them together is very challenging. SBERT Reimers and Gurevych (2019) showed that out-of-the-box BERT Devlin et al. (2018) maps sentences to a vector space that is not well suited for use with common measures like cosine similarity and Euclidean distance. This happens because the BERT network computes no independent sentence embedding, which makes it difficult to derive sentence embeddings. Researchers utilize the mean pooling of word embeddings as an approximate sentence embedding. However, results show that this practice yields inappropriate sentence embeddings that are often worse than averaging GloVe embeddings Pennington et al. (2014); Reimers and Gurevych (2019). Many researchers have developed sentence embedding methods: Skip-Thought Kiros et al. (2015), InferSent Conneau et al. (2017), USE Cer et al. (2018), and SBERT Reimers and Gurevych (2019). State-of-the-art SBERT adds a pooling operation to the output of BERT to derive a fixed-size sentence embedding and fine-tunes a Siamese network on sentence pairs from the NLI Bowman et al. (2015); Williams et al. (2017) and STSb Cer et al. (2017) datasets.
Deep Aligned Clustering (DAC) Zhang et al. (2021) introduced an effective method for clustering and discovering new intents. DAC transfers the prior knowledge of a limited number of known intents and incorporates a technique to align cluster centroids in successive training epochs. The limited known intents are used to pre-train the model. The authors use the pre-trained BERT model Devlin et al. (2018) to extract deep intent features, then pre-train the model with a randomly selected subset of labeled data. The pre-trained parameters are used to obtain well-initialized intent representations. K-Means clustering is performed on the extracted intent features, along with a method to estimate the number of clusters and an alignment strategy to obtain the final cluster assignments. The K-Means algorithm selects cluster centroids that minimize the Euclidean distance within each cluster. Due to this Euclidean distance optimization, clustering with SBERT-extracted feature embeddings naturally outperforms other embedding methods. In our work, we have extended the DAC algorithm with SBERT as the embedding backbone for clustering utterances.
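As an illustration, the Euclidean K-Means step at the core of this pipeline can be sketched as follows. This is a minimal stand-in, not the DAC implementation: it assumes the utterance embeddings (in practice produced by SBERT) are already computed, and uses a deterministic farthest-first initialization for reproducibility.

```python
import numpy as np

def kmeans(X, k, iters=50):
    """Minimal K-Means: centroids minimize within-cluster Euclidean distance."""
    # Farthest-first initialization: deterministic and spreads seeds apart.
    centroids = [X[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centroids], axis=0)
        centroids.append(X[d.argmax()])
    centroids = np.array(centroids, dtype=float)
    for _ in range(iters):
        # Assign each embedding to its nearest centroid (Euclidean distance).
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        # Move each centroid to the mean of its assigned points.
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Toy stand-ins for SBERT utterance embeddings: two well-separated groups.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (20, 8)), rng.normal(5.0, 0.1, (20, 8))])
labels, centroids = kmeans(X, k=2)
```

In the actual system, X comes from the SBERT encoder, k from DAC's cluster-number estimate, and the centroid alignment is applied across training epochs.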
In semi-supervised learning, the seed set is selected using a sampling strategy: “A simple random sample of size n consists of n individuals from the population chosen in such a way that every set of n individuals has an equal chance to be the sample actually selected.” Moore and McCabe (1989). However, these sample subsets may not represent the original data adequately because randomization methods do not exploit the correlations in the original population. In a stratified random sample, the population is first classified into groups (called strata) with similar characteristics. A simple random sample is then chosen from each stratum separately, and these simple random samples are combined to form the overall sample. Stratified sampling can help ensure that there are enough observations within each stratum to make meaningful inferences. DAC uses random sampling for seed selection. In this work, we explore a couple of stratified sampling approaches for seed selection in the hope of mitigating the limitations of random sampling and improving the clustering outcome.
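As a hedged illustration, stratified seed sampling can be sketched as below. The strata here are known labels for clarity, whereas in practice labels are unknown and our PCS method (Section 2.2) substitutes predicted cluster assignments; the corpus and stratum names are hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(utterances, labels, frac, seed=0):
    """Draw a simple random sample of size frac*|stratum| from each stratum,
    then combine the per-stratum samples into the overall seed set."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for utt, lab in zip(utterances, labels):
        strata[lab].append(utt)
    sample = []
    for members in strata.values():
        n = max(1, round(frac * len(members)))  # at least one per stratum
        sample.extend(rng.sample(members, n))
    return sample

corpus = [f"utt{i}" for i in range(100)]
groups = ["billing"] * 70 + ["refund"] * 30   # hypothetical strata
seed_set = stratified_sample(corpus, groups, frac=0.1)
```

With a 10% labeled ratio, each stratum contributes proportionally (7 and 3 utterances here), so minority strata are guaranteed representation that plain random sampling might miss.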
Another issue we address in this work is class imbalance. Seed selection generally yields an imbalanced dataset, which in turn impairs the predictive capability of classification algorithms Douzas et al. (2018). Some methods manipulate the training data, aiming to change the class distribution towards a more balanced one by undersampling or oversampling Kotsiantis et al. (2006); Galar et al. (2011). SMOTE Chawla et al. (2002) is a popular oversampling technique proposed to improve on random oversampling. In one variant of SMOTE, borderline minority instances are heuristically selected and linearly interpolated to create synthetic samples. In this work, we take inspiration from the SMOTE method: we choose borderline minority instances and paraphrase them using a sequence-to-sequence paraphrasing model. The paraphrases provide natural and meaningful augmentations of the dataset that are not synthetic.
Previous work has shown that data augmentation can boost performance on text classification tasks Barzilay and McKeown (2001); Dolan and Brockett (2005); Lan et al. (2017); Hu et al. (2019). Wieting et al. (2017) used Neural Machine Translation (NMT) Sutskever et al. (2014) to translate the non-English side of parallel text to get English-English paraphrase pairs. This method has been scaled to generate large paraphrase corpora Wieting and Gimpel (2018). Prior work in learning paraphrases has used autoencoders Socher et al. (2011), encoder-decoder architectures as in BART Lewis et al. (2019), and other learning frameworks such as NMT Sokolov and Filimonov (2020). Data augmentation using paraphrasing is a simple yet effective strategy that we explore in this work to improve clustering.
For interactive visual labeling of utterances, we build on the learnt embedding representation of the data and fine-tune it using the clustering. DAC learns to cluster with a weak self-supervised signal to update its representation and to optimize both local (via K-Means) and global (via cluster alignment) information. This results in an optimized intent-level feature representation. This high-dimensional latent representation can be reduced to 2-3 dimensions using Uniform Manifold Approximation and Projection (UMAP) McInnes et al. (2018). We use the Rasa WhatLies library (rasahq.github.io/whatlies/) Warmerdam et al. (2020) to extract the UMAP embeddings. For interactive labeling, we utilize an interactive visualization library called Human Learn (koaning.github.io/human-learn/) Warmerdam et al. (2021) that allows us to draw decision boundaries on a plot. Building on top of the Rasa Bulk Labelling UI (github.com/RasaHQ/rasalit/blob/main/notebooks/bulk-labelling/bulk-labelling-ui.ipynb) Warmerdam (2020); Bokeh Development Team (2018), we augment the interface with our learnt representation for interactive labeling. Although we focus on NLU, other studies like ‘Conversation Learner’ Shukla et al. (2020) focus on interactive dialogue managers (DM) with human-in-the-loop annotation of dialogue data via machine teaching. Note also that although the majority of task-oriented SDS still involves defining intents/entities, there are recent examples that argue for a richer target representation than the classical intent/entity model, such as SMCalFlow Andreas et al. (2020).
Figure 1 describes the semi-supervised labeling process. We start with the unlabeled utterance corpus and apply seed sampling methods to select a small subset of the corpus. Once the selected subset is manually labeled, we address the data imbalance with our paraphrase-based minority oversampling method. We can also augment the labeled corpus with paraphrasing to provide more data for the clustering process. The DAC algorithm is applied with improved embeddings to extract the utterance representation for interactive labeling.
| Dataset | #Classes | #Train | #Valid | #Test | Vocab | Length (max / mean) |
|---|---|---|---|---|---|---|
| CLINC | 150 | 18,000 | 2,250 | 2,250 | 7,283 | 28 / 8.31 |
| BANKING | 77 | 9,003 | 1,000 | 3,080 | 5,028 | 79 / 11.91 |
| KidSpace | 19 | 1,289 | 445 | 419 | 2,581 | 74 / 5.10 |
2.1 Sentence Representation
For sentence representation, we use the HuggingFace Transformers model bert-base-nli-stsb-mean-tokens (https://huggingface.co/sentence-transformers/bert-base-nli-stsb-mean-tokens/tree/main). This model was first fine-tuned on a combination of the Stanford Natural Language Inference (SNLI) dataset Bowman et al. (2015) (570K sentence pairs labeled contradiction, entailment, or neutral) and the Multi-Genre Natural Language Inference dataset Williams et al. (2017) (430K diverse sentence pairs with the same labels as SNLI), then on the Semantic Textual Similarity benchmark (STSb) Cer et al. (2017) training set (which provides labels between 0 and 5 for the semantic relatedness of sentence pairs). This model achieves a performance of 85.14 (Spearman’s rank correlation between the cosine similarity of the sentence embeddings and the gold labels) on the STSb regression evaluation. For context, averaged BERT embeddings achieve a performance of 46.35 on this evaluation Reimers and Gurevych (2019).
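A minimal sketch of this evaluation protocol, assuming the sentence embeddings are given (toy vectors stand in for real SBERT outputs): the Spearman score is the Pearson correlation of the ranks of cosine similarities against the gold labels.

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine similarity between two sentence embeddings."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def spearman(x, y):
    """Spearman's rank correlation (no tie handling, for illustration)."""
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    return float(np.corrcoef(rx, ry)[0, 1])

# Toy sentence-pair embeddings and gold relatedness labels (0-5 scale).
pairs = [(np.array([1.0, 0.0]), np.array([1.0, 0.1])),   # near-duplicates
         (np.array([1.0, 0.0]), np.array([1.0, 1.0])),   # related
         (np.array([1.0, 0.0]), np.array([0.0, 1.0]))]   # unrelated
gold = np.array([5.0, 2.5, 0.0])
sims = np.array([cosine_sim(u, v) for u, v in pairs])
rho = spearman(sims, gold)
```

Here the cosine-similarity ordering matches the gold ordering exactly, so rho is 1.0; a score of 85.14 on STSb corresponds to rho ≈ 0.8514 over the full test set.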
2.2 Seed Selection
We explore two selection and sampling strategies for seed selection as follows:
Cluster-based Selection (CB): In this method, we apply K-Means clustering on the utterances to partition the data into as many subsets as the seed-set size. For example, if 10% of the data amounts to 100 utterances, this method creates 100 clusters from the dataset. We then pick each centroid’s nearest neighbor as part of the seed set, for all the clusters. The naive intuition for this strategy is that it creates a large number of clusters spread all over the data distribution (roughly N/K instances per cluster on average for N uniformly distributed instances over K clusters).
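The centroid-nearest-neighbor step of CB can be sketched as follows; this is a minimal illustration in which hand-made 2-D embeddings and centroids stand in for a real K-Means run over SBERT vectors.

```python
import numpy as np

def nearest_to_centroids(X, centroids):
    """For each centroid, return the index of its nearest utterance embedding
    (Euclidean distance); these indices form the CB seed set."""
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
    return d.argmin(axis=0)

X = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.3, 5.0]])
centroids = np.array([[0.05, 0.0], [5.05, 5.0]])  # e.g., output of K-Means
seed_idx = nearest_to_centroids(X, centroids)
```

Selecting real utterances nearest to the centroids (rather than the centroids themselves) guarantees that every seed is an actual instance that an annotator can label.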
Predicted Cluster Sampling (PCS): This is a stratified sampling method where we first predict the number of clusters and then sample instances from each cluster. We use the cluster-size estimation method from the DAC work as follows: K-Means is performed with a large K′ (initialized as twice the ground-truth number of classes). The assumption is that real clusters tend to be dense, and the cluster mean-size threshold is assumed to be N/K′. The number of clusters is then estimated as

K = Σ_{i=1..K′} 1(|S_i| ≥ N/K′),

where |S_i| is the size of the i-th produced cluster and 1(·) is an indicator function: it outputs 1 if the condition is satisfied, and 0 if not. The method seems to perform well, as reported in the DAC work.
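The estimate above can be sketched directly; the assignments and K′ below are toy values chosen for illustration.

```python
import numpy as np

def estimate_k(assignments, k_prime):
    """DAC-style estimate: count the clusters whose size reaches the mean-size
    threshold N / K', dropping low-confidence (small) clusters."""
    n = len(assignments)
    threshold = n / k_prime          # mean cluster size under K' clusters
    sizes = np.bincount(assignments, minlength=k_prime)
    return int((sizes >= threshold).sum())

# 100 points assigned to 5 of K' = 8 clusters; threshold is N/K' = 12.5,
# so only the three dense clusters (sizes 40, 30, 20) are counted.
assignments = np.repeat([0, 1, 2, 3, 4], [40, 30, 20, 6, 4])
k = estimate_k(assignments, k_prime=8)
```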
2.3 Data Balancing and Augmentation
For handling data imbalance, we propose a paraphrasing-based method to over-sample the minority classes. The method is described as follows:
1. For every instance p_i in the minority class P, we calculate its m nearest neighbors from the whole training set T. The number of majority examples among the m nearest neighbors is denoted by m′ (0 ≤ m′ ≤ m).
2. If m′ = m, i.e., all the m nearest neighbors of p_i are majority examples, p_i is considered to be noise and is not operated on in the following steps. If m/2 ≤ m′ < m, namely the number of p_i’s majority nearest neighbors is larger than the number of its minority ones, p_i is considered to be easily misclassified and put into a set DANGER. If 0 ≤ m′ < m/2, p_i is safe and does not need to participate in the following steps.
3. The examples in DANGER are the borderline data of the minority class P, and we can see that DANGER ⊆ P. We set DANGER = {p′_1, p′_2, …, p′_d}.
4. For each borderline instance (that can be easily misclassified), we paraphrase the instance. For paraphrasing, we fine-tuned the BART sequence-to-sequence model Lewis et al. (2019) on a combination of 3 datasets: ParaNMT Wieting and Gimpel (2018), PAWS Zhang et al. (2019); Yang et al. (2019), and the MSRP corpus Dolan and Brockett (2005).
5. We classify the paraphrased sample with a RoBERTa-based Liu et al. (2019) classifier fine-tuned on the labeled data and only add the instance if the classifier predicts the same label as the minority instance. We call this the ‘ParaMote’ method in our experiments. Without this last step (5), we call the overall approach our ‘Paraphrasing’ method.
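The borderline selection described above (keeping minority instances whose m nearest neighbors are mostly, but not all, majority examples) can be sketched as below; the one-dimensional toy embeddings and the neighborhood size m = 5 are arbitrary choices for illustration.

```python
import numpy as np

def borderline_minority(X, y, minority_label, m=5):
    """Return indices of DANGER instances: minority points with at least half,
    but not all, of their m nearest neighbors in other (majority) classes."""
    danger = []
    for i in np.where(y == minority_label)[0]:
        d = np.linalg.norm(X - X[i], axis=1)
        neighbors = np.argsort(d)[1:m + 1]          # skip the point itself
        m_prime = int((y[neighbors] != minority_label).sum())
        if m / 2 <= m_prime < m:                    # borderline: easily misclassified
            danger.append(int(i))
    return danger

# Toy 1-D embeddings: a majority cluster near 0, a safe minority cluster near 5,
# and two minority points (0.6, 0.7) sitting on the majority boundary.
X = np.array([[v] for v in [0.0, 0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 5.0, 5.1, 5.2]])
y = np.array([0] * 5 + [1] * 5)
danger = borderline_minority(X, y, minority_label=1)
```

Only the two boundary points are flagged; the dense minority cluster is left alone, which is what makes borderline selection cheaper than oversampling the entire minority class.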
We use the Paraphrasing model and the classifier as a data augmentation method to augment the labeled training data (refer to as ‘Aug’ in our experiments).
Note that we augment the paraphrased sample if it belongs to the same minority class (‘ParaMote’) as we do not want to inject noise while solving the data imbalance problem. The opposite is also possible for other purposes such as generating semantically similar adversaries Ribeiro et al. (2018).
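As a hedged sketch, the ‘ParaMote’ consistency check reduces to a simple filter; the paraphraser and classifier below are toy stand-ins (the real system uses the fine-tuned BART and RoBERTa models described above), and the example texts are hypothetical.

```python
def paramote_filter(instances, label, paraphrase, classify):
    """Keep a paraphrase only when the classifier assigns it the same
    (minority) label as the source instance."""
    kept = []
    for text in instances:
        candidate = paraphrase(text)
        if classify(candidate) == label:   # label-consistency check
            kept.append(candidate)
    return kept

# Toy stand-ins for the BART paraphraser and the RoBERTa classifier.
paraphrase = lambda t: t.replace("want to", "would like to")
classify = lambda t: "refund" if "refund" in t else "other"
out = paramote_filter(["i want to get a refund", "i want to cancel"], "refund",
                      paraphrase, classify)
```

The second paraphrase is discarded because the classifier no longer predicts the minority label, which is exactly the noise-rejection behavior step 5 is meant to provide.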
3 Experimental Results
To conduct our experiments, we use the BANKING Casanueva et al. (2020) and CLINC Larson et al. (2019) datasets, similar to the DAC work Zhang et al. (2021). We also use another dataset called KidSpace, which includes utterances from a multimodal learning application for 5-to-8-year-old children Sahay et al. (2019); Anderson et al. (2018). We hope to utilize this system to label future utterances into relevant intents. Table 1 shows the statistics of the 3 datasets; 25% of classes, chosen at random, are kept unseen during pre-training.
| Dataset | BERT | Data Bal/Aug | Seed Selection | NMI | ARI | ACC |
3.1 Sentence Representation
The choice of pre-trained embeddings has the largest impact on the clustering results. We observe huge performance gains for the single-domain KidSpace and BANKING datasets. For the multi-domain and diverse CLINC dataset with the largest number of intents, we observe a slight degradation in performance. While this needs further investigation, we believe the dataset is diverse enough and already has very high clustering scores, so the improved sentence representations may not help further.
3.2 Seed Selection
Seed selection is an important problem for limited-data tasks. The law of large numbers does not hold, and a random sampling strategy may lead to larger variance in outcomes. We explored Cluster-based Selection (CB) and Predicted Cluster Sampling (PCS), besides other techniques (see detailed results in Appendix A.1). Our results trend towards smaller standard deviations and similar performance for the BANKING and CLINC datasets with the PCS method. Surprisingly, this does not hold for the KidSpace dataset, which needs further investigation. Figure 2 shows the KidSpace data visualised with various colored clusters and centroids. While we choose seed data non-randomly, we still hide 25% of the classes at random (to enable unknown intent discovery). Our recommendation is to use PCS when one cannot run the training multiple times, to reduce the variance in results.
3.3 Data Balancing for Imbalanced Data
Figure 3 shows the histogram of the seed data, which is highly imbalanced and may adversely impact clustering performance. We apply the Paraphrasing and ParaMote methods to balance the data. Paraphrasing almost always improves performance, while the additional classifier that checks for class-label consistency (ParaMote) does not help.
3.4 Data Augmentation
We augmented the entire labeled dataset, including the majority classes, by 3x using Paraphrasing (with class-label consistency) in our experiments. We aimed to understand whether this could yield a better pre-trained model that could eventually improve the clustering outcome. We do not observe any performance gains from the augmentation process.
3.5 Interactive Data Labeling
Our goal in this work is to develop a well-segmented learnt representation of the data with deep clustering and then to use the learnt representation to enable fast visual labeling. Figure 4 shows the two clustered representations: one without pre-training, using BERT-base embeddings, the other with a fine-tuned sentence BERT representation and pre-training. We obtain well-separated visual clusters using the latter approach. We use the drawing library human-learn to visually label the data. Figure 5 shows a selected region of the data with various labels and class confusion. We notice that this representation not only helps with labeling but also helps with correcting labels and identifying utterances that belong to multiple classes and cannot be easily segmented. For example, ‘children-valid-answer’ and ‘children-invalid-grow’ (invalid answers) contain semantically similar content, depending on the game logic of the interaction. We perhaps need to group these together and use an alternative logic for implementing the game semantics.
In this exploration, we have used a fine-tuned sentence BERT model to significantly improve the clustering performance. The Predicted Cluster Sampling strategy for seed data selection seems to be a promising approach, with possibly lower variance in clustering performance for smaller data labeling tasks. Paraphrasing-based data imbalance handling slightly improves the clustering performance as well. Finally, we have utilized the learnt representation to develop a visual intent labeling system.
- Kid space: interactive learning in a smart environment. In Proceedings of the Group Interaction Frontiers in Technology, GIFT’18, New York, NY, USA.
- Task-oriented dialogue as dataflow synthesis. Transactions of the Association for Computational Linguistics 8, pp. 556–571.
- Extracting paraphrases from a parallel corpus. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, Toulouse, France, pp. 50–57.
- Bokeh: Python library for interactive visualization.
- A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, pp. 632–642.
- Efficient intent detection with dual sentence encoders. In Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI, Online, pp. 38–45.
- SemEval-2017 task 1: Semantic textual similarity – multilingual and cross-lingual focused evaluation. CoRR abs/1708.00055.
- Universal sentence encoder. CoRR abs/1803.11175.
- SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16, pp. 321–357.
- Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp. 670–680.
- BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805.
- Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005).
- Improving imbalanced learning through a heuristic oversampling method based on k-means and SMOTE. Information Sciences 465, pp. 1–20.
- A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 42 (4), pp. 463–484.
- Large-scale, diverse, paraphrastic bitexts via sampling and clustering. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), Hong Kong, China, pp. 44–54.
- Skip-thought vectors. In Advances in Neural Information Processing Systems, Vol. 28.
- Handling imbalanced datasets: a review. GESTS International Transactions on Computer Science and Engineering 30 (1), pp. 25–36.
- A continuously growing dataset of sentential paraphrases. CoRR abs/1708.00391.
- An evaluation dataset for intent classification and out-of-scope prediction. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 1311–1316.
- BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. CoRR abs/1910.13461.
- RoBERTa: A robustly optimized BERT pretraining approach. CoRR abs/1907.11692.
- UMAP: Uniform Manifold Approximation and Projection for dimension reduction. ArXiv e-prints.
- Introduction to the Practice of Statistics.
- GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, pp. 1532–1543.
- Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing.
- Semantically equivalent adversarial rules for debugging NLP models. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 856–865.
- Modeling intent, dialog policies and response adaptation for goal-oriented interactions. In Proceedings of the 23rd Workshop on the Semantics and Pragmatics of Dialogue, London, United Kingdom.
- Conversation Learner – a machine teaching tool for building dialog managers for task-oriented dialog systems. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Online, pp. 343–349.
- Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In Proceedings of the 24th International Conference on Neural Information Processing Systems, NIPS’11, Red Hook, NY, USA, pp. 801–809.
- Neural machine translation for paraphrase generation. arXiv preprint arXiv:2006.14223.
- Sequence to sequence learning with neural networks. CoRR abs/1409.3215.
- koaning/human-learn: 0.2.5.
- Rasa. Note: the relevant notebook can be found on GitHub: https://github.com/RasaHQ/rasalit/blob/main/notebooks/bulk-labelling/bulk-labelling-ui.ipynb
- Going beyond t-SNE: Exposing whatlies in text embeddings. In Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS), Online, pp. 52–60.
- ParaNMT-50M: Pushing the limits of paraphrastic sentence embeddings with millions of machine translations. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 451–462.
- Learning paraphrastic sentence embeddings from back-translated bitext. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Denmark, pp. 274–285.
- A broad-coverage challenge corpus for sentence understanding through inference. CoRR abs/1704.05426.
- PAWS-X: A cross-lingual adversarial dataset for paraphrase identification. In Proceedings of EMNLP.
- Discovering new intents with deep aligned clustering. In Proceedings of the AAAI Conference on Artificial Intelligence.
- PAWS: Paraphrase adversaries from word scrambling. In Proceedings of NAACL.
Appendix A Appendix
A.1 Additional Experimental Results
In addition to the Cluster-based Selection (CB) and Predicted Cluster Sampling (PCS) methods, we explored other seed selection techniques and compared them with random sampling: the Known Cluster-based Selection (KCB) and Cluster-based Sentence Embedding (CSE) methods. KCB is a variation of CB where we cluster into as many subsets as there are known labels (based on the known-class ratio) and pick a certain percentage of data (based on the labeled ratio) from each cluster’s data points. CSE is another variation of CB where, instead of BERT word embeddings, we use the sentence embedding model as the pre-trained representation before running K-Means (the rest is the same as the CB method).
Table 3 presents detailed clustering performance results on three datasets using all five seed selection methods we explored, with varying labeled ratio and BERT embeddings (standard/BERT-base vs. sentence/SBERT models). In Table 4, we expand our analysis on the KidSpace dataset with data balancing/augmentation approaches on top of these five seed selection methods, once again with standard/sentence BERT embeddings. Table 5 presents additional results on the BANKING dataset to compare data balancing/augmentation methods on top of standard vs. the sentence BERT representations.
| Dataset | BERT | Data Bal/Aug | Seed Selection | labeled_ratio | NMI | ARI | ACC |
| Dataset | BERT | Data Bal/Aug | Seed Selection | labeled_ratio | NMI | ARI | ACC |