Nowadays, Language Models (LMs) and, more generally, NLP methodologies are becoming a pivotal element in modern communication systems. In dialogue systems, understanding the actual intention behind a conversation initiated by a user is fundamental, and so is understanding the user's intents underlying each dialogue utterance Ni et al. (2021). Recently, there has been a growing interest in such applications and, in this context, multiple datasets, as well as challenges, have been proposed amazon-research (2022); Casanueva et al. (2020); Zhang et al. (2021). Unfortunately, most of the systems used in production for common business use-cases are based on models trained on a finite set of possible intents, with limited or no ability to generalize to novel unseen intents Qi et al. (2020). This is a major blocker, since most applications do not fit this finite-set-of-intents assumption, but instead require constant re-evaluation based on users' inputs and feedback. In order to build better dialogue systems, closely fitting users' needs, it becomes imperative to dynamically expand the set of available intents. Besides better matching user expectations, this also enables the definition of new common intents that can slowly emerge from a pool of different users, a key aspect to build systems that are more resilient over time and can follow trends appropriately. Current supervised techniques unfortunately fall short in tackling this challenge, since they usually lack the capacity to discover novel intents: these models are trained on a finite set of classes and cannot generalize to real-world applications Larson et al. (2019). Here, we propose a system which combines dependency parsing Honnibal and Johnson (2015); Nivre and Nilsson (2005); ExplosionAI (2015), used to extract potential intents from a single utterance, with a zero-shot classification approach based on a Transformer Vaswani et al. (2017); Devlin et al. (2018) fine-tuned for NLI (Natural Language Inference) Xia et al. (2018); Yin et al. (2019), used to select the intent that best fits the utterance in a zero-shot setting. In the NLI fine-tuning we leverage Adapters Houlsby et al. (2019); Pfeiffer et al. (2020) to significantly reduce memory and time requirements while keeping the base model parameters frozen. Our approach is designed with a production setting in mind, e.g., customer care chatbots, where automatic intent discovery is fundamental to ensure a smooth interaction with users. In Figure 1 we show how the Z-BERT-A pipeline fits into an intent detection system when the classification models used for intent detection cannot detect, correctly or with sufficient confidence, an intent for the input utterance of the user currently querying the system.
There have been various efforts aiming at finding novel and unseen intents. A popular approach to the problem of determining whether an utterance fits the existing set of intents consists in casting the task as a binary classification problem, and then applying a zero-shot technique to actually determine the new intent Xia et al. (2018); Siddique et al. (2021); Yan et al. (2020). Liu et al. (2022) proposed an approach leveraging Transformer-based architectures, while the majority relied on RNN architectures like LSTMs Xia et al. (2018). For the problem we focus on here, i.e., finding novel intents, there have been two interesting attempts to propose a pipeline for generation and extraction Vedula et al. (2019); Liu et al. (2021). Liu et al. (2021) addressed new intent discovery as a clustering problem, proposing an adaptation of K-means clustering. Here, dependency parsing is used to extract from each cluster a mean ACTION-OBJECT pair representing the common emerging intent for that particular cluster. Recently, the increasing attention on Large Language Models (LLMs) that exhibit zero-shot generalization Chowdhery et al. (2022); Sanh et al. (2021); Wang and Komatsuzaki (2021) makes them interesting candidates for unknown intent detection. Indeed, a GPT-J Wang and Komatsuzaki (2021)-based approach has been proposed: leveraging a fine-tuned version of GPT-3 Brown et al. (2020), the authors were able to generate intents directly from the input utterance. While the solution is extremely elegant, it is not an ideal fit for many real-world use-cases where a model of this size (from 6 billion parameters up) is not always deployable in practice, e.g., in on-premise settings with various hardware constraints. Moreover, the results presented consider a few-shot setting Brown et al. (2020) without explicitly investigating the much more challenging zero-shot setting.
In intent discovery, we aim at extracting from a single utterance a set of potentially novel intents and at automatically determining the one that best fits the considered utterance. Intent discovery can be tackled as a Natural Language Inference (NLI) problem, where we rely on a language model to predict the entailment between the utterance (u) and a set of hypotheses h(I) based on the candidate intents, where h is a function used to extract hypotheses from the set of potential intents I. As previously shown by Xian et al. (2018), using NLI models for zero-shot classification represents an effective approach in problems where the set of candidate intents is known. In practice, the classification problem is cast as an inference task where a combination of a premise and a hypothesis is associated with one of three possible classes: entailment, neutral and contradiction. Yin et al. (2019) showed how this approach allows considering an input hypothesis based on an unseen class and generating, from the input premise-hypothesis pair, a probability distribution describing the entailment, hence a score that links the input text to the novel class. While this technique is extremely flexible, and in principle can handle any association between utterance and candidate intent, determining good candidates based on the analyzed utterance remains a major challenge.
Herein, we focus on building a pipeline able to handle unseen classes at inference time. In this context, we need to both generate a set of candidate intents from the considered utterance and classify the provided input against this new set of candidates. We chose to tackle the problem with a two-stage pipeline. In the first stage, we leverage a dependency parser Honnibal and Johnson (2015); Nivre and Nilsson (2005) to extract a set of potential intents by exploiting specific arc dependencies between the words in the utterance. In the second stage, we use the set of potential intents as candidate classes for the utterance intent classification problem, following a zero-shot approach Xian et al. (2018) based on NLI and relying on a BERT-based model Devlin et al. (2018). The model is tuned with Adapters Houlsby et al. (2019) for the NLI task (BERT-A), and is then prompted with premise-hypothesis pairs for zero-shot classification on the candidate intents, completing the Z-BERT-A pipeline. The full zero-shot pipeline is implemented using the Hugging Face pipeline API from the transformers library Wolf et al. (2019).
Being able to define a set of potential intents from the input utterance is key in intent discovery. To provide this set of potentially unseen candidates, we exploit the dependency parser from spaCy Honnibal and Johnson (2015); Honnibal et al. (2020), relying on the en_core_web_trf model from spaCy-transformers ExplosionAI (2019). To generate the set of potential novel intents, we parse the input sentence and extract pairs of words by searching for specific Arc-Relations (AR) in the dependency tree.
|PRONM||pronoun, all kinds of pronouns|
Since an intent is usually composed of an ACTION-OBJECT pair Vedula et al. (2019); Liu et al. (2021), we exploit this pattern and search for DOBJ, compound and AMOD arc relations. We perform a four-level detection, i.e., we look for the four main relations that can generate a base intent. Once these relations, or a subset of them, are found, we add the (VERB, NOUN) and (ADJ, PRONM) pairs with the most outgoing/incoming arcs. We refer to Table 1 for a complete definition of the AR and Part-of-Speech (POS) tags considered. The extracted potential intents are then lemmatized using NLTK Loper and Bird (2002), applying lemmatization to verb and noun independently. The lemmatized intents are then used as candidate classes for the zero-shot classifier based on our model.
Algorithm 1 details in pseudocode the pipeline for intent generation.
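As a minimal illustration of the selection logic described above (not the paper's actual implementation, which relies on spaCy's en_core_web_trf parser and NLTK's lemmatizer), the following sketch assumes the dependency parse is already available as (head, relation, child) triples with a POS tag per token, and extracts lemmatized ACTION-OBJECT candidates:

```python
# Hypothetical sketch of the intent-generation stage. The real pipeline uses
# spaCy's dependency parser; here a parse is given as pre-computed triples.
TARGET_RELATIONS = {"dobj", "compound", "amod"}

def lemmatize(word):
    # Stub lemmatizer; the paper uses NLTK's WordNet lemmatizer instead.
    return {"exchanging": "exchange", "currencies": "currency"}.get(word, word)

def candidate_intents(tokens, arcs):
    """tokens: {word: POS}; arcs: list of (head, relation, child) triples."""
    candidates = []
    for head, rel, child in arcs:
        if rel not in TARGET_RELATIONS:
            continue
        pos_pair = (tokens.get(head), tokens.get(child))
        # Keep (VERB, NOUN) and (ADJ, PRON) pairs, mirroring Table 1.
        if pos_pair in {("VERB", "NOUN"), ("ADJ", "PRON")}:
            candidates.append(f"{lemmatize(head)} {lemmatize(child)}")
    return candidates

tokens = {"exchanging": "VERB", "currencies": "NOUN", "I": "PRON"}
arcs = [("exchanging", "dobj", "currencies"), ("I", "nsubj", "exchanging")]
print(candidate_intents(tokens, arcs))  # ['exchange currency']
```

In the real pipeline each candidate is produced from the full dependency tree of the utterance, so several ACTION-OBJECT pairs can emerge from a single sentence.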
The generated potential intents are fed to the zero-shot BERT-based classifier implemented via NLI, which scores the entailment between the utterance, used as premise, and the hypothesis based on each intent. The intent from the pair with the highest score is selected as the best fit for the input utterance. The scores are computed using sentence embedding vectors Reimers and Gurevych (2019).
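The selection step can be sketched as follows. Note that entail_score here is a toy stand-in: in the actual pipeline the scoring is done by the Adapter-tuned BERT-A model served through the Hugging Face zero-shot-classification pipeline.

```python
# Sketch of the zero-shot selection step: build one NLI hypothesis per
# candidate intent and keep the intent whose premise-hypothesis pair gets
# the highest entailment score.
def entail_score(premise, hypothesis):
    # Toy lexical-overlap scorer standing in for the NLI model.
    p, h = set(premise.lower().split()), set(hypothesis.lower().split())
    return len(p & h) / len(h)

def classify(utterance, candidate_intents, template="This example is {}."):
    hypotheses = {i: template.format(i) for i in candidate_intents}
    scores = {i: entail_score(utterance, h) for i, h in hypotheses.items()}
    return max(scores, key=scores.get), scores

utterance = "how can I exchange currency at the best rate"
best, scores = classify(utterance, ["exchange currency", "block card"])
print(best)  # exchange currency
```

The hypothesis template is an assumption for illustration; any template mapping an intent to a natural-language hypothesis fits this scheme.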
|Utterance||Extracted key-phrase||WordNet Synset definition (SD)||Generated hypothesis|
|where do you support?||support||SD of "support"||this text is about SD|
We consider two datasets in our analysis: SNLI Bowman et al. (2015) and Banking77-OOS Casanueva et al. (2020); Zhang et al. (2021):
The SNLI corpus Bowman et al. (2015) is a collection of 570k human-written English sentence pairs manually labeled as entailment, contradiction, or neutral. It is used for natural language inference (NLI), also known as recognizing textual entailment (RTE). The dataset comes with a predefined split: 550,152 samples for training, 10,000 for validation and 10,000 for testing. Each sample is composed of a premise, a hypothesis and a label (entailment, contradiction, or neutral) indicating whether the premise entails the hypothesis.
Banking77-OOS Casanueva et al. (2020); Zhang et al. (2021) is an intent classification dataset composed of online banking queries annotated with their corresponding intents. It provides a very fine-grained set of intents in a single domain, comprising 13,083 customer service queries labeled with 77 intents. Of these 77 intents, Banking77-OOS includes 50 in-scope intents, while the ID-OOS queries are built from the remaining 27 held-out in-scope intents.
We also explore the effect of pretraining on an NLI adaptation of Banking77 Yin et al. (2019). To investigate the impact of pretraining on similar data, we extended the Banking77 dataset by casting the intent classification task as NLI. To achieve this, we consider the input utterance as the premise and extract, using KeyBERT Sharma and Li (2019) via self-attention, the most relevant word associated with it. The word is then used to generate an entailed hypothesis based on the corresponding synset definition from WordNet via NLTK Loper and Bird (2002); Miller (1995). Exemplar samples are reported in Table 2. For the hypotheses that are not entailed, we simply repeat the procedure with randomly sampled unrelated words. This process enabled us to use the training split of Banking77-OOS for adaptive fine-tuning of the NLI model component. We call the resulting dataset Banking77-OOS-NLI.
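The sample-construction procedure can be sketched as below. Key-phrase extraction (KeyBERT) and the WordNet lookup are stubbed with a tiny dictionary, so the definitions shown are illustrative assumptions, not actual WordNet glosses.

```python
# Hypothetical sketch of how Banking77-OOS-NLI samples could be built:
# premise = utterance, hypothesis = "this text is about " + synset
# definition (SD) of the extracted key-phrase.
import random

DEFINITIONS = {  # stand-in for WordNet synset definitions
    "support": "the activity of providing for or maintaining something",
    "card": "a rectangular piece of stiff material used for identification",
}

def make_nli_samples(utterance, keyphrase, rng):
    entailed = {
        "premise": utterance,
        "hypothesis": f"this text is about {DEFINITIONS[keyphrase]}",
        "label": "entailment",
    }
    # Non-entailed sample: repeat the procedure with a randomly
    # sampled unrelated word, as described above.
    unrelated = rng.choice([w for w in DEFINITIONS if w != keyphrase])
    not_entailed = {
        "premise": utterance,
        "hypothesis": f"this text is about {DEFINITIONS[unrelated]}",
        "label": "contradiction",
    }
    return [entailed, not_entailed]

samples = make_nli_samples("where do you support?", "support", random.Random(0))
print([s["label"] for s in samples])  # ['entailment', 'contradiction']
```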
We fine-tuned two versions of BERT-A (BERT-based transformer with Adapters). The first version is trained for NLI on the SNLI dataset; the second additionally considers the previously introduced Banking77-OOS-NLI. The training procedures optimize only the parameters of the added Adapter layers, minimizing training time and memory footprint. By freezing all the original layers and training the model only on the adaptive layers, we end up with 896,066 trainable parameters. All training runs relied on the AdamW Loshchilov and Hutter (2017) optimizer with a fixed learning rate and a warm-up scheduler. The models have been fine-tuned for a total of 6 epochs using early stopping.
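To give an intuition of why so few parameters are trainable, the following toy sketch (an illustrative assumption, not the AdapterHub implementation) shows a Houlsby-style bottleneck adapter and its parameter count for BERT-base-like dimensions; the bottleneck size m is hypothetical.

```python
import numpy as np

def adapter_forward(h, w_down, b_down, w_up, b_up):
    """Bottleneck adapter: down-project, non-linearity, up-project, residual."""
    z = np.maximum(0.0, h @ w_down + b_down)  # ReLU in place of GELU for brevity
    return h + z @ w_up + b_up

d, m = 768, 48  # hidden size and (hypothetical) bottleneck size
rng = np.random.default_rng(0)
params = {
    "w_down": rng.normal(size=(d, m)) * 0.01,
    "b_down": np.zeros(m),
    "w_up": rng.normal(size=(m, d)) * 0.01,
    "b_up": np.zeros(d),
}
h = rng.normal(size=(4, d))  # a batch of 4 token representations
out = adapter_forward(h, **params)
n_trainable = sum(p.size for p in params.values())
print(out.shape, n_trainable)  # only these adapter weights would be updated
```

With one such adapter per transformer block plus the classification head, the trainable parameter count stays orders of magnitude below the frozen base model's.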
|Z-BERT-A (SNLI)||0.478 ± 0.003||0.5||-|
|Z-BERT-A (Banking77-OOS-NLI)||0.492 ± 0.004||0.546 ± 0.011||-|
|Prompt name||Prompt text|
|prompt 1||Considering this utterance: [utterance]. What is the intent that best describes it?|
|prompt 4||[utterance] Choose the most suitable intent based on the above utterance. Options: [potential intents]|
In order to evaluate the Z-BERT-A pipeline, we first analyze the accuracy of the BERT-A component on an NLI task using accuracy, precision and recall. Afterwards, we compare its results on the zero-shot classification task with other available models on the same Banking77 split using accuracy. In this initial evaluation, the intents are known. The baselines considered in this setting are: BART0 Lin et al. (2022), a multitask model with 406 million parameters based on Bart-large Lewis et al. (2019) and prompt training; and two flavours of Zero-Shot DNN (ZS-DNN) Kumar et al. (2017), with both the Universal Sentence Encoder (USE) Cer et al. (2018) and SBERT Reimers and Gurevych (2019) as encoders.
In the unknown intent case, we compare the Z-BERT-A pipeline against a set of zero-shot baselines based on various pretrained transformers. As a baseline we include bart-large-mnli Yin et al. (2019), as it has shown interesting performance in zero-shot sequence classification. We use this model as an alternative to our classification method while maintaining our dependency parsing strategy for intent generation. As hypothesis for the NLI-based classification, we used the phrase: "This example is ". Furthermore, given that very large LMs have demonstrated remarkable zero-shot capabilities in a plethora of tasks, we added two of the most recent ones, namely T0 Sanh et al. (2021) and GPT-J Wang and Komatsuzaki (2021), to the baseline list. In such models, the provided template prompt defines the task of interest. We examined whether they can perform end-to-end intent extraction (intent generation and classification) in a completely unsupervised, zero-shot setting, or classify intents already generated by our dependency parsing method. In the former case, the given input is just the utterance of interest, while in the latter case it includes both the utterance and the possible intents. In both cases, the generated output is considered the extracted intent.
Since in this setting the generated intents cannot be matched exactly against the held-out ones, we measure performance using a semantic similarity metric based on the cosine similarity between the sentence embeddings of the ground-truth intents and the respective generated ones Vedula et al. (2019). To set a decision boundary, we rely on a threshold based on distributional properties of the computed similarities. The threshold is defined in Equation 3 as t = μ + α·σ, where μ and σ are the mean and standard deviation of the computed similarities, and α is an arbitrary parameter controlling the variance impact, which we set to 0.5 in our study.
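A small sketch of this evaluation step, assuming Equation 3 takes the form mean plus α times the standard deviation of the computed similarities (a reading inferred from the surrounding text):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def decision_threshold(similarities, alpha=0.5):
    # Threshold from distributional properties of the similarities:
    # mean plus alpha times the (population) standard deviation.
    sims = np.asarray(similarities, dtype=float)
    return float(sims.mean() + alpha * sims.std())

# Toy similarity values between generated and ground-truth intents.
sims = [0.91, 0.84, 0.42, 0.77, 0.63]
t = decision_threshold(sims, alpha=0.5)
accepted = [s for s in sims if s >= t]
print(round(t, 3), accepted)
```

Only generated intents whose similarity clears the threshold are counted as matching their ground-truth counterpart.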
This full pipeline evaluation has been repeated five times to evaluate the stability of the results.
Firstly, we evaluate the performance of BERT-A on the NLI task, see Table 3. The accuracy, precision and recall achieved confirm the quality of the pretrained model and highlight the impact of fine-tuning on Banking77-OOS-NLI.
Table 4 shows how the BERT-A component improves results over the majority of the baselines considered in terms of accuracy in the known intent scenario. Remarkably, the BERT-A version fine-tuned on Banking77-OOS-NLI outperforms all the considered baselines.
Finally, we evaluate Z-BERT-A on the unknown intent discovery task. Table 5 reports the performance of BERT-A fine-tuned on SNLI and on Banking77-OOS-NLI in comparison with a selection of zero-shot baselines for intent discovery. Both flavours of Z-BERT-A outperform the considered baselines by a consistent margin.
Table 6 reports the prompts used for GPT-J and T0 inference. For prompts 1 and 2, we let the models generate the intent without providing a set of possible options. Prompts 3 and 4 instead contain the candidate intents extracted using the first stage of the Z-BERT-A pipeline. Figure 3 reports the average cosine similarity between the generated intents for each of the ground-truth intents.
It is interesting to observe how semantically similar the generated intents are to their ground-truth counterparts.
|Ground-truth intent||Intent from Z-BERT-A|
To appreciate the quality of the generated intents, in Table 7 we report some examples of unseen ground-truth intents and the corresponding Z-BERT-A predictions.
Conclusions and Future Work
We proposed Z-BERT-A, a pipeline for zero-shot prediction of unseen intents from utterances. We performed a two-fold evaluation. First, we showed how our BERT-based model fine-tuned with Adapters on NLI outperforms a selection of baselines on the prediction of known intents in a zero-shot setting. Second, we evaluated the full pipeline by comparing its performance with the results obtained by prompting large language models in an unknown intent setting. Our results prove that Z-BERT-A represents an effective option to extend intent classification systems to handle unseen intents, a key aspect for modern dialogue systems for triage. Moreover, by using a relatively lightweight base model and relying on adaptive fine-tuning, the proposed solution can be deployed in limited-resource scenarios, e.g., on-premise solutions or small cloud instances. The main limitation of Z-BERT-A currently lies in the new intent generation stage, which relies heavily on the quality of the dependency parsing. An interesting avenue for future work consists in relying on zero-shot learning approaches in the intent generation phase as well Liu et al. (2021), without compromising on model size and inference requirements. Z-BERT-A is available at the following link: https://github.com/GT4SD/zberta.
- Dstc11-track2-intent-induction. GitHub. https://github.com/amazon-research/dstc11-track2-intent-induction
- A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP).
- Language models are few-shot learners. CoRR abs/2005.14165.
- Efficient intent detection with dual sentence encoders. CoRR abs/2003.04807.
- Universal sentence encoder. arXiv.
- PaLM: scaling language modeling with pathways. arXiv.
- BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- SpaCy. GitHub. https://github.com/explosion/spaCy
- SpaCy-transformers. GitHub. https://github.com/explosion/spacy-transformers
- An improved non-monotonic transition system for dependency parsing. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, pp. 1373–1378.
- spaCy: industrial-strength natural language processing in Python.
- Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning, pp. 2790–2799.
- Zero-shot learning across heterogeneous overlapping domains. In INTERSPEECH, pp. 2914–2918.
- An evaluation dataset for intent classification and out-of-scope prediction. arXiv.
- BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv.
- Unsupervised cross-task generalization via retrieval augmentation. arXiv.
- A simple meta-learning paradigm for zero-shot intent classification with mixture attention mechanism. arXiv preprint arXiv:2206.02179.
- Open intent discovery through unsupervised semantic clustering and dependency parsing. arXiv preprint arXiv:2104.12114.
- NLTK: the natural language toolkit. arXiv.
- Decoupled weight decay regularization. arXiv.
- WordNet: a lexical database for English. Communications of the ACM 38 (11), pp. 39–41.
- Recent advances in deep learning based dialogue systems: a systematic survey. arXiv preprint arXiv:2105.04387.
- Pseudo-projective dependency parsing. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), Ann Arbor, Michigan, pp. 99–106.
- AdapterHub: a framework for adapting transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 46–54.
- Benchmarking commercial intent detection services with practice-driven evaluations. arXiv.
- Natural language processing for industry. Informatik-Spektrum 41 (2), pp. 105–112.
- Sentence-BERT: sentence embeddings using siamese BERT-networks. arXiv preprint arXiv:1908.10084.
- Multitask prompted training enables zero-shot task generalization. arXiv preprint arXiv:2110.08207.
- Self-supervised contextual keyword and keyphrase retrieval with self-labelling. Preprints.org.
- Generalized zero-shot intent detection via commonsense knowledge. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1925–1929.
- Attention is all you need. Advances in Neural Information Processing Systems 30.
- Towards open intent discovery for conversational text. arXiv.
- GPT-J-6B: a 6 billion parameter autoregressive language model. https://github.com/kingoflolz/mesh-transformer-jax
- HuggingFace's Transformers: state-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.
- Zero-shot user intent detection via capsule neural networks. arXiv preprint arXiv:1809.00385.
- Zero-shot learning—a comprehensive evaluation of the good, the bad and the ugly. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (9), pp. 2251–2265.
- Unknown intent detection using Gaussian mixture model with an application to zero-shot intent classification. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 1050–1060.
- Benchmarking zero-shot text classification: datasets, evaluation and entailment approach. arXiv preprint arXiv:1909.00161.
- Are pretrained transformers robust in intent classification? A missing ingredient in evaluation of out-of-scope intent detection. CoRR abs/2106.04564.