. It is an essential component of dialogue systems and is therefore widely used in real-world applications, such as personal assistants and customer service. In these systems, ID models usually classify a user utterance into an intent class. For example, an ID model should be able to recognize the intent of “booking a flight” from the utterance “I am flying to Chicago next Wednesday”.
Existing ID models usually adopt an offline learning paradigm, which performs once-and-for-all training on a fixed dataset. This paradigm can only handle a fixed number of user intents. However, online dialogue systems typically need to handle continually emerging new user intents, which makes previous ID models impractical in real-world applications. Recently, lifelong learning has received increasing attention and is considered to be the most promising approach to address this problem (Ring, 1995; Thrun, 1998). Therefore, to handle continually emerging new intents, we propose the Lifelong Intent Detection (LID) task, which introduces lifelong learning into the ID task. As shown in Fig 1, the LID task continually trains an ID model using only new data to learn newly emerging intents. At any time, the updated ID model should be able to perform accurate classifications for all classes observed so far. In this task, it is infeasible to retrain the ID model from scratch every time new data becomes available due to storage budgets and computational costs (Cao et al., 2020).
A plain lifelong learning method is to fine-tune a model pre-trained on old data directly on new data. However, this method faces a serious challenge, namely catastrophic forgetting, where models fine-tuned on new data usually suffer from a significant performance degradation on old data (McCloskey and Cohen, 1989; French, 1999). To address this issue, the current mainstream lifelong learning methods either identify and retain parameters that are important to the old data (Kirkpatrick et al., 2017; Aljundi et al., 2018), or maintain a memory to reserve a small number of old training samples, known as replay-based methods (Rebuffi et al., 2017; Wang et al., 2019). At each step, replay-based methods combine the reserved old data with the new data to retrain the model. Owing to their simplicity and effectiveness, replay-based methods have become an excellent solution for lifelong learning in natural language processing scenarios (Han et al., 2020; Cao et al., 2020).
However, when adapting existing replay-based methods to lifelong intent detection, we find that these methods suffer from a data imbalance problem. Specifically, at each step of the lifelong learning process, there is generally a large amount of new class data, yet only a small amount of old data is reserved, leading to a significant imbalance between old and new data. Under such circumstances, the training process is significantly biased towards new classes, which causes a series of negative effects in the ID model, as shown in Figure 2: (1) Magnitude Imbalance: the magnitude of the feature vectors and class embeddings of new classes is significantly larger than that of old classes; (2) Knowledge Deviation: the knowledge of the previous model, i.e., the feature distribution and the probability distribution of old classes, is not well preserved; (3) Class Confusion: the class embeddings of new classes and those of old classes are very close to each other in the high-dimensional vector space. These adverse effects severely mislead the ID model, causing it to favor new classes while catastrophically forgetting old classes.
Our work is inspired by lifelong learning in image classification tasks (Hou et al., 2019; Castro et al., 2018; Tao et al., 2020), which also targets the data imbalance problem. In this paper, we find multiple adverse effects caused by the imbalance problem in the LID task and propose corresponding solutions.
To address the problem of data imbalance, we propose a novel lifelong learning framework, namely Multi-Strategy Rebalancing (MSR), which aims to learn a balanced ID model. Specifically, MSR contains three components to alleviate the above three adverse effects: (1) Cosine Normalization, which balances the magnitude of feature vectors and class embeddings between old and new classes by constraining these vectors to a high-dimensional sphere, eliminating the bias caused by the difference in magnitude; (2) Hierarchical Knowledge Distillation, which preserves the knowledge of the previous model at the feature level and the prediction level to retain the feature distribution and the probability distribution of old classes; (3) Inter-Class Margin Loss, which provides a large margin to separate the new class embeddings from the old class embeddings. With multi-strategy rebalancing, the ID model can effectively handle the adverse effects caused by data imbalance. To systematically compare different lifelong learning methods, we construct four benchmarks for the LID task based on four widely used ID datasets (Hemphill et al., 1990; Coucke et al., 2018; Liu et al., 2019; Larson et al., 2019). Experimental results show that our proposed framework significantly outperforms previous state-of-the-art lifelong learning methods on these benchmarks.
In summary, the contributions of this work are as follows:
To the best of our knowledge, we are the first to propose the Lifelong Intent Detection task, and we construct four benchmarks from four widely used ID datasets: ATIS, SNIPS, HWU64, and CLINC150.
We propose the Multi-Strategy Rebalancing framework, which can effectively handle the data imbalance problem in the LID task through cosine normalization, hierarchical knowledge distillation, and inter-class margin loss.
Experimental results show that our method outperforms previous lifelong learning methods and achieves state-of-the-art performance. The source code and benchmarks will be released for further research (http://anonymous).
2. Task Formulation
Intent detection is usually formulated as a multi-class classification task, which predicts an intent class for a given user utterance (Hemphill et al., 1990; Coucke et al., 2018; Zhang et al., 2019; E et al., 2019). In real-world applications, online systems inevitably face continually emerging new user intents. Therefore, we propose the Lifelong Intent Detection task, which continually trains the ID model on emerging data to learn new classes. In this task, the data arrives as a sequence. Each data in the sequence has its own label set, i.e., one or more intent classes, and its own training, validation, and testing sets. At each step, the lifelong learning framework trains the ID model on the new training set to learn the new classes in the corresponding label set. The LID task requires that the ID model perform well on all observed classes. Therefore, after training on the data of the current step, the updated ID model is evaluated on all observed testing sets (i.e., the union of the testing sets seen so far) and classifies each sample uniformly over all known classes (i.e., the union of the label sets seen so far).
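The protocol described above can be sketched as a small loop. This is a minimal illustration only: `run_lid`, `train_step`, and `predict` are hypothetical names introduced here, and the model is treated as a black box.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Sample = Tuple[str, str]  # (utterance, intent label)


def run_lid(
    steps: Sequence[Dict[str, List[Sample]]],
    train_step: Callable[[List[Sample]], None],
    predict: Callable[[str, List[str]], str],
) -> List[float]:
    """Train step by step and evaluate on all testing data observed so far."""
    known_classes: List[str] = []
    seen_test: List[Sample] = []
    accuracies: List[float] = []
    for step in steps:
        # Only the new training set is available at this step.
        train_step(step["train"])
        for _, label in step["train"]:
            if label not in known_classes:
                known_classes.append(label)
        # Evaluation covers every testing set observed so far,
        # and predictions range over all known classes.
        seen_test.extend(step["test"])
        correct = sum(predict(u, known_classes) == y for u, y in seen_test)
        accuracies.append(correct / len(seen_test))
    return accuracies
```

The returned list holds one accuracy per step, each measured on the union of the testing sets seen up to that step.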
In this work, we propose Multi-Strategy Rebalancing to handle the data imbalance problem in the LID task. In this section, we will first show a typical replay-based method, iCaRL (Rebuffi et al., 2017), as the background. Next, we deeply analyze the data imbalance problem and introduce the proposed solutions, which are shown in Figure 3.
A typical ID model contains two components: an encoder and multiple class embeddings. The encoder can be a recurrent neural network or a pre-trained model (Cho et al., 2014; Devlin et al., 2019). We adopt the current best encoder, BERT (Devlin et al., 2019), as our encoder. BERT is a multi-layer Transformer (Vaswani et al., 2017) that is pre-trained on large-scale unlabeled corpora. It encodes each sample into a sentence-level feature vector, i.e., the hidden state of the “[CLS]” token. Then, the ID model calculates the dot product similarity between the feature vector and each class embedding and applies softmax to obtain the class probabilities. The ID model is trained with the standard cross-entropy loss, computed over the set of all observed classes between the one-hot ground-truth label and the class probability obtained by softmax.
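As a minimal sketch of the classification head and loss described above, the following pure-Python code scores a feature vector against class embeddings with a dot product, applies softmax, and computes the cross-entropy for a one-hot target. The function names are illustrative and the feature vector stands in for the encoder output; a real implementation would use a tensor library.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def class_probabilities(
    feature: Sequence[float], class_embeddings: Sequence[Sequence[float]]
) -> List[float]:
    """Softmax over dot-product similarities with every observed class."""
    return softmax([dot(feature, e) for e in class_embeddings])


def cross_entropy(probs: Sequence[float], gold_index: int) -> float:
    """With a one-hot target, only the gold class contributes to the loss."""
    return -math.log(probs[gold_index])
```

Because the score is an unnormalized dot product, an embedding with a larger norm receives a larger score for the same direction, which is the sensitivity that the magnitude imbalance exploits.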
To overcome catastrophic forgetting of old data, iCaRL (Rebuffi et al., 2017) maintains a bounded memory to store a few representative old samples, which aims to introduce important information about the data distribution of previous classes into the training process. The memory consists of one set of reserved samples per observed class. After training on the new data, iCaRL selects the most representative samples for each class in this data through a class prototype (Snell et al., 2017), which is calculated by averaging the feature vectors of all training samples of that class. Based on the distance between the feature vector of each training sample and the prototype, iCaRL sorts the training samples of each class and stores the nearest samples as exemplars, where the per-class quota is the memory size divided by the number of all observed classes. To allocate space for the current classes, iCaRL shrinks the exemplar set of each old class to the new quota, removing the samples that are farthest from the prototype according to the sorted list. In this way, the most representative samples are reserved in the memory.
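The prototype-based selection described above can be sketched as follows. This is an illustrative simplification, not iCaRL's reference implementation: the helper names are hypothetical, the memory stores per-class sample indices, and Euclidean distance is assumed as the distance measure.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def prototype(features: List[Vector]) -> List[float]:
    """Class prototype: the mean of the class's feature vectors."""
    n = len(features)
    return [sum(col) / n for col in zip(*features)]


def select_exemplars(features: List[Vector], quota: int) -> List[int]:
    """Indices of the `quota` samples nearest to the prototype, nearest first."""
    proto = prototype(features)
    order = sorted(
        range(len(features)), key=lambda i: math.dist(features[i], proto)
    )
    return order[:quota]


def update_memory(
    memory: Dict[str, List[int]],
    new_features: Dict[str, List[Vector]],
    memory_size: int,
) -> Dict[str, List[int]]:
    """Shrink old exemplar lists and add exemplars for the new classes."""
    quota = memory_size // (len(memory) + len(new_features))
    # Old lists are sorted nearest-first, so truncating drops the farthest samples.
    updated = {label: kept[:quota] for label, kept in memory.items()}
    for label, feats in new_features.items():
        updated[label] = select_exemplars(feats, quota)
    return updated
```

Keeping each exemplar list sorted nearest-first means that shrinking an old class never requires recomputing its prototype.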
In addition, iCaRL combines the cross-entropy loss with a knowledge distillation (KD) loss (Hinton et al., 2015) to retrain the model. The distillation loss enables the model at the current step to learn the probability distribution of the model trained in the last step. It is computed from the soft labels (i.e., the results before the softmax layer) predicted by the last model and the current model for old classes, both softened by a temperature scalar, which is used to increase the weight of small probability values. The KD loss is an effective way to alleviate catastrophic forgetting by learning the soft labels of the last model.
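A minimal sketch of a temperature-softened distillation loss in the style of Hinton et al. (2015) is given below. The function names and the default temperature of 2.0 are assumptions for illustration; the value used in this work is not reproduced here.

```python
import math
from typing import List, Sequence


def softmax_with_temperature(
    logits: Sequence[float], temperature: float
) -> List[float]:
    """Softmax over logits divided by a temperature (higher = flatter)."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(
    old_logits: Sequence[float],
    new_logits: Sequence[float],
    temperature: float = 2.0,  # hypothetical default
) -> float:
    """Cross-entropy between the softened distributions of the last model
    (teacher) and the current model (student) over the old classes."""
    teacher = softmax_with_temperature(old_logits, temperature)
    student = softmax_with_temperature(new_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher, student))
```

The loss is smallest when the current model reproduces the last model's distribution over old classes, and a higher temperature gives low-probability classes more influence on it.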
However, at each step, the new data is usually much larger than the reserved old data, leading to a serious data imbalance problem. This makes previous methods tend to predict new classes and catastrophically forget old classes.
3.2. Multi-Strategy Rebalancing
In this work, we address the data imbalance problem from multiple aspects by incorporating three components: cosine normalization, hierarchical knowledge distillation, and inter-class margin loss.
3.2.1. Cosine Normalization
We find that the magnitude of both feature vectors and class embeddings of new classes is significantly larger than that of old classes, which may make the current model tend to predict new classes. To solve this problem, we replace the original dot product similarity with cosine normalization: the class probability is obtained by a softmax over the cosine similarity between the feature vector and each class embedding, multiplied by a scaling hyper-parameter. The scaling hyper-parameter controls the peak of the softmax distribution, since the cosine similarity only ranges between -1 and 1. Geometrically, we constrain these vectors to a high-dimensional sphere to eliminate the bias caused by the imbalanced magnitudes.
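The effect of cosine normalization can be illustrated with a short sketch. The function names and the default scale of 10.0 are assumptions for illustration, and all vectors are assumed to be non-zero.

```python
import math
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Project a non-zero vector onto the unit sphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(l2_normalize(u), l2_normalize(v)))


def cosine_probabilities(
    feature: Sequence[float],
    class_embeddings: Sequence[Sequence[float]],
    scale: float = 10.0,  # hypothetical value of the scaling hyper-parameter
) -> List[float]:
    """Softmax over scaled cosine similarities instead of dot products."""
    scores = [scale * cosine(feature, e) for e in class_embeddings]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

Multiplying a class embedding by any positive constant leaves these probabilities unchanged, whereas it would inflate that class's score under a plain dot product.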
3.2.2. Hierarchical Knowledge Distillation
The knowledge (i.e., the feature distribution and the probability distribution) of the model trained on new data usually deviates heavily from that of the model trained on old data. It makes the model forget the important information of old classes. We propose hierarchical knowledge distillation to preserve the previous knowledge from two levels.
In the Feature-Level KD, we preserve the geometric structure of the feature vector of the current model by reducing the angle between it and the feature vector extracted by the last model. This loss encourages the features extracted by the current model to stay close to the features extracted by the last model on the high-dimensional sphere. Besides, we fix the old class embeddings to preserve their spatial structure.
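One common way to penalize the angle between two feature vectors is one minus their cosine similarity. The sketch below assumes this form for illustration; the function names are hypothetical and the vectors are assumed to be non-zero.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def feature_kd_loss(
    old_feature: Sequence[float], new_feature: Sequence[float]
) -> float:
    """Assumed form: zero when both features point in the same direction,
    growing as the angle between them widens (at most 2)."""
    return 1.0 - cosine(old_feature, new_feature)
```

Under this form the loss depends only on direction, so it constrains where a feature lies on the sphere without constraining its magnitude.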
In the Prediction-Level KD, we encourage the current model to preserve the probability distribution of the last model through the same knowledge distillation loss that iCaRL uses, which learns the soft labels predicted by the last model.
3.2.3. Inter-Class Margin Loss
Another negative effect of the imbalance problem is class confusion, i.e., new and old class embeddings are usually mixed in the high-dimensional space. This is because a large number of new training samples are likely to activate neighboring samples with different labels (Hou et al., 2019; Tao et al., 2020). To solve this problem, we introduce an inter-class margin loss to separate these class embeddings. The loss uses a margin hyper-parameter and expects the angle between a new class embedding and an old class embedding to be greater than this margin. Through this loss, these embeddings can be uniformly distributed on the high-dimensional sphere without confusion.
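The sketch below shows one plausible hinge form of such a margin loss, written directly on angles. It is an assumption made for illustration rather than the exact formulation used in this work: the function names are hypothetical, the margin is taken to be an angle in radians, and pairs are formed between new and old class embeddings.

```python
import math
from typing import Sequence


def angle(u: Sequence[float], v: Sequence[float]) -> float:
    """Angle in radians between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, dot / norm)))


def inter_class_margin_loss(
    new_embeddings: Sequence[Sequence[float]],
    old_embeddings: Sequence[Sequence[float]],
    margin: float,
) -> float:
    """Penalize every (new, old) pair whose angle is below the margin."""
    loss = 0.0
    for new in new_embeddings:
        for old in old_embeddings:
            loss += max(0.0, margin - angle(new, old))
    return loss
```

Pairs that are already separated by more than the margin contribute nothing, so the loss only acts on embeddings that are confusably close.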
At each step of LID, our MSR framework combines the above losses to train the ID model on the new data and the reserved old data. The overall loss is a weighted combination of the cross-entropy loss, the feature-level KD loss, the prediction-level KD loss, and the inter-class margin loss, where three weighting hyper-parameters balance the performance between old and new classes. The cross-entropy loss and the two KD losses are calculated for both the new data and the reserved old data. The inter-class margin loss is calculated for all new class embeddings.
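In notation assumed here purely for illustration (the subscripts ce, fkd, pkd, and mg and the weights \lambda_1 to \lambda_3 are not taken from the original), and under the further assumption that the cross-entropy term carries unit weight, the combination described above can be sketched as:

```latex
\mathcal{L} \;=\; \mathcal{L}_{ce}
  \;+\; \lambda_{1}\,\mathcal{L}_{fkd}
  \;+\; \lambda_{2}\,\mathcal{L}_{pkd}
  \;+\; \lambda_{3}\,\mathcal{L}_{mg}
```

Here \mathcal{L}_{ce}, \mathcal{L}_{fkd}, and \mathcal{L}_{pkd} would be averaged over the mixed batch of new and reserved old samples, while \mathcal{L}_{mg} depends only on the class embeddings.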
4.1. Lifelong Intent Detection Benchmarks
Since we are the first to propose the LID task, we construct four benchmarks from four widely used datasets: ATIS (Hemphill et al., 1990), SNIPS (Coucke et al., 2018), HWU64 (Liu et al., 2019), and CLINC150 (Larson et al., 2019). For each ID dataset, we arrange its classes in a fixed random order, and each class has its own data. In a class-incremental manner, the lifelong learning methods continually train an ID model on one or multiple new classes. To provide a comprehensive evaluation, we set different numbers of new classes per step in different benchmarks: 1, 1, 5, and 15 for the ATIS, SNIPS, HWU64, and CLINC150 benchmarks, respectively. Since the class data in ATIS and HWU64 has a long-tail distribution, we use only the data of the 10 and 50 most frequent classes, respectively. The statistics of the four benchmarks are shown in Appendix A.
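The construction procedure above can be sketched in a few lines. The helper names and the seed value are hypothetical, and the sketch only covers ordering and grouping the classes, not loading the datasets.

```python
import random
from collections import Counter
from typing import List, Sequence


def most_frequent_classes(labels: Sequence[str], k: int) -> List[str]:
    """Keep the k classes with the most samples (for long-tail datasets)."""
    return [label for label, _ in Counter(labels).most_common(k)]


def build_steps(
    classes: Sequence[str], classes_per_step: int, seed: int = 0
) -> List[List[str]]:
    """Arrange classes in a fixed random order and group them into steps."""
    order = list(classes)
    random.Random(seed).shuffle(order)  # fixed seed gives a fixed order
    return [
        order[i : i + classes_per_step]
        for i in range(0, len(order), classes_per_step)
    ]
```

For example, grouping 150 classes with 15 new classes per step yields 10 steps, and reusing the same seed reproduces the same class order for every compared method.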
4.2. Implementation Details
At each step of the LID task, we report the accuracy on the testing data of all observed classes. After the last step, we report Average Acc, which is the average of the accuracies over all steps, and Whole Acc, which is the accuracy on the whole testing data of all classes. We use the BERT implementation from HuggingFace’s Transformers library. All hyper-parameters are obtained by a grid search on the validation set. The learning rate is and the batch size is 64. The hyper-parameters , , ,, and are , , , , and . in our method. The memory size is 200.
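The two summary metrics can be computed as in the following sketch, where the function names are illustrative. Whole Acc is simply the accuracy on the full testing data after the last step, so it uses the same `accuracy` helper.

```python
from typing import List, Sequence


def accuracy(predictions: Sequence[str], labels: Sequence[str]) -> float:
    """Fraction of samples whose predicted intent matches the gold intent."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def average_acc(step_accuracies: List[float]) -> float:
    """Average Acc: mean of the per-step accuracies over all steps."""
    return sum(step_accuracies) / len(step_accuracies)
```

Average Acc rewards methods that stay accurate throughout the whole process, while Whole Acc only reflects the final model.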
In this work, we propose a model-agnostic lifelong learning method to handle the LID task. Therefore, we adopt other model-agnostic lifelong learning methods that achieve state-of-the-art performance on other tasks as our baselines. EWC (Kirkpatrick et al., 2017) adopts a quadratic regularization loss to slow down the update of important parameters. LwF (Li and Hoiem, 2017) uses knowledge distillation to learn the soft labels of the last model. EMR (Wang et al., 2019) randomly stores some old samples. iCaRL (Rebuffi et al., 2017) combines knowledge distillation and prototype-based sample selection. EEIL (Castro et al., 2018) handles the data imbalance problem by resampling a balanced subset. EMAR (Han et al., 2020) uses K-Means to select samples and consolidates the model with old prototypes. FineTune directly fine-tunes the pre-trained model on new data. UpperBound uses the training data of all observed classes to train the model, which is regarded as the upper bound.
4.4. Main Results
Figure 4 shows the accuracy at each step of the whole lifelong learning process. We also list Average Acc and Whole Acc after the last step in Appendix B. From the results, we can see that: (1) Our MSR achieves state-of-the-art performance, significantly outperforming the baselines by 2.27%, 1.68%, 3.16%, and 3.57% whole accuracy on the ATIS, SNIPS, HWU64, and CLINC150 benchmarks, respectively. These baselines either ignore the data imbalance problem or handle it with a simple resampling approach, which leads to catastrophic forgetting. (2) Compared to EMAR, our method saves computation time because our method is more refined. (3) There is still a gap between our method and the upper bound, which indicates that some challenges remain to be addressed.
4.5. Ablation Study
In this section, we perform ablation studies on the three proposed components. The results are shown in Appendix C. Removing any component degrades performance. This shows that our method alleviates catastrophic forgetting through multi-strategy rebalancing, which addresses multiple adverse effects caused by the data imbalance problem.
In this paper, we propose the lifelong intent detection task to handle continually emerging user intents. In addition, we propose multi-strategy rebalancing to address multiple adverse effects caused by the data imbalance problem. Experimental results on four constructed benchmarks demonstrate the effectiveness of our method.
- Memory aware synapses: learning what (not) to forget. In Proceedings of the ECCV, pp. 139–154.
- Incremental event detection via knowledge consolidation networks. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, pp. 707–717.
- End-to-end incremental learning. In Computer Vision - ECCV 2018, Vol. 11216, pp. 241–257.
- Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of EMNLP, pp. 1724–1734.
- Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces. CoRR abs/1805.10190.
- BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the NAACL-HLT, pp. 4171–4186.
- A novel bi-directional interrelated model for joint intent detection and slot filling. In Proceedings of the 57th ACL, pp. 5467–5471.
- Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences 3 (4), pp. 128–135.
- Continual relation learning via episodic memory activation and reconsolidation. In Proceedings of the 58th ACL, pp. 6429–6440.
- The ATIS spoken language systems pilot corpus. In Speech and Natural Language: Proceedings of a Workshop.
- Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
- Learning a unified classifier incrementally via rebalancing. In IEEE Conference on CVPR, pp. 831–839.
- Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences 114 (13), pp. 3521–3526.
- An evaluation dataset for intent classification and out-of-scope prediction. In Proceedings of the 2019 EMNLP-IJCNLP, pp. 1311–1316.
- Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (12), pp. 2935–2947.
- Benchmarking natural language understanding services for building conversational agents. In Increasing Naturalness and Flexibility in Spoken Dialogue Interaction - 10th IWSDS, Vol. 714, pp. 165–183.
- Catastrophic interference in connectionist networks: the sequential learning problem. In Psychology of Learning and Motivation, Vol. 24, pp. 109–165.
- iCaRL: incremental classifier and representation learning. In Proceedings of the IEEE Conference on CVPR, pp. 2001–2010.
- Continual learning in reinforcement environments. Ph.D. Thesis, University of Texas at Austin, TX, USA.
- Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pp. 4077–4087.
- Few-shot class-incremental learning. In IEEE/CVF Conference on CVPR, pp. 12180–12189.
- Lifelong learning algorithms. In Learning to Learn, pp. 181–209.
- Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
- Sentence embedding alignment for lifelong relation extraction. In Proceedings of the 2019 Conference of the NAACL-HLT, pp. 796–806.
- Unknown intent detection using Gaussian mixture model with an application to zero-shot intent classification. In Proceedings of the 58th ACL, pp. 1050–1060.
- Joint slot filling and intent detection via capsule neural networks. In Proceedings of the 57th ACL, pp. 5259–5267.
Appendix A Statistics of benchmarks
In this section, we show the statistics of the four constructed benchmarks in Table 1.
Appendix B Results on the four benchmarks
In this section, we list the results after the last step in Table 2. The average accuracy of all steps and the whole accuracy of the whole testing data are shown in different columns. In both metrics, our method MSR significantly outperforms the baselines and achieves state-of-the-art performance on the four benchmarks. It implies that our method is effective in handling the LID task via multi-strategy rebalancing.
|Model|Average Acc|Whole Acc|Average Acc|Whole Acc|Average Acc|Whole Acc|Average Acc|Whole Acc|
|- CN and HKD|97.79|95.23|96.29|91.43|58.97|47.78|87.34|72.23|
Appendix C Ablation Study
Our method consists of three components: cosine normalization, hierarchical knowledge distillation, and inter-class margin loss. We show the ablation studies of the three components in Table 3. For “- CN”, we replace cosine normalization with the dot product similarity. For “- FKD”, we remove the feature-level knowledge distillation. For “- PKD”, we remove the prediction-level knowledge distillation. For “- HKD”, we remove the whole hierarchical knowledge distillation. For “- ICML”, we remove the inter-class margin loss. For “- CN and HKD”, we remove both cosine normalization and hierarchical knowledge distillation. The model without multi-strategy rebalancing (“- MSR”, i.e., the model EMR) is shown in the last row. All of these variants perform worse than the full model, which indicates that using these strategies together is very effective.