DIET: Lightweight Language Understanding for Dialogue Systems

04/21/2020 ∙ by Tanja Bunk, et al. ∙ Rasa

Large-scale pre-trained language models have shown impressive results on language understanding benchmarks like GLUE and SuperGLUE, improving considerably over other pre-training methods like distributed representations (GloVe) and purely supervised approaches. We introduce the Dual Intent and Entity Transformer (DIET) architecture, and study the effectiveness of different pre-trained representations on intent and entity prediction, two common dialogue language understanding tasks. DIET advances the state of the art on a complex multi-domain NLU dataset and achieves similarly high performance on other simpler datasets. Surprisingly, we show that there is no clear benefit to using large pre-trained models for this task, and in fact DIET improves upon the current state of the art even in a purely supervised setup without any pre-trained embeddings. Our best performing model outperforms fine-tuning BERT and is about six times faster to train.




1 Introduction

Footnote: authors contributed equally.

Two common approaches to data-driven dialogue modeling are the end-to-end and the modular systems. Modular approaches like POMDP-based dialogue policies (williams2007partially) and Hybrid Code Networks (williams2017hybrid) use separate natural language understanding (NLU) and generation (NLG) systems. The dialogue policy itself receives the output from the NLU system and chooses the next system action, before the NLG system generates a corresponding response. In the end-to-end approach user input is directly fed into the dialogue policy to predict the next system utterance. Recently these two approaches have been combined in Fusion Networks (mehri2019structured).

In the context of dialogue systems, natural language understanding typically refers to two subtasks: intent classification and entity recognition.

goo2018slot argue that modeling these sub-tasks separately can suffer from error propagation and hence a single multi-task architecture should benefit from mutual enhancement between two tasks.

Recent work has shown that large pre-trained language models yield the best performance on challenging language understanding benchmarks (see section 2). However, the computational cost of both pre-training and fine-tuning such models is considerable (strubell2019energy).

Dialogue systems are not only developed by researchers, but by many thousands of software developers worldwide. Facebook’s Messenger platform alone supports hundreds of thousands of third party conversational assistants (khari). For these applications it is desirable that models can be trained and iterated upon quickly to fit into a typical software development workflow. Furthermore, since many of these assistants operate in languages other than English, it is important to understand what performance can be achieved without large-scale pre-training.

In this paper, we propose DIET (Dual Intent and Entity Transformer), a new multi-task architecture for intent classification and entity recognition. One key feature is the ability to incorporate pre-trained word embeddings from language models and combine these with sparse word and character level n-gram features in a plug-and-play fashion. Our experiments demonstrate that even without pre-trained embeddings, using only sparse word and character level n-gram features, DIET improves upon the current state of the art on a complex NLU dataset. Moreover, adding pre-trained word and sentence embeddings from language models further improves the overall accuracy on all tasks. Our best performing model significantly outperforms fine-tuning BERT and is six times faster to train. Documented code to reproduce these experiments is available online at

2 Related Work

2.1 Transfer learning of dense representations

Top performing models (mtdnn2019; ernie2019) on language understanding benchmarks like GLUE (glue) and SuperGLUE (superglue2019) benefit from using dense representations of words and sentences from large pre-trained language models like ELMo (elmo), BERT (devlin2018bert) and GPT (gpt). Since these embeddings are trained on large scale natural language text corpora, they generalize well across tasks and can be transferred as input features to other language understanding tasks with or without fine-tuning (elmo; videobert; patentbert; docbert; commonsensebert). Different fine-tuning strategies have also been proposed for effective transfer learning across tasks (howardruder; howtotunebert). However, ruder2019tuning show that fine-tuning a large pre-trained language model like BERT may not be optimal for every downstream task. Moreover, these large scale language models are slow and expensive to train, and hence not ideal for real-world conversational AI applications (henderson2019convert). To achieve a more compact model, henderson2019convert pre-train a word and sentence level encoder on a large scale conversational corpus from Reddit (redditdata). The resultant sentence level dense representations, when transferred (without fine-tuning) to a downstream task of intent classification, perform much better than embeddings from BERT and ELMo. We further investigate this behaviour for the task of joint intent classification and entity recognition. We also study the impact of using sparse representations like word level one-hot encodings and character level n-grams along with dense representations transferred from large pre-trained language models.

2.2 Joint Intent Classification and Named Entity Recognition

In recent years a number of approaches have been studied for training intent classification and named entity recognition (NER) in a multi-task setup. One proposed joint architecture is composed of a Bidirectional Gated Recurrent Unit (BiGRU): the hidden state of each time step is used for entity tagging and the hidden state of the last time step is used for intent classification. liu2016attention, jointvarghese and goo2018slot propose an attention-based Bidirectional Long Short Term Memory (BiLSTM) for joint intent classification and NER. haihong2019anovel introduce a co-attention network on top of individual intent and entity attention units for mutual information sharing between the tasks. Chen2019 propose Joint BERT, which is built on top of BERT and is trained in an end-to-end fashion. They use the hidden state of the first special token [CLS] for intent classification. The entity labels are predicted using the final hidden states of the other tokens. A hierarchical bottom-up architecture was proposed by vanzo2019hermit, composed of BiLSTM units to capture shallower representations of semantic frames (baker97). They predict dialogue acts, intents and entity labels from representations learnt by individual layers stacked in a bottom-up fashion. In this work, we adopt a similar transformer-based multi-task setup for DIET and also perform an ablation study to observe its effectiveness compared to a single task setup.

3 DIET Architecture

Figure 1: A schematic representation of the DIET architecture. The phrase “play ping pong” has the intent play_game and entity game_name with value “ping pong”. Weights of the feed-forward layers are shared across tokens.

A schematic representation of our architecture is illustrated in Figure 1. DIET consists of several key parts.


Featurization

Input sentences are treated as a sequence of tokens, which can be either words or subwords depending on the featurization pipeline. Following devlin2018bert, we add a special classification token __CLS__ to the end of each sentence. Each input token is featurized with what we call sparse features and/or dense features. Sparse features are token level one-hot encodings and multi-hot encodings of character n-grams. Character n-grams contain a lot of redundant information, so to avoid overfitting we apply dropout to these sparse features. Dense features can be any pre-trained word embeddings: ConveRT (henderson2019convert), BERT (devlin2018bert) or GloVe (pennington2014glove). Since ConveRT is also trained as a sentence encoder, when using ConveRT we set the initial embedding for the __CLS__ token to the sentence encoding of the input sentence obtained from ConveRT. (Footnote 1: Sentence embeddings from ConveRT are 1024-dimensional and word embeddings are 512-dimensional. To overcome this dimension mismatch, we tile the word embeddings across an extra 512 dimensions to obtain 1024-dimensional word embeddings. This keeps the neural architecture the same for different pre-trained embeddings.) This adds extra contextual information for the complete sentence in addition to information from individual word embeddings. For out-of-the-box pre-trained BERT, we set the __CLS__ embedding to the corresponding output embedding of the BERT [CLS] token and, for GloVe, to the mean of the embeddings of the tokens in a sentence. Sparse features are passed through a fully connected layer with shared weights across all sequence steps to match the dimension of the dense features. The output of the fully connected layer is concatenated with the dense features from pre-trained models.
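The sparse featurization described above can be illustrated with a minimal sketch in plain Python. The function names, the vocabulary construction and the maximum n-gram length `MAX_N = 3` are assumptions made for this example only; they are not taken from the DIET implementation, and dropout and the fully connected layer are left out.

```python
from typing import Dict, List


def char_ngrams(token: str, max_n: int) -> List[str]:
    """Return all character n-grams of length 1..max_n for one token."""
    return [
        token[i:i + n]
        for n in range(1, max_n + 1)
        for i in range(len(token) - n + 1)
    ]


def build_vocab(items: List[str]) -> Dict[str, int]:
    """Map each distinct item to a column index, in first-seen order."""
    vocab: Dict[str, int] = {}
    for item in items:
        vocab.setdefault(item, len(vocab))
    return vocab


def sparse_features(tokens: List[str], token_vocab: Dict[str, int],
                    ngram_vocab: Dict[str, int], max_n: int) -> List[List[int]]:
    """One row per token: token one-hot followed by character n-gram multi-hot."""
    rows = []
    for token in tokens:
        one_hot = [0] * len(token_vocab)
        if token in token_vocab:
            one_hot[token_vocab[token]] = 1
        multi_hot = [0] * len(ngram_vocab)
        for gram in char_ngrams(token, max_n):
            if gram in ngram_vocab:
                multi_hot[ngram_vocab[gram]] = 1
        rows.append(one_hot + multi_hot)
    return rows


MAX_N = 3  # assumed value, chosen only for this illustration
tokens = ["play", "ping", "pong", "__CLS__"]
token_vocab = build_vocab(tokens)
ngram_vocab = build_vocab([g for t in tokens for g in char_ngrams(t, MAX_N)])
features = sparse_features(tokens, token_vocab, ngram_vocab, MAX_N)
```

Each row of `features` has exactly one active entry in its token block and several active entries in its n-gram block, which is why the n-gram part carries redundant information.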


Transformer

To encode context across the complete sentence, we use a 2-layer transformer (vaswani2017attention) with relative position attention (shaw2018self). The transformer architecture requires its input to be the same dimension as the transformer layers. Therefore, the concatenated features are passed through another fully connected layer with shared weights across all sequence steps to match the dimension of the transformer layers.

Named entity recognition

A sequence of entity labels $y_{entity}$ is predicted through a Conditional Random Field (CRF) (lafferty2001conditional) tagging layer on top of the transformer output sequence $a$ corresponding to an input sequence of tokens.

$$L_E = L_{CRF}(a, y_{entity})$$

where $L_{CRF}(\cdot)$ denotes negative log-likelihood for a CRF (crfloss).
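To make the CRF negative log-likelihood concrete, here is a minimal pure-Python sketch for a linear-chain CRF. It assumes per-step emission scores (standing in for scores derived from the transformer outputs) and a tag-to-tag transition matrix, and it omits start and end transitions for brevity; all names and numbers are illustrative and not taken from the DIET code.

```python
import math
from typing import List


def log_sum_exp(values: List[float]) -> float:
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def crf_nll(emissions: List[List[float]],
            transitions: List[List[float]],
            tags: List[int]) -> float:
    """Negative log-likelihood of a tag sequence under a linear-chain CRF.

    emissions[t][k]   : score of tag k at step t
    transitions[j][k] : score of moving from tag j to tag k
    """
    num_tags = len(transitions)

    # Unnormalized score of the gold tag path.
    gold = emissions[0][tags[0]]
    for t in range(1, len(tags)):
        gold += transitions[tags[t - 1]][tags[t]] + emissions[t][tags[t]]

    # Forward algorithm: log of the summed scores of all possible paths.
    alpha = list(emissions[0])
    for t in range(1, len(emissions)):
        alpha = [
            log_sum_exp([alpha[j] + transitions[j][k] for j in range(num_tags)])
            + emissions[t][k]
            for k in range(num_tags)
        ]
    log_partition = log_sum_exp(alpha)

    return log_partition - gold


# Toy scores for a 3-token sentence and 2 tags (made up for the example).
emissions = [[1.0, 0.2], [0.3, 0.8], [0.5, 0.5]]
transitions = [[0.4, -0.1], [-0.3, 0.6]]
nll = crf_nll(emissions, transitions, [0, 1, 1])
```

Because the partition term sums over every tag sequence, the probabilities `exp(-crf_nll(...))` of all possible sequences add up to one, which is a convenient sanity check.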

Intent classification

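The transformer output for the __CLS__ token, $a_{CLS}$, and the intent labels $y_{intent}$ are embedded into a single semantic vector space, $h_{CLS} = E(a_{CLS})$ and $h_{intent} = E(y_{intent})$, so that both embeddings have the same dimension. We use the dot-product loss (wu2017starspace; henderson2019training; vlasov2019dialogue) to maximize the similarity $S_I^+ = h_{CLS}^T h_{intent}^+$ with the target label $y_{intent}^+$ and minimize the similarities $S_I^- = h_{CLS}^T h_{intent}^-$ with negative samples $y_{intent}^-$.

$$L_I = -\left\langle S_I^+ - \log\left( e^{S_I^+} + \sum_{\Omega_I^-} e^{S_I^-} \right) \right\rangle$$

where the sum is taken over the set of negative samples $\Omega_I^-$ and the average $\langle \cdot \rangle$ is taken over all examples. At inference time, the dot-product similarity serves as a ranker over all possible intent labels.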
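The following sketch shows one way such a dot-product loss and the inference-time ranking could be computed. It assumes the loss is the negative log of a softmax over the target similarity and the negative-sample similarities, averaged over examples; the helper names and the toy embedding values are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def dot(u: List[float], v: List[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def dot_product_loss(h_cls: List[float], h_pos: List[float],
                     h_negs: List[List[float]]) -> float:
    """Loss for one example: -(S+ - log(exp(S+) + sum over negatives of exp(S-)))."""
    s_pos = dot(h_cls, h_pos)
    scores = [s_pos] + [dot(h_cls, h) for h in h_negs]
    m = max(scores)  # subtract the maximum for numerical stability
    log_denominator = m + math.log(sum(math.exp(s - m) for s in scores))
    return -(s_pos - log_denominator)


def batch_loss(batch: List[Tuple[List[float], List[float], List[List[float]]]]) -> float:
    """Average the per-example loss over all examples in the batch."""
    return sum(dot_product_loss(*example) for example in batch) / len(batch)


def rank_intents(h_cls: List[float],
                 label_embeddings: Dict[str, List[float]]) -> List[Tuple[str, float]]:
    """At inference time, rank all intent labels by dot-product similarity."""
    scored = [(label, dot(h_cls, h)) for label, h in label_embeddings.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 2-dimensional embeddings, made up for the example.
h_cls = [1.0, 0.0]
loss = dot_product_loss(h_cls, [2.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
ranking = rank_intents(h_cls, {
    "play_game": [2.0, 0.0],
    "calendar_set_event": [0.0, 1.0],
})
```

The loss is always non-negative and approaches zero as the target similarity grows relative to the similarities of the negative samples.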


Masking

Inspired by the masked language modelling task (taylor1953cloze; devlin2018bert), we add an additional training objective to predict randomly masked input tokens. We select a random subset of the input tokens in a sequence. For each selected token, a fixed share of cases substitutes the input with the vector corresponding to the special mask token __MASK__, a further share substitutes it with the vector corresponding to a random token, and in the remaining cases the original input is kept. The output of the transformer for each selected token, $a_{MASK}$, is fed through a dot-product loss (wu2017starspace; henderson2019training; vlasov2019dialogue) similar to the intent loss.

$$L_M = -\left\langle S_M^+ - \log\left( e^{S_M^+} + \sum_{\Omega_M^-} e^{S_M^-} \right) \right\rangle$$

where $S_M^+ = h_{MASK}^T h_{token}^+$ is the similarity with the target label $y_{token}^+$ and $S_M^- = h_{MASK}^T h_{token}^-$ are the similarities with negative samples $y_{token}^-$; $h_{MASK} = E(a_{MASK})$ and $h_{token} = E(y_{token})$ are the corresponding embedding vectors; the sum is taken over the set of negative samples $\Omega_M^-$ and the average $\langle \cdot \rangle$ is taken over all examples.
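The token selection and substitution procedure can be sketched as follows. For readability the sketch operates on token strings rather than input vectors, and the selection probability and substitution shares are exposed as parameters; the values passed in the usage lines are illustrative assumptions, as are all function and variable names.

```python
import random
from typing import List, Tuple

MASK = "__MASK__"


def mask_tokens(tokens: List[str], vocab: List[str], select_prob: float,
                mask_share: float, random_share: float,
                rng: random.Random) -> Tuple[List[str], List[int]]:
    """Return the corrupted token sequence and the positions selected for prediction.

    Each token is selected with probability select_prob. A selected token is
    replaced by the mask token with probability mask_share, by a random
    vocabulary token with probability random_share, and kept otherwise.
    """
    corrupted = list(tokens)
    selected = []
    for i in range(len(tokens)):
        if rng.random() >= select_prob:
            continue  # token not selected, no prediction target here
        selected.append(i)
        draw = rng.random()
        if draw < mask_share:
            corrupted[i] = MASK
        elif draw < mask_share + random_share:
            corrupted[i] = rng.choice(vocab)
        # otherwise the original input is kept but still predicted
    return corrupted, selected


rng = random.Random(0)
sentence = ["play", "ping", "pong"]
vocabulary = ["play", "ping", "pong", "music", "alarm"]
# Assumed proportions for illustration only.
corrupted, targets = mask_tokens(sentence, vocabulary, 0.15, 0.7, 0.1, rng)
```

Only the positions in `targets` would contribute to the mask loss; keeping some selected tokens unchanged means the model cannot rely on the mask token alone to know where a prediction is required.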

We hypothesize that adding a training objective for reconstructing masked input should act as a regularizer as well as help the model learn more general features from text and not only discriminative features for classification (class-reconstruct2019).

Total loss

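We train the model in multi-task fashion by minimizing the total loss $L_{total}$, the sum of the intent loss $L_I$, the entity loss $L_E$ and the mask loss $L_M$.

$$L_{total} = L_I + L_E + L_M$$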
The architecture can be configured to turn off any of the losses in the sum above.
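A minimal sketch of how the individual losses might be switched on and off when forming the total loss; the function and flag names are hypothetical and do not correspond to actual DIET configuration options.

```python
def total_loss(intent_loss: float, entity_loss: float, mask_loss: float,
               use_intent: bool = True, use_entity: bool = True,
               use_mask: bool = True) -> float:
    """Sum only the losses that are enabled in the configuration."""
    total = 0.0
    if use_intent:
        total += intent_loss
    if use_entity:
        total += entity_loss
    if use_mask:
        total += mask_loss
    return total


# Joint training without the mask loss (loss values are made up).
joint_no_mask = total_loss(0.8, 1.2, 0.5, use_mask=False)
# Single-task intent classification.
intent_only = total_loss(0.8, 1.2, 0.5, use_entity=False, use_mask=False)
```

Disabling a term in this way corresponds to the single-task and no-mask-loss configurations compared in the experiments.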


Batching

We use a balanced batching strategy (vlasov2019dialogue) to mitigate class imbalance (japkowicz2002class), as some intents can be more frequent than others. We also increase our batch size throughout training as another source of regularization (smith2017don).
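As a rough illustration of these two ideas, the sketch below draws batches in which every intent is equally likely to be sampled, and grows the batch size over the course of training. Both the uniform-over-intents sampling and the linear growth schedule are assumptions made for this example, not the exact strategy of vlasov2019dialogue; the start and end sizes of 64 and 128 follow the experimental setup in Section 4.2.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Example = Tuple[str, str]  # (utterance text, intent label)


def batch_size_for_epoch(epoch: int, num_epochs: int,
                         start: int = 64, end: int = 128) -> int:
    """Linearly increase the batch size from start to end (assumed schedule)."""
    if num_epochs <= 1:
        return end
    fraction = epoch / (num_epochs - 1)
    return int(round(start + fraction * (end - start)))


def balanced_batch(examples: List[Example], batch_size: int,
                   rng: random.Random) -> List[Example]:
    """Sample a batch by first picking an intent uniformly, then one of its examples."""
    by_intent: Dict[str, List[Example]] = defaultdict(list)
    for example in examples:
        by_intent[example[1]].append(example)
    intents = sorted(by_intent)
    return [rng.choice(by_intent[rng.choice(intents)]) for _ in range(batch_size)]


# Toy data: one frequent and one rare intent.
data = [("play ping pong", "play_game")] * 9 + [("set an alarm", "alarm_set")]
rng = random.Random(0)
batch = balanced_batch(data, batch_size_for_epoch(0, 200), rng)
```

With uniform sampling over intents, the rare intent appears in roughly half of the batch positions even though it accounts for only a tenth of the toy data.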

4 Experimental Evaluation

In this section we first describe the datasets used in our experiments, then we describe the experimental setup, followed by an ablation study to understand the effectiveness of each component of the architecture.

4.1 Datasets

We used three datasets for our evaluation: NLU-Benchmark, ATIS, and SNIPS. The focus of our experiments is the NLU-Benchmark dataset, since it is the most challenging of the three. The state of the art on ATIS and SNIPS is already close to 100% test set accuracy, see Table 5.

NLU-Benchmark dataset

The NLU-Benchmark dataset (liu2019benchmarking), available online, is annotated with scenarios, actions, and entities. For example, “schedule a call with Lisa on Monday morning” is annotated with the scenario calendar, the action set_event, and the entities [event_name: a call with Lisa] and [date: Monday morning]. The intent label is obtained by concatenating the scenario and action labels (e.g. calendar_set_event). The dataset has 25,716 utterances which cover multiple home assistant tasks, such as playing music or calendar queries, chit-chat, and commands issued to a robot. The data is split into 10 folds. Each fold has its own train and test set of respectively 9960 and 1076 utterances. (Footnote 3: Some utterances appear in multiple folds.) Overall 64 intents and 54 entity types are present.


ATIS

ATIS (hemphill1990atis) is a well-studied dataset in the field of NLU. It comprises annotated transcripts of audio recordings of people making flight reservations. We used the same data split as Chen2019, originally proposed by goo2018slot and available online. The training, development, and test sets contain 4,478, 500 and 893 utterances, respectively. The training dataset has 21 intents and 79 entities.


SNIPS

The SNIPS dataset was collected from the Snips personal voice assistant (coucke2018snips). It contains 13,784 training and 700 test examples. For fair comparison, we used the same data split as Chen2019 and goo2018slot. 700 examples from the training set are used as a development set. The data can be found online. The SNIPS dataset contains 7 intents and 39 entities.

4.2 Experimental Setup

Our model is implemented in TensorFlow. We used the first fold of the NLU-Benchmark dataset to select hyperparameters. We randomly took 250 utterances from the training set as a development set for that purpose. We trained our models over 200 epochs on a machine with 4 CPUs, 15 GB of memory and one NVIDIA Tesla K80. We used Adam (Adam2014) for optimization with an initial learning rate of 0.001. The batch size increased incrementally from 64 to 128 (smith2017don). Training our model on the first fold of the NLU-Benchmark dataset takes around one hour. At inference time we need around 80ms to process one utterance.

4.3 Experiments on NLU-Benchmark dataset

The NLU-Benchmark dataset contains 10 folds, each with a separate train and test set. To obtain the overall performance of our model on this dataset we followed the approach of vanzo2019hermit: train 10 models independently, one for each fold and take the average as the final score. Micro-averaged precision, recall and F1 score are used as metrics. True positives, false positives, and false negatives for intent labels are calculated as in any other multi-class classification task. An entity counts as true positive if there is an overlap between the predicted and the gold span and their labels match.
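The overlap-based entity scoring can be sketched as follows. Entities are represented as (start, end, label) token spans with an exclusive end index; the greedy one-to-one matching of predicted to gold spans is an assumption of this example, since the text does not specify how multiple overlaps are resolved. Setting `exact=True` switches to exact span matching for comparison.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end exclusive, label)


def overlaps(a: Span, b: Span) -> bool:
    """True if the two spans share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def count_matches(predicted: List[Span], gold: List[Span],
                  exact: bool = False) -> Tuple[int, int, int]:
    """Return (true positives, false positives, false negatives) for one utterance."""
    matched_gold = set()
    tp = 0
    for p in predicted:
        for gi, g in enumerate(gold):
            if gi in matched_gold or p[2] != g[2]:
                continue
            hit = (p[0], p[1]) == (g[0], g[1]) if exact else overlaps(p, g)
            if hit:
                matched_gold.add(gi)
                tp += 1
                break
    return tp, len(predicted) - tp, len(gold) - tp


def micro_prf(examples: List[Tuple[List[Span], List[Span]]],
              exact: bool = False) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1: pool counts over all utterances."""
    tp = fp = fn = 0
    for predicted, gold in examples:
        t, p, n = count_matches(predicted, gold, exact)
        tp, fp, fn = tp + t, fp + p, fn + n
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


examples = [
    ([(0, 2, "game_name")], [(1, 3, "game_name")]),  # partial overlap, same label
    ([(0, 1, "date")], []),                          # spurious prediction
]
overlap_scores = micro_prf(examples)
exact_scores = micro_prf(examples, exact=True)
```

In the toy data, the partially overlapping prediction counts as a true positive under the overlap criterion but as both a false positive and a false negative under exact matching.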

| Model | Metric | Intent | Entities |
|---|---|---|---|
| HERMIT | F1 | 87.55 ± 0.63 | 84.74 ± 1.18 |
| HERMIT | R | 87.70 ± 0.64 | 82.04 ± 2.12 |
| HERMIT | P | 87.41 ± 0.63 | 87.65 ± 0.98 |
| DIET (sparse + ConveRT) | F1 | 89.89 ± 0.43 | 87.38 ± 0.64 |
| DIET (sparse + ConveRT) | R | 89.89 ± 0.43 | 87.15 ± 0.97 |
| DIET (sparse + ConveRT) | P | 89.89 ± 0.43 | 87.62 ± 0.94 |

Table 1: Results from HERMIT (vanzo2019hermit) and from our best performing configuration of DIET on the NLU-Benchmark dataset. Our best performing model uses word and character level sparse features and combines them with embeddings from ConveRT. The model does not use a mask loss.

Table 1 shows the results of our best performing model on the NLU-Benchmark dataset. Our best performing model uses sparse features, i.e. one-hot encodings at the token level and multi-hot encodings of character n-grams. These sparse features are combined with dense embeddings from ConveRT (henderson2019convert). Our best performing model does not use a mask loss (described in Section 3). We outperform HERMIT on intents by over 2% absolute. Our micro-averaged F1 score on entities (87.38%) is also higher than that of HERMIT (84.74%). HERMIT reports a similar precision value on entities; however, our recall value is much higher (87.15% compared to 82.04%).

4.4 Ablation Study on NLU-Benchmark dataset

We used the NLU-Benchmark dataset to evaluate different components of our model architecture, as it covers multiple domains and has the largest number of intents and entities of the three datasets.

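| Setup | Metric | Intent | Entities |
|---|---|---|---|
| single-task: intent classification | F1 | 90.90 ± 0.19 | - |
| single-task: intent classification | R | 90.90 ± 0.19 | - |
| single-task: intent classification | P | 90.90 ± 0.19 | - |
| single-task: entity recognition | F1 | - | 82.57 ± 1.41 |
| single-task: entity recognition | R | - | 81.85 ± 1.87 |
| single-task: entity recognition | P | - | 83.32 ± 1.51 |

Table 2: Training DIET on just a single task, i.e. intent classification or entity recognition, on the NLU-Benchmark dataset.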

Importance of joint training

In order to evaluate whether the two tasks, i.e. intent classification and named entity recognition, benefit from being optimized jointly, we trained models for each of the tasks individually. Table 2 lists the results of training just a single task with DIET. The results show that the performance of intent classification decreases by about one point when trained jointly with entity recognition (90.90% vs 89.89%). It should be noted that the best performing configuration for single task training for intent classification corresponds to using embeddings from ConveRT with no transformer layers. (Footnote 5: This result is in line with the results reported in casanueva2020efficient.) However, the micro-averaged F1 score of entities drops from 87.38% to 82.57% when entities are trained separately. Inspecting the NLU-Benchmark dataset, this is likely due to a strong correlation between particular intents and the presence of specific entities. For example, almost every utterance that belongs to the play_game intent has an entity called game_name. Also, the entity game_name only occurs together with the intent play_game. We believe that this result further brings out the importance of having a modular and configurable architecture like DIET in order to handle the trade-off in performance across both tasks.

| sparse | dense | mask loss | Intent | Entities |
|---|---|---|---|---|
|  |  |  | 87.10 ± 0.75 | 83.88 ± 0.98 |
|  |  |  | 88.19 ± 0.84 | 85.12 ± 0.85 |
|  | GloVe |  | 89.20 ± 0.90 | 84.34 ± 1.03 |
|  | GloVe |  | 89.38 ± 0.71 | 84.89 ± 0.91 |
|  | GloVe |  | 88.78 ± 0.70 | 85.06 ± 0.84 |
|  | GloVe |  | 89.13 ± 0.77 | 86.04 ± 1.09 |
|  | BERT |  | 87.44 ± 0.92 | 84.20 ± 0.91 |
|  | BERT |  | 88.46 ± 0.88 | 85.26 ± 1.01 |
|  | BERT |  | 86.92 ± 1.09 | 83.96 ± 1.33 |
|  | BERT |  | 87.45 ± 0.67 | 84.64 ± 1.31 |
|  | ConveRT |  | 89.76 ± 0.98 | 86.06 ± 1.38 |
|  | ConveRT |  | 89.89 ± 0.43 | 87.38 ± 0.64 |
|  | ConveRT |  | 90.15 ± 0.68 | 85.76 ± 0.80 |
|  | ConveRT |  | 89.47 ± 0.74 | 86.04 ± 1.29 |

Table 3: Comparison of different featurization and architecture components on NLU-Benchmark dataset. The three columns on the left indicate whether sparse features are used or not, what kind of dense features are used, if any, and whether the model was trained with a mask loss or not. The reported numbers are micro-averaged F1 scores.

Importance of different featurization components and masking

As described in Section 3, embeddings from different pre-trained language models can be used as dense features. We trained multiple variants to study the effectiveness of each: only sparse features, i.e. one-hot encodings at the token level and multi-hot encodings of character n-grams, and combinations of those together with ConveRT, BERT, or GloVe. Additionally, we trained each combination with and without the mask loss. The results presented in Table 3 show F1 scores for both intent classification and entity recognition and support several observations. DIET performance is competitive when using sparse features together with the mask loss, without any pre-trained embeddings. Adding a mask loss improves performance by around 1% absolute on both intents and entities. DIET with GloVe embeddings is equally competitive and is further enhanced on both intents and entities when used in combination with sparse features and mask loss. Interestingly, using contextual BERT embeddings as dense features performs worse than GloVe. We hypothesize that this is because BERT is pre-trained primarily on prose and hence requires fine-tuning before being transferred to a dialogue task. The performance of DIET with ConveRT embeddings supports this, since ConveRT was trained specifically on conversational data. ConveRT embeddings without sparse features achieve the best F1 score on intent classification, and with the addition of sparse features DIET outperforms the state of the art on both intent classification and entity recognition by a considerable margin of around 3% absolute. Adding a mask loss seems to slightly hurt the performance when used with BERT and ConveRT as dense features.

| Model | Metric | Intent | Entities |
|---|---|---|---|
| Fine-tuned BERT | F1 | 89.67 ± 0.48 | 85.73 ± 0.91 |
| Fine-tuned BERT | R | 89.67 ± 0.48 | 84.71 ± 1.28 |
| Fine-tuned BERT | P | 89.67 ± 0.48 | 86.78 ± 1.02 |
| DIET (sparse + ConveRT) | F1 | 89.89 ± 0.43 | 87.38 ± 0.64 |
| DIET (sparse + ConveRT) | R | 89.89 ± 0.43 | 87.15 ± 0.97 |
| DIET (sparse + ConveRT) | P | 89.89 ± 0.43 | 87.62 ± 0.94 |

Table 4: Comparison of best performing feature set for DIET against fine-tunable BERT inside DIET on the NLU-Benchmark dataset. The best performing feature set for DIET contains sparse features combined with embeddings from ConveRT (not fine-tuned) without a mask loss. Fine-tuning BERT with DIET takes 60 hours as compared to just 10 hours for DIET with sparse and ConveRT features.

Comparison with fine-tuned BERT

Following ruder2019tuning, we evaluate the effectiveness of incorporating BERT inside the featurization pipeline of DIET and fine-tuning the entire model. Table 4 shows that DIET with frozen ConveRT embeddings as dense features and word- and character-level sparse features outperforms fine-tuned BERT on entity recognition while performing on par for intent classification. This result is especially important because fine-tuning BERT inside DIET on all 10 folds of the NLU-Benchmark dataset takes 60 hours, compared to 10 hours for DIET with embeddings from ConveRT and sparse features.

4.5 Experiments on ATIS and SNIPS

| Model | ATIS Intent | ATIS Entities | SNIPS Intent | SNIPS Entities |
|---|---|---|---|---|
| Joint BERT | 97.90 | 96.10 | 98.60 | 97.00 |
| sparse + | 96.59 | 95.08 | 98.03 | 94.79 |
| sparse + | 96.31 | 94.99 | 97.50 | 94.84 |
|  | 96.61 | 95.37 | 97.71 | 95.10 |

Table 5: Results of Joint BERT (Chen2019) and different feature sets for DIET on the ATIS and SNIPS datasets. Reported numbers are accuracy for intents and micro-averaged F1 score for entities. Row markers indicate that the data was annotated using the BILOU tagging schema and that no mask loss was used.

In order to compare our results to the results presented in Chen2019, we use the same evaluation method as Chen2019 and goo2018slot. They report the accuracy for intent classification and micro-averaged F1 score for entity recognition. Again, true positives, false positives, and false negatives for intent labels are obtained as in any other multi-class classification task. However, an entity only counts as a true positive if the predicted span exactly matches the gold span and their labels match, a stricter definition than that of vanzo2019hermit. All experiments on ATIS and SNIPS were run 5 times. We take the average over the results from those runs as final numbers.

To understand how transferable the hyperparameters of DIET are, we took the best performing model configurations of DIET on the NLU-Benchmark dataset and evaluated them on ATIS and SNIPS. The intent classification accuracy and named entity recognition F1 score on the ATIS and SNIPS datasets are listed in Table 5.

Due to the stricter evaluation method we tagged our data using the BILOU tagging schema (ramshaw1995text). The use of the BILOU tagging schema is marked in Table 5.
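For readers unfamiliar with BILOU, the sketch below converts token-level entity spans into BILOU tags (Begin, Inside, Last, Outside, Unit). The span representation with an exclusive end index and the function name are choices made for this example; the utterance is the annotated example from the NLU-Benchmark description.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (first token index, end index exclusive, label)


def bilou_tags(num_tokens: int, spans: List[Span]) -> List[str]:
    """Convert non-overlapping entity spans to one BILOU tag per token."""
    tags = ["O"] * num_tokens
    for start, end, label in spans:
        if end - start == 1:
            tags[start] = f"U-{label}"  # single-token entity
        else:
            tags[start] = f"B-{label}"
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"
            tags[end - 1] = f"L-{label}"
    return tags


tokens = "schedule a call with Lisa on Monday morning".split()
spans = [(1, 5, "event_name"), (6, 8, "date")]
tags = bilou_tags(len(tokens), spans)
# tags == ["O", "B-event_name", "I-event_name", "I-event_name",
#          "L-event_name", "O", "B-date", "L-date"]
```

Marking the last token of every entity explicitly gives the tagger a direct signal about span boundaries, which matters when only exact span matches count as correct.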

Remarkably, using only sparse features and no pre-trained embeddings whatsoever, DIET achieves performance within 1-2% of the Joint BERT model. Using the hyperparameters from the best performing model on the NLU-Benchmark dataset, DIET achieves results competitive with Joint BERT on both ATIS and SNIPS.

5 Conclusion

We introduced DIET, a flexible architecture for intent and entity modeling. We studied its performance on multiple datasets, and showed that DIET advances the state of the art on the challenging NLU-Benchmark dataset. Furthermore we extensively study the effectiveness of using embeddings from various pre-training methods. We find that there is no single set of embeddings which is always best across different datasets, highlighting the importance of a modular architecture. Furthermore we show that word embeddings from distributional models like GloVe are competitive with embeddings from large-scale language models, and that in fact without using any pre-trained embeddings, DIET can still achieve competitive performance, outperforming state of the art on NLU-Benchmark. Finally, we also show that the best set of pre-trained embeddings for DIET on NLU-Benchmark outperforms fine-tuning BERT inside DIET and is six times faster to train.