Spoken language understanding (SLU) is a key component of task-oriented dialogue systems, which assist users in completing tasks such as booking flight tickets. SLU parses user utterances into semantic frames, including intents, slots, and user dialogue acts. The semantic frame for a restaurant reservation query is shown in Figure 1. Both intents and user dialogue acts represent the user's intentions, but they may be defined at different granularities. In this work, we model both intent and user act classification when both are available in the dialogue corpora.
Previous research in SLU has largely focused on single-turn SLU, that is, understanding the current user utterance. However, completing a task usually requires multiple turns of back-and-forth conversation between the user and the system. Multi-turn SLU poses challenges different from single-turn SLU: for example, entities introduced earlier in a conversation may be referred to later by the user and the system, and information mentioned earlier may be skipped later, causing ambiguities, as shown in Figure 1. Incorporating contextual information has been shown useful for multi-turn SLU [2, 3, 4, 5, 6, 7]. Information from previous intra-session utterances was explored by applying SVM-HMMs to sequence tagging for SLU. Contextual information was incorporated into the recurrent neural network (RNN) structure [3, 6]. Chen et al. proposed a memory network based approach for multi-turn SLU, encoding history utterances and leveraging the memory embeddings through attention. Bapna et al. enhanced the memory network architecture by adding a BiRNN session encoder that temporally combines the current utterance encoding and the memory vectors. Su et al. investigated different time-decay attention mechanisms. Gupta et al. proposed an approach that encodes system dialogue acts for SLU, substituting for the use of system utterances. In addition, various models have been proposed for jointly modeling intent and slot predictions and have achieved significant performance improvements over models that make these predictions independently [11, 12, 13, 14, 15, 16]. In this work, we also follow the joint learning paradigm.
However, the lack of human-labeled data for SLU results in poor generalization capability. A variety of transfer learning (TL) techniques have been proposed to address the data sparsity challenge. One category of TL approaches trains general purpose language representation models on a large amount of unlabeled text, such as ELMo, GPT, and BERT. The pre-trained models can be fine-tuned on NLP tasks and have achieved significant improvements over training only on task-specific annotated data. Bapna et al. leveraged slot name and description encodings within a multi-task model for domain adaptation. Lee et al. proposed zero-shot adaptive transfer for slot tagging by embedding the slot descriptions and fine-tuning a pre-trained model on the target domain. Siddhant et al. used a lightweight ELMo model for pre-training and for unsupervised and supervised transfer.
Our contribution in this paper is threefold. First, we propose a Context Encoding Language Transformer (CELT) model for context-aware SLU. Different from previous work exploring various encoding schemes and attention mechanisms to encode context for multi-turn SLU, CELT encodes diverse context information from the dialogue history, such as user and system utterances, speaker information, and system acts, and utilizes this information in a unified framework through a multi-head self-attention mechanism. The context information that CELT can exploit is extensible: for example, for a conversational system that uses a screen for multi-modal interactions, screen-displayed information can be treated as context in CELT and help understand the user query. Second, we develop a multi-step TL approach for CELT, namely, unsupervised pre-training to exploit large-scale general purpose unlabeled text, followed by unsupervised and supervised adaptive training to exploit other in-domain and out-of-domain dialogue corpora. To our knowledge, the CELT model and the multi-step TL approach on CELT are first proposed in this work for multi-turn SLU. Third, we systematically evaluate the efficacy of various context information and TL approaches for SLU. The proposed CELT model, together with the proposed TL approaches, significantly outperforms the state of the art on two large-scale single-turn dialogue benchmarks and one large-scale multi-turn dialogue benchmark.
2 Proposed Approach
Figure 2 provides a high-level illustration of CELT, which consists of the input embedding layer, the encoder representation layer, and the final classifier layer.
2.1 Input Embedding Layer
Given the current query token sequence at turn $t$ and the previous user and system turns in the dialogue session, the target is to predict the semantic frame, including intent, user acts, and slots, for the current query. We concatenate all the previous turns chronologically in the dialogue session and the current query as the input text. The first token of every input text is always the special classification embedding ([CLS]), which is used to predict the intent and user acts. An end-of-utterance ([EOU]) token is appended to each utterance in the previous turns. The previous utterances and the current turn are separated by a special token ([SEP]).
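The input construction above can be sketched as follows; `build_input_text` is a hypothetical helper, not the authors' code, and tokenization is assumed to have already been applied:

```python
def build_input_text(history, current_query):
    """history: previous user/system utterances (token lists) in chronological order."""
    tokens = ["[CLS]"]                 # leading classification token
    for utterance in history:
        tokens.extend(utterance)
        tokens.append("[EOU]")         # end-of-utterance marker per history turn
    tokens.append("[SEP]")             # separates history from the current query
    tokens.extend(current_query)
    return tokens

# Example: one user turn and one system turn of history.
tokens = build_input_text(
    history=[["book", "a", "table"], ["for", "what", "time", "?"]],
    current_query=["7", "pm"],
)
print(tokens)
# ['[CLS]', 'book', 'a', 'table', '[EOU]', 'for', 'what', 'time', '?', '[EOU]', '[SEP]', '7', 'pm']
```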
For a token in the input text, its input embedding is an element-wise sum of token embeddings, position embeddings, segment embeddings, and embeddings of other context information. In this work, we add speaker embeddings and system act embeddings into the sum to obtain the final input embedding. Figure 3 illustrates the input embedding layer.
The learned WordPiece embeddings are used to alleviate the out-of-vocabulary (OOV) problem. The learned position embeddings capture the sequence order information. The learned segment embeddings distinguish the previous turns from the current query; hence all previous turns share the same segment embeddings. Speaker embeddings distinguish the user's turns from the system's turns, considering that speaker role information has been shown useful for SLU in complex dialogues. System act embeddings encode the system act information. Each system act contains an act type and optional slot and value parameters. The acts are categorized into two broad types: acts with an associated slot (e.g., request(date), select(time=7pm)) and acts without associated slots (e.g., negate). We keep the slot type, ignore the slot values, and convert the system acts into embeddings just like word embeddings. We define an n-hot binary vector over the system act vocabulary to represent the system acts of the previous system turn (we only use the system acts from the previous system turn, instead of all system acts in the dialogue history, in order to keep the same setting as Gupta et al.), and convert the n-hot binary vector to a fixed-size vector by multiplying it with a system act embedding matrix.
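The n-hot system act encoding can be sketched as below; the act vocabulary, the embedding size, and the random matrix are purely illustrative (in practice the matrix is learned):

```python
import numpy as np

# Illustrative act vocabulary and embedding size (768 in the actual model).
act_vocab = ["request(date)", "request(time)", "select(time)", "negate", "offer"]
d = 8
E_a = np.random.default_rng(0).normal(size=(len(act_vocab), d))  # learned in practice

def system_act_embedding(acts):
    # n-hot binary vector over the system act vocabulary...
    n_hot = np.zeros(len(act_vocab))
    for act in acts:
        n_hot[act_vocab.index(act)] = 1.0
    # ...multiplied by the embedding matrix gives one fixed-size vector,
    # regardless of how many acts the previous system turn contains.
    return n_hot @ E_a

emb = system_act_embedding(["request(time)", "offer"])
print(emb.shape)  # (8,)
```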
2.2 Encoder Representation Layer
The encoder representation layer is a multi-layer Transformer consisting of a multi-head self-attention sub-layer and a feed-forward sub-layer in each layer. The multi-head self-attention mechanism builds upon scaled dot-product attention, operating on query $Q$, key $K$, and value $V$:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \quad (1)$$

where $d_k$ is the dimension of the keys. Multi-head attention obtains $h$ different representations of $(Q, K, V)$, computes scaled dot-product attention for each representation, and concatenates the results. This can be expressed in the same notation as Equation (1):

$$\mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V}) \quad (2)$$
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^{O} \quad (3)$$

where $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$, and $W^{O}$ are projection matrices.
The concatenation is projected with the feed-forward neural network (FFN) sub-layer, a two-layer network with GELU activation. Given trainable weights $W_1$, $b_1$, $W_2$, $b_2$, this sub-layer is defined as:

$$\mathrm{FFN}(x) = \mathrm{GELU}(xW_1 + b_1)W_2 + b_2 \quad (4)$$
After feeding the input embedding sequence into the encoder representation layer, the output hidden states are $\mathbf{h}_0, \mathbf{h}_1, \ldots, \mathbf{h}_T$, where $\mathbf{h}_0$ corresponds to [CLS].
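A minimal NumPy sketch of these sub-layers (residual connections and layer normalization of the full Transformer layer are omitted, and all weights are random toy values):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gelu(x):
    # tanh approximation of GELU (Hendrycks & Gimpel, 2016)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def attention(Q, K, V):
    # Scaled dot-product attention over query/key/value matrices.
    d_k = K.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def multi_head(x, Wq, Wk, Wv, Wo):
    # Project the input per head, attend, concatenate, and project the result.
    heads = [attention(x @ wq, x @ wk, x @ wv) for wq, wk, wv in zip(Wq, Wk, Wv)]
    return np.concatenate(heads, axis=-1) @ Wo

def ffn(x, W1, b1, W2, b2):
    # Position-wise two-layer feed-forward sub-layer with GELU activation.
    return gelu(x @ W1 + b1) @ W2 + b2

# Tiny usage example: 5 tokens, model size 8, 2 heads of size 4.
rng = np.random.default_rng(1)
n, d, h, dk, dff = 5, 8, 2, 4, 16
x = rng.normal(size=(n, d))
Wq = [rng.normal(size=(d, dk)) for _ in range(h)]
Wk = [rng.normal(size=(d, dk)) for _ in range(h)]
Wv = [rng.normal(size=(d, dk)) for _ in range(h)]
Wo = rng.normal(size=(h * dk, d))
W1, b1 = rng.normal(size=(d, dff)), np.zeros(dff)
W2, b2 = rng.normal(size=(dff, d)), np.zeros(d)
out = ffn(multi_head(x, Wq, Wk, Wv, Wo), W1, b1, W2, b2)
print(out.shape)  # (5, 8)
```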
2.3 Final Classifier Layer
2.3.1 Intent and user act classification
We assume that each user query contains a single intent and possibly multiple user dialogue acts. Intent classification (IC) predicts the intent probability distribution, using Equation (5). User act classification (UAC) is defined as a multi-label binary classification problem; the probability of the presence of the $k$-th user act in the user query is calculated by Equation (6):

$$p^{\mathrm{intent}} = \mathrm{softmax}(W^{I} f_{I}(\mathbf{h}_0)) \quad (5)$$
$$p^{\mathrm{act}}_{k} = \mathrm{sigmoid}(W^{U}_{k} f_{U}(\mathbf{h}_0)) \quad (6)$$

where $f_{I}$ and $f_{U}$ are non-linear feed-forward layers with tanh activation. During inference, the intent label is predicted by $\arg\max p^{\mathrm{intent}}$, and a user act is predicted when its probability exceeds the threshold $\beta$, a hyperparameter tuned on the validation set.
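A toy sketch of the two classification heads; the weights are illustrative, and the non-linear tanh feed-forward layers are folded into single projections for brevity:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def classify(h_cls, W_intent, W_act, beta=0.5):
    # IC: single intent via argmax over the softmax distribution.
    intent = int(np.argmax(softmax(h_cls @ W_intent)))
    # UAC: multi-label, keep every act whose sigmoid probability exceeds beta.
    act_probs = sigmoid(h_cls @ W_act)
    user_acts = [k for k, p in enumerate(act_probs) if p > beta]
    return intent, user_acts

h_cls = np.array([1.0, -1.0])                               # toy [CLS] state
W_intent = np.array([[2.0, 0.0, -1.0], [0.0, 1.0, 0.0]])    # 2 dims -> 3 intents
W_act = np.array([[3.0, -3.0, 0.2], [0.0, 0.0, 0.0]])       # 2 dims -> 3 user acts
print(classify(h_cls, W_intent, W_act))  # (0, [0, 2])
```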
2.3.2 Slot filling
Slot filling (SF) identifies the values for the different slots present in the user utterance. We use the BIO (begin-inside-outside) tagging scheme to assign a label to each token. We feed the final hidden states into a softmax layer to classify over the SF labels. To make this procedure compatible with the WordPiece tokenization, we feed each input word into the WordPiece tokenizer and use the hidden state corresponding to its first sub-token as the input to the softmax classifier:

$$p^{\mathrm{slot}}_{m} = \mathrm{softmax}(W^{S} f_{S}(\mathbf{h}_m)) \quad (7)$$

where $\mathbf{h}_m$ is the hidden state corresponding to the first sub-token of word $m$ in the current query, and $f_{S}$ is a non-linear feed-forward layer with GELU activation.
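The first-sub-token alignment can be sketched as below, with a toy stand-in for the WordPiece tokenizer:

```python
# For each word, record the position of its first sub-token in the flattened
# sub-token sequence; only those positions feed the slot classifier.
def first_subtoken_indices(words, tokenize):
    indices, pos = [], 0
    for w in words:
        indices.append(pos)          # first sub-token of this word
        pos += len(tokenize(w))      # skip the remaining sub-tokens
    return indices

# Toy stand-in for a WordPiece tokenizer (not the real vocabulary).
toy_pieces = {"playlist": ["play", "##list"], "add": ["add"], "song": ["song"]}
tokenize = lambda w: toy_pieces.get(w, [w])

idx = first_subtoken_indices(["add", "playlist", "song"], tokenize)
print(idx)  # [0, 1, 3]  ("playlist" spans positions 1-2, so "song" starts at 3)
```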
For joint learning, the objective is to minimize the sum of the softmax cross-entropy losses of IC and SF and the sigmoid cross-entropy loss of UAC. Previous work has shown that an additional CRF layer on top of a BiLSTM can improve performance for sequence tagging [26, 27]. Hence, we investigate the efficacy of adding a CRF layer on top of the encoder representation layer to model slot label dependencies, similar to prior work.
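The combined objective can be sketched as below with toy probabilities (in practice the losses are computed from logits by the training framework):

```python
import numpy as np

def softmax_xent(probs, gold_index):
    # Cross-entropy for a single softmax distribution against a gold class.
    return -np.log(probs[gold_index])

def sigmoid_xent(probs, labels):
    # Summed binary cross-entropy over the independent user-act sigmoids.
    return -np.sum(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))

def joint_loss(ic_probs, ic_gold, sf_probs, sf_gold, uac_probs, uac_labels):
    # Sum of IC softmax loss, per-token SF softmax losses, and UAC sigmoid loss.
    sf = sum(softmax_xent(p, g) for p, g in zip(sf_probs, sf_gold))
    return softmax_xent(ic_probs, ic_gold) + sf + sigmoid_xent(uac_probs, uac_labels)

loss = joint_loss(
    ic_probs=np.array([0.7, 0.2, 0.1]), ic_gold=0,
    sf_probs=[np.array([0.9, 0.1]), np.array([0.2, 0.8])], sf_gold=[0, 1],
    uac_probs=np.array([0.8, 0.3]), uac_labels=np.array([1.0, 0.0]),
)
print(round(loss, 3))  # 1.265
```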
2.4 Transfer Learning
To improve SLU on the target domains and reduce dependency on data collection and annotation, we explore TL based on the CELT model, leveraging large-scale unlabeled text as well as other multi-turn dialogue corpora, either unlabeled or labeled with different intent/slot/dialogue act labels, from the same or different domains w.r.t. the target domains. We develop a multi-step transfer learning approach. In the first step, to exploit large-scale unlabeled text, we use unsupervised pre-training based on the BERT model with two tasks trained together, i.e., masked language model (MLM) and next sentence prediction (NSP). Next, to exploit other dialogue corpora, we propose two TL methods, namely, unsupervised adaptive training and supervised adaptive training. In the second step, unsupervised adaptive training trains on other unlabeled dialogue corpora, using the MLM and NSP losses. In the third step, given other labeled dialogue corpora, supervised adaptive training fine-tunes on the labeled data, based on the combined loss of IC and SF. (Note that in the combined loss, when samples have multiple intent labels, sigmoid cross-entropy loss is used for IC instead of softmax cross-entropy loss.) The resulting weights for the input embedding layer and the encoder representation layer are then used to initialize the new CELT model for the next step of fine-tuning for the target-domain SLU. In this way, supervised adaptive training can exploit labeled data with intent/slot labels different from those used for the target domains. In the fourth step, the model is fine-tuned on the target-domain labeled data based on the combined loss of IC, SF, and UAC.
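The four-step schedule can be sketched as below; the stage functions are hypothetical placeholders that only record which losses and corpora each step would use, standing in for actual training runs:

```python
# Each placeholder stage appends the losses/corpus it would optimize on,
# so the ordering of the schedule is concrete; real stages would update
# Transformer weights instead.
def make_stage(losses):
    return lambda state, corpus: state + [(losses, corpus)]

schedule = [
    (make_stage("MLM+NSP"),   "general text"),         # step 1: unsupervised pre-training
    (make_stage("MLM+NSP"),   "unlabeled dialogues"),  # step 2: unsupervised adaptive training
    (make_stage("IC+SF"),     "labeled dialogues"),    # step 3: supervised adaptive training
    (make_stage("IC+SF+UAC"), "target domain data"),   # step 4: target-domain fine-tuning
]

state = []
for stage, corpus in schedule:
    state = stage(state, corpus)
print([losses for losses, _ in state])
# ['MLM+NSP', 'MLM+NSP', 'IC+SF', 'IC+SF+UAC']
```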
3 Experiments and Analysis
3.1 Datasets
We conduct two sets of experiments. In the first set, we evaluate the efficacy of BERT pre-train and joint modeling of IC and SF in CELT on the single-turn ATIS and Snips dialogue corpora. ATIS includes audio recordings of people making flight reservations. Snips is collected from the Snips personal voice assistant. We use the same data division as previous work for both datasets. The data statistics are summarized in Table 1. For these two datasets, system act embeddings and the user act classifier are not used, because system and user dialogue acts are not annotated. Speaker embeddings are not used since there is only the current query in single-turn dialogues.
In the second set of experiments, we evaluate the proposed model and TL approaches on the multi-turn Google Simulated Dialogues (GSD) dataset (https://github.com/google-research-datasets/simulated-dialogue). We explore the Microsoft Dialogue Challenge (MDC) (https://github.com/xiul-msr/e2e_dialog_challenge) and MultiWOZ 2.0 (WOZ) (https://www.repository.cam.ac.uk/handle/1810/280608) datasets as other dialogue corpora for evaluating the proposed TL approaches; note that WOZ does not have user act annotations, so it cannot be directly used for SLU. We use the same data division as previous work. The GSD dataset covers restaurant (GSD-Restaurant) and movie (GSD-Movie) domains. The entire GSD dataset (GSD-Overall) consists of 3 intents, 12 slot types, and 22 user dialogue act types. The data statistics are summarized in Table 1. Note that the 3 intents (“BUY_MOVIE_TICKETS”, “FIND_RESTAURANT”, “RESERVE_RESTAURANT”) are quite high-level; instead, the 22 user dialogue act types provide the user intent information for the SLU task. The MDC dataset covers restaurant, movie, and taxi domains, with 4103, 2890, and 3094 training dialogues, 11 intents, and 30, 29, and 19 slots for the three domains, respectively. The WOZ dataset consists of human-human written conversations spanning 7 domains and 10,438 dialogues in total.
3.2 Training Details
The Transformer block in CELT has 12 layers, 768 hidden units, a feed-forward size of 3072, and 12 self-attention heads. The size of the hidden states in the final classifier layer is 768. For pre-training, we use the English uncased BERT-Base model (https://github.com/google-research/bert), pre-trained on the BooksCorpus
and English Wikipedia. For unsupervised/supervised adaptive training on MDC and WOZ and fine-tuning on the GSD-Overall dataset, all hyper-parameters are tuned on the GSD-Overall validation set. For the first set of experiments on ATIS and Snips, the maximum sequence length is 50, the batch size is 128, and the number of training epochs is 30. For the second set of experiments on multi-turn dialogues, the maximum sequence length is 128 and the batch size is 32. Adam is used for optimization. The initial learning rate is 5e-5 for supervised adaptive training and fine-tuning (in both sets of experiments), and 2e-5 for unsupervised adaptive training. The dropout probability is 0.1. The mask probability of the MLM task is 15% for unsupervised adaptive training. We compare using different numbers of previous user and system turns in the dialogue session and observe the best SLU performance when using all previous turns. The threshold for user act classification is tuned on the validation set.
3.3 Results and Discussion
3.3.1 Single-Turn SLU
| Model | Snips Intent Acc | Snips Slot F1 | Snips Sent Acc | ATIS Intent Acc | ATIS Slot F1 | ATIS Sent Acc |
| Capsule Neural Networks | 97.3 | 91.8 | 80.9 | 95.0 | 95.2 | 83.4 |
| (1) CELT w/o BERT pre-train | 97.8±0.2 | 90.0±0.6 | 79.3±1.4 | 96.9±0.1 | 92.7±0.1 | 80.5±0.4 |
| (2) CELT w/o BERT pre-train + CRF | 97.9±0.3 | 90.8±0.2 | 80.9±0.5 | 97.0±0.3 | 93.1±0.2 | 81.6±0.4 |
| (3) (1) w/ BERT pre-train | 98.3±0.3 | 96.4±0.2 | 91.9±0.2 | 97.4±0.4 | 95.9±0.1 | 87.9±0.4 |
| (4) (2) w/ BERT pre-train | 98.3±0.1 | 96.5±0.2 | 91.8±0.5 | 97.6±0.1 | 95.7±0.1 | 87.6±0.2 |
Table 2: SLU performance on the single-turn Snips and ATIS test sets. Since Snips and ATIS are single-turn dialogues, none of the models in this table use context information. All models are trained and tested on the same training and test partitions of Snips and ATIS, respectively (no transfer learning is applied). For CELT without and with BERT pre-train, and without and with replacing the softmax layer with a CRF layer, we report the mean and standard deviation over 5 models with different random initializations. The metrics are intent classification accuracy, slot filling F1, and sentence-level semantic frame accuracy. The results for the first group of models are cited from [15, 16].
Table 2 shows the SLU performance, measured as IC accuracy, SF F1, and sentence-level semantic frame accuracy, on the Snips and ATIS datasets. The first group of models is considered as the baselines and consists of the state-of-the-art joint IC and SF models: the sequence-based joint model using BiLSTM, the attention-based model, the slot-gated model, and the capsule neural network based model.
The second group of models in Table 2 includes the proposed CELT models. CELT with BERT pre-train significantly outperforms the baselines on both datasets. Compared to ATIS, Snips includes multiple domains and has a larger vocabulary. For the more complex Snips dataset, CELT with BERT pre-train achieves intent accuracy of 98.3% (from 97.3%), slot F1 of 96.4% (from 91.8%), and sentence accuracy of 91.9% (from 80.9%). On ATIS, CELT achieves intent accuracy of 97.4% (from 95.0%), slot F1 of 95.9% (from 95.2%), and sentence accuracy of 87.9% (from 83.4%). The gain from CELT with BERT pre-train on Snips over the baselines is much more significant on slot F1 and sentence frame accuracy than intent accuracy. Further analysis shows that 53.4% of slots in the Snips test set can be found in Wikipedia. Since BERT pre-training data includes English Wikipedia, the model may have encoded the knowledge in representations and hence improves slot F1 and sentence accuracy.
Without BERT pre-train, the SLU performance degrades drastically on both datasets. These results demonstrate the strong generalization and semantic representation capability of the BERT pre-train model, considering that it is pre-trained on large-scale text from mismatched domains and genres (books and Wikipedia). Without BERT pre-train, replacing the softmax layer with CRF consistently improves the sentence accuracy (3% and 2.4% relative gains for Snips and ATIS, respectively); whereas, adding CRF for CELT with BERT pre-train performs comparably. Hence, the second set of experiments uses CELT without CRF.
Ablation analysis on Snips shows that when fine-tuning the BERT pre-train model separately for IC and SF, both intent accuracy and slot F1 drop. These results demonstrate that joint modeling in CELT improves the performance of both tasks. We also compare CELT models with different numbers of fine-tuning epochs; the CELT model fine-tuned for only 1 epoch already outperforms the baselines in Table 2.
| Model | Intent Acc | User Act F1 | Slot F1 | Sent Acc |
| j. b − BERT pre-train | 99.92 | 92.44 | 91.12 | 86.54 |
| k. j − speaker embeddings | 99.96 | 92.22 | 90.43 | 86.15 |
| l. k − context utterances | 94.52 | 93.55 | 91.61 | 82.35 |
| m. l − system act embeddings | 77.33 | 89.11 | 90.93 | 66.88 |
3.3.2 Multi-Turn SLU and Transfer Learning
Table 3 shows the IC accuracy, UAC F1, SF F1, and sentence frame accuracy on the GSD test sets. The first group of models includes the baselines. RNN-NoContext uses a two-layer stacked BiRNN with GRU and LSTM cells respectively, and no context information. RNN-PreviousTurn is similar to RNN-NoContext, with an additional BiGRU layer encoding the previous system turn for slot tagging. MemNet uses a memory network to encode the dialogue history utterances from both user and system. SDEN uses the dialogue history utterances from both user and system through a BiGRU for combining memory embeddings. HRNN-SystemAct is the previous state-of-the-art (SOTA) system, using a hierarchical RNN to encode the dialogue acts of the previous system turn as the context information.
The second group of models includes the proposed CELT model after applying the proposed TL approaches. CELT achieves new SOTA results: the absolute gains from CELT over the previous SOTA in user act F1 and sentence frame accuracy are 2.8% and 6.11% on the GSD-Restaurant test set, 0.89% and 10.12% on the GSD-Movie test set, and 2.62% and 7.44% on the GSD-Overall test set. Table 4 shows the ablation analysis on the GSD-Overall test set. Removing all unsupervised and supervised adaptive training for CELT (CELT-UA-SA) degrades the sentence accuracy from 95.02% to 93.00%. Further removing the unsupervised BERT pre-train step degrades sentence accuracy to 86.54%; in particular, user act F1 decreases from 98.29% to 92.44%, and slot F1 decreases from 94.99% to 91.12%. These results demonstrate that the contextual representations learned from large-scale general purpose unlabeled text significantly improve user act classification and slot filling. After further removing the speaker embeddings, slot F1 drops from 91.12% to 90.43% and sentence accuracy drops from 86.54% to 86.15%, suggesting that CELT is capable of exploiting the additional discriminative information provided by the speaker embeddings. After further removing the context utterances, intent accuracy drops from 99.96% to 94.52% and sentence accuracy drops from 86.15% to 82.35%, indicating that the context utterances play a key role in intent prediction. After further removing the system act embeddings, i.e., using no context at all, intent accuracy drops from 94.52% to 77.33%, user act F1 drops from 93.55% to 89.11%, and sentence frame accuracy drops from 82.35% to 66.88%. These results show that using no context information degrades SLU performance significantly.
Notably, although the intent accuracy and slot F1 of CELT-NoContext (the last row in Table 4) are both lower than those of RNN-NoContext (the first row in Table 3), CELT-NoContext achieves a better sentence accuracy (66.88%) than RNN-NoContext (64.56%), demonstrating the strength of CELT in enforcing intent and slot coherence.
We further analyze the efficacy of using in-domain (ID) and out-of-domain (OOD) data for unsupervised adaptive (UA) and supervised adaptive (SA) training. As shown in Table 4, using OOD data for UA (model d) achieves a small SLU improvement over model b. UA(ID) (model c) yields a significantly larger gain over model b than model d does. However, adding OOD data to ID data for UA (model g) degrades the performance slightly compared to model c. Adding WOZ to the MDC ID+OOD data for UA (model i) further degrades the performance over model g. In contrast, after applying UA(ID) (model c), adding OOD to ID for SA (model f) outperforms SA(ID) (model e), achieving 95.02% sentence frame accuracy. These results suggest that SA can benefit from both ID and OOD data, probably due to the combined loss of IC and SF. In future work, we will explore losses for UA other than the MLM and NSP losses in order to benefit from both ID and OOD data.
We also observe fast convergence of both the CELT-UA-SA and CELT models on the GSD-Overall test set, consistent with previous observations on models using BERT pre-train. For CELT-UA-SA, the user act F1 and sentence frame accuracy increase from 87.50 and 52.52 (epoch 1) to 97.19 and 90.31 (epoch 5), keep improving until epoch 20 (98.29 and 93.00), but degrade from epoch 20 to 40. For CELT, these two metrics increase from 92.34 and 62.60 (epoch 1) to 97.17 and 93.79 (epoch 5), and keep improving from epoch 5 to 40 (98.47 and 95.02).
4 Conclusions
We proposed the Context Encoding Language Transformer (CELT) for SLU, which facilitates exploiting various context information, together with transfer learning approaches for leveraging external resources. Experimental results demonstrate that CELT with TL achieves new SOTA SLU performance on two large-scale single-turn dialogue benchmarks and one multi-turn dialogue benchmark. Future work includes improving supervised and unsupervised TL and exploring TL on knowledge bases.
-  Gokhan Tur and Renato De Mori, Spoken language understanding: Systems for extracting semantic information from speech, John Wiley & Sons, 2011.
-  A. Bhargava, Asli Çelikyilmaz, Dilek Hakkani-Tür, and Ruhi Sarikaya, “Easy contextual intent prediction and slot detection,” in IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2013, Vancouver, BC, Canada, May 26-31, 2013, 2013, pp. 8337–8341.
-  Puyang Xu and Ruhi Sarikaya, “Contextual domain classification in spoken language understanding systems using recurrent neural network,” in IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2014, Florence, Italy, May 4-9, 2014, 2014, pp. 136–140.
-  Yun-Nung Chen, Ming Sun, Alexander I. Rudnicky, and Anatole Gershman, “Leveraging behavioral patterns of mobile applications for personalized spoken language understanding,” in Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, Seattle, WA, USA, November 09 - 13, 2015, 2015, pp. 83–86.
-  Ming Sun, Yun-Nung Chen, and Alexander I. Rudnicky, “An intelligent assistant for high-level task understanding,” in Proceedings of the 21st International Conference on Intelligent User Interfaces, IUI 2016, Sonoma, CA, USA, March 7-10, 2016, 2016, pp. 169–174.
-  Yangyang Shi, Kaisheng Yao, Hu Chen, Yi-Cheng Pan, Mei-Yuh Hwang, and Baolin Peng, “Contextual spoken language understanding using recurrent neural networks,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2015, South Brisbane, Queensland, Australia, April 19-24, 2015, 2015, pp. 5271–5275.
-  Raghav Gupta, Abhinav Rastogi, and Dilek Hakkani-Tür, “An efficient approach to encoding context for spoken language understanding,” in Interspeech 2018, 19th Annual Conference of the International Speech Communication Association, Hyderabad, India, 2-6 September 2018., 2018, pp. 3469–3473.
-  Yun-Nung Chen, Dilek Hakkani-Tür, Gökhan Tür, Jianfeng Gao, and Li Deng, “End-to-end memory networks with knowledge carryover for multi-turn spoken language understanding,” in Interspeech 2016, 17th Annual Conference of the International Speech Communication Association, San Francisco, CA, USA, September 8-12, 2016, 2016, pp. 3245–3249.
-  Ankur Bapna, Gökhan Tür, Dilek Z. Hakkani-Tür, and Larry P. Heck, “Sequential dialogue context modeling for spoken language understanding,” in Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, Saarbrücken, Germany, August 15-17, 2017, 2017, pp. 103–114.
-  Shang-Yu Su, Pei-Chieh Yuan, and Yun-Nung Chen, “How time matters: Learning time-decay attention for contextual spoken language understanding in dialogues,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), 2018, pp. 2133–2142.
-  Puyang Xu and Ruhi Sarikaya, “Convolutional neural network based triangular CRF for joint intent detection and slot filling,” in 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, Olomouc, Czech Republic, December 8-12, 2013, 2013, pp. 78–83.
-  Dilek Hakkani-Tür, Gökhan Tür, Asli Çelikyilmaz, Yun-Nung Chen, Jianfeng Gao, Li Deng, and Ye-Yi Wang, “Multi-domain joint semantic frame parsing using bi-directional RNN-LSTM,” in INTERSPEECH. 2016, pp. 715–719, ISCA.
-  Xiaodong Zhang and Houfeng Wang, “A joint model of intent determination and slot filling for spoken language understanding,” in Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI 2016, New York, NY, USA, 9-15 July 2016, 2016, pp. 2993–2999.
-  Bing Liu and Ian Lane, “Attention-based recurrent neural network models for joint intent detection and slot filling,” in Interspeech 2016, San Francisco, CA, USA, September 8-12, 2016, 2016, pp. 685–689.
-  Chih-Wen Goo, Guang Gao, Yun-Kai Hsu, Chih-Li Huo, Tsung-Chieh Chen, Keng-Wei Hsu, and Yun-Nung Chen, “Slot-gated modeling for joint slot filling and intent prediction,” in NAACL-HLT, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 2 (Short Papers), 2018, pp. 753–757.
-  Chenwei Zhang, Yaliang Li, Nan Du, Wei Fan, and Philip S. Yu, “Joint slot filling and intent detection via capsule neural networks,” CoRR, vol. abs/1812.09471, 2018.
-  Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer, “Deep contextualized word representations,” in NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), 2018, pp. 2227–2237.
-  Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever, “Improving language understanding with unsupervised learning,” Technical report, OpenAI, 2018.
-  Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, “BERT: pre-training of deep bidirectional transformers for language understanding,” CoRR, vol. abs/1810.04805, 2018.
-  Ankur Bapna, Gökhan Tür, Dilek Z. Hakkani-Tür, and Larry P. Heck, “Towards zero-shot frame semantic parsing for domain scaling,” CoRR, vol. abs/1707.02363, 2017.
-  Sungjin Lee and Rahul Jha, “Zero-shot adaptive transfer for conversational language understanding,” CoRR, vol. abs/1808.10059, 2018.
-  Aditya Siddhant, Anuj Kumar Goyal, and Angeliki Metallinou, “Unsupervised transfer learning for spoken language understanding in intelligent agents,” CoRR, vol. abs/1811.05370, 2018.
-  Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean, “Google’s neural machine translation system: Bridging the gap between human and machine translation,” CoRR, vol. abs/1609.08144, 2016.
-  Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin, “Attention is all you need,” in NIPS 2017, 4-9 December 2017, Long Beach, CA, USA, 2017, pp. 6000–6010.
-  Dan Hendrycks and Kevin Gimpel, “Gaussian error linear units (gelus),” CoRR, vol. abs/1606.08415, 2016.
-  Jie Zhou and Wei Xu, “End-to-end learning of semantic role labeling using recurrent neural networks,” in ACL 2015, July 26-31, 2015, Beijing, China, Volume 1: Long Papers, 2015, pp. 1127–1137.
-  Zhiheng Huang, Wei Xu, and Kai Yu, “Bidirectional LSTM-CRF models for sequence tagging,” CoRR, vol. abs/1508.01991, 2015.
-  John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira, “Conditional random fields: Probabilistic models for segmenting and labeling sequence data,” in (ICML 2001), Williams College, Williamstown, MA, USA, June 28 - July 1, 2001, 2001, pp. 282–289.
-  Gökhan Tür, Dilek Hakkani-Tür, and Larry P. Heck, “What is left to be understood in atis?,” in 2010 IEEE Spoken Language Technology Workshop, SLT 2010, Berkeley, California, USA, December 12-15, 2010, 2010, pp. 19–24.
-  Alice Coucke, Alaa Saade, Adrien Ball, Théodore Bluche, Alexandre Caulier, David Leroy, Clément Doumouro, Thibault Gisselbrecht, Francesco Caltagirone, Thibaut Lavril, Maël Primet, and Joseph Dureau, “Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces,” CoRR, vol. abs/1805.10190, 2018.
-  Xiujun Li, Sarah Panda, Jingjing Liu, and Jianfeng Gao, “Microsoft dialogue challenge: Building end-to-end task-completion dialogue systems,” CoRR, vol. abs/1807.11125, 2018.
-  Pawel Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Iñigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gasic, “Multiwoz - A large-scale multi-domain wizard-of-oz dataset for task-oriented dialogue modelling,” in EMNLP. 2018, pp. 5016–5026, Association for Computational Linguistics.
-  Yukun Zhu, Ryan Kiros, Richard S. Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler, “Aligning books and movies: Towards story-like visual explanations by watching movies and reading books,” in 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, 2015, pp. 19–27.
-  Diederik P. Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” CoRR, vol. abs/1412.6980, 2014.