In task-oriented spoken dialogue systems, the user and the system engage in interactions that can span multiple turns. A key challenge is that the user can reference entities introduced in previous dialogue turns. For example, if a user request what's the weather in arlington is followed by how about tomorrow, the dialogue system has to keep track of the entity arlington being referenced. In slot-based spoken dialogue systems, tracking the entities in context can be cast as a slot carryover task: only the relevant slots from the dialogue context are carried over to the current turn. Recent work by Naik et al. (2018) describes a scalable multi-domain neural network architecture that addresses this task in a diverse schema setting. However, that approach treats every slot as independent. Consequently, as shown in our experiments, it performs worse when the contextual slot being referenced is associated with dialogue turns that are further away from the current turn. We posit that modeling slots jointly is essential for improving accuracy over long distances, particularly when slots are correlated. We motivate this with an example conversation in Figure 1. In this example, the slots WeatherCity/WeatherState need to be carried over together from the dialogue history because they are correlated. However, the model in Naik et al. (2018) has no information about this slot interdependence and may choose to carry over only one of the slots. In this work, we alleviate this issue by proposing two novel neural network architectures, one based on pointer networks (Vinyals et al., 2015) and another based on self-attention with transformers (Vaswani et al., 2017), that can learn to jointly predict whether a subset of related slots should be carried over from the dialogue history.
To validate our approach, we conduct thorough evaluations on both the publicly available DSTC2 task (Henderson et al., 2014) and our internal dialogue dataset collected from a commercial digital assistant. In Section 4.3, we show that our proposed approaches improve slot carryover accuracy over the baseline systems on longer dialogue contexts. A detailed error analysis reveals that our proposed models are more likely to utilize "anchor" slots (slots tagged in the current utterance) to carry over long-distance slots from context. To summarize, we make the following contributions in this work:
We improve upon the slot carryover model architecture in Naik et al. (2018) by introducing approaches for modeling slot interdependencies. We propose two neural network models based on pointer networks and transformer networks that can make joint predictions over slots.
We provide a detailed analysis of the proposed models both on an internal benchmark and public dataset. We show that contextual encoding of slots and modeling slot interdependencies is essential for improving performance of slot carryover over longer dialogue contexts. Transformer architectures with self attention provide the best performance overall.
2 Problem Formulation
A dialogue is formulated as a sequence of utterances, alternately uttered by the user and the system agent:

    D = (x_N, ..., x_1, x_0)

where each element x_d is an utterance. The subscript d denotes the utterance distance, which measures the offset from the most recent user utterance (d = 0). The i-th token of an utterance with distance d is denoted as w_i^d. A slot in a dialogue is defined as a key-value pair that contains entity information, e.g., [City:San Francisco]. Each slot s can be determined by the utterance distance d, the slot key k, and a span [i, j] over the tokens of the utterance, with the slot value represented as (w_i^d, ..., w_j^d). Given a dialogue history D and a set of candidate slots C = {s_1, ..., s_n}, the context carryover task is to decide which slots should be carried over to the current turn. Previous work (Naik et al., 2018) addressed the task as a binary classification problem in which each slot is classified independently:

    f(s, D) -> {0, 1}

In contrast, our proposed models explicitly capture slot interactions and make a joint prediction over all slots:

    g(C, D) -> S, with S a subset of C

where f denotes the binary classification model (Naik et al., 2018) and g denotes our joint prediction models.
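The contrast between the two formulations can be sketched in code. This is a toy illustration, not the paper's models: the scoring functions are hand-set stand-ins for the learned networks, and the joint model enumerates subsets exhaustively, whereas the paper uses pointer networks or transformers instead.

```python
from itertools import combinations
from typing import Callable, List, Tuple

Slot = Tuple[int, str, str]  # (utterance distance d, slot key k, slot value)

def independent_carryover(slots: List[Slot],
                          score: Callable[[Slot], float],
                          threshold: float = 0.5) -> List[Slot]:
    """Naik et al. (2018)-style formulation: each slot is decided on its own."""
    return [s for s in slots if score(s) >= threshold]

def joint_carryover(slots: List[Slot],
                    score_subset: Callable[[List[Slot]], float]) -> List[Slot]:
    """Joint formulation: choose the subset with the highest joint score."""
    best, best_score = [], float("-inf")
    for r in range(len(slots) + 1):
        for subset in combinations(slots, r):
            sc = score_subset(list(subset))
            if sc > best_score:
                best, best_score = list(subset), sc
    return best

slots = [(2, "WeatherCity", "arlington"), (2, "WeatherState", "va")]

def pair_scorer(subset):
    # Toy joint scorer that rewards keeping the correlated City/State pair together.
    keys = {k for _, k, _ in subset}
    return 2.0 if keys == {"WeatherCity", "WeatherState"} else 0.0

print(joint_carryover(slots, pair_scorer))
```

An independent scorer that happens to fire only on WeatherCity would carry over just one slot of the correlated pair, which is exactly the failure mode the joint formulation avoids.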
3.1 General architecture
We follow the approach in Naik et al. (2018), where, given a dialogue , we construct a candidate set of slots from the context by leveraging the slot key embeddings to find the nearest slot keys that are associated with the current turn.
A slot encoder is a model that, given a candidate slot s (a slot key, a span in the history, and a distance), produces a fixed-length vector representation of the slot, Enc(s, D), where s is the slot and D is the full dialogue history.
We serialize the utterances in the dialogue and use a BiLSTM to encode the context as a fixed-length vector c.

The intent of the most recent utterance, determined by an NLU module, is also encoded as a fixed-length vector i: we average the word embeddings of the tokens associated with the intent to get the intent embedding.

Given the encoded vector representations of the slots, the context vector c, and the intent vector i, the decoder produces the subset of slot ids to be carried over.
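The averaged-embedding encoding used for the intent (and, in the next section, for the slot key) reduces to a simple mean over word vectors. A minimal sketch, assuming a toy embedding lookup table; the vocabulary, dimensions, and values are invented for illustration:

```python
def avg_embedding(tokens, emb, dim=4):
    """Average the word vectors of `tokens`; unknown words fall back to zeros."""
    vecs = [emb.get(t, [0.0] * dim) for t in tokens]
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

# Toy 4-d embedding table (assumed, not from the paper):
emb = {"get": [1, 0, 0, 0], "weather": [0, 1, 0, 0], "city": [0, 0, 1, 0]}

intent_vec = avg_embedding(["get", "weather"], emb)    # e.g. intent GetWeather
slotkey_vec = avg_embedding(["weather", "city"], emb)  # e.g. slot key WeatherCity
print(intent_vec)  # [0.5, 0.5, 0.0, 0.0]
```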
The overall architecture of the model is shown in Figure 2. We elaborate on the specific designs of these components under this general architecture.
3.2 Slot Encoder Variants
In this section, we describe the different methods that we use to encode slots. We average the word embeddings e(w) of the tokens w in the slot key to obtain the slot key encoding, where e(w) is the embedding vector of token w. For the slot value (the tokens w_i^d, ..., w_j^d), we propose the following encoding approaches.
The first approach is to average the token embeddings of the tokens in the slot value.
To get an improved, contextualized representation of the slot value in the dialogue, we also use neural network models to encode slots. We experimented with a bidirectional LSTM (Hochreiter and Schmidhuber, 1997) model for slot encoding. LSTMs are equipped with feedback loops in their recurrent layer, which helps store contextual information over a long history. We encode all dialogue utterances with a BiLSTM to obtain a contextualized vector representation for each token, then average the output hidden states of the tokens in the slot's span to get the slot value encoding.
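The span-averaging step above can be sketched as follows. The BiLSTM itself is stubbed out with precomputed per-token states (the values are toy numbers, not real model outputs); only the averaging over the slot's span is shown.

```python
def span_encoding(hidden_states, i, j):
    """Average the contextual token states over the slot span [i, j] (inclusive)."""
    span = hidden_states[i:j + 1]
    dim = len(span[0])
    return [sum(h[k] for h in span) / len(span) for k in range(dim)]

# Assumed BiLSTM outputs for "weather in san francisco" (toy 2-d states):
H = [[0.1, 0.2], [0.0, 0.0], [0.5, 0.5], [0.5, 0.25]]
slot_value_vec = span_encoding(H, 2, 3)  # span over "san francisco"
print(slot_value_vec)  # [0.5, 0.375]
```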
The utterance distance d may contain important signals. Whether this integer is odd or even indicates whether the utterance was uttered by the user or by the system. The smaller it is, the closer a slot is to the current utterance, and hence the more probable it implicitly is to be carried over. Building on these intuitions, we encode the distance as a small vector (e.g., 4 dimensions) and append it to the overall slot encoding.
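The distance feature reduces to an embedding lookup followed by concatenation. A minimal sketch; the embedding values and the clipping of long distances into one bucket are illustrative assumptions, not details from the paper:

```python
# Toy 4-d distance embeddings (assumed values):
DIST_EMB = {0: [1, 0, 0, 0], 1: [0, 1, 0, 0], 2: [0, 0, 1, 0]}

def add_distance(slot_vec, d, max_d=2):
    """Append the distance embedding to the slot encoding (concatenation, not addition)."""
    d = min(d, max_d)  # assumed: clip long distances into the last bucket
    return list(slot_vec) + DIST_EMB[d]

enc = add_distance([0.5, 0.25], d=2)
print(enc)  # [0.5, 0.25, 0, 0, 1, 0]
```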
3.3 Decoder Variants
Pointer network decoder
We adopt the architecture of the pointer network (Vinyals et al., 2015) as a method to perform joint prediction of the slots to be carried over. Pointer networks, a variant of the Seq2Seq model (Bahdanau et al., 2015; Sutskever et al., 2014; Luong et al., 2015), instead of transducing the input sequence into another output sequence, yield a succession of soft pointers (attention vectors) to the input sequence, hence producing an ordering of the elements of a variable-length input. We use a pointer network to select a subset of the slots from the input slot set. The input slot encodings are ordered as a sequence and fed into a bidirectional LSTM encoder to yield a sequence of encoded hidden states. We experiment with different slot orderings as described in Section 4.
A special sentinel token is appended to the beginning of the input to the pointer network; when decoding, once the output pointer points to this token, the decoding process stops. Given the encoder hidden states, the decoding process at every time step is computed and updated as shown below.
Contrary to standard attention-based models, which directly use the decoder state as the query, we incorporate the context vector c and the intent vector i into the attention query. The query vector is a concatenation of the three components:
We use the general Luong attention scoring function (Luong et al., 2015), which has a bilinear form:
As a subset output is desired, the outputs should be distinct across decoding steps. To this end, we utilize a dynamic mask in the decoding process: for every input slot encoding, a Boolean mask variable is initialized to 1. Once a specific slot is generated, it is crossed out: its corresponding mask is set to 0, and later pointers will never attend to this slot again. This ensures the distinctness of the output sequence.
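The sentinel and the dynamic mask can be sketched together in a toy decoding loop. This is not the paper's trained model: the bilinear weight matrix is replaced by the identity, the query is fixed, and the state values are hand-set so the behavior is easy to follow.

```python
def attend(query, states, mask):
    """Score each state against the query; masked states can never win."""
    scores = []
    for h, m in zip(states, mask):
        s = sum(q * x for q, x in zip(query, h))  # bilinear score with W = identity
        scores.append(s if m else float("-inf"))
    return max(range(len(scores)), key=lambda i: scores[i])

def decode(query, states):
    """Emit distinct slot indices until the sentinel (index 0) is selected."""
    mask = [True] * len(states)
    picked = []
    while True:
        i = attend(query, states, mask)
        if i == 0:          # sentinel: stop decoding
            return picked
        picked.append(i)
        mask[i] = False     # cross the slot out so it is never emitted twice

# Index 0 is the sentinel; indices 1-3 are candidate slot encodings (toy values).
states = [[0.1, 0.1], [0.9, 0.1], [0.8, 0.2], [-1.0, 0.0]]
print(decode([1.0, 0.0], states))  # [1, 2]
```

Slots 1 and 2 outscore the sentinel and are emitted once each; slot 3 never does, so decoding stops after two steps.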
Transformer decoder

The pointer network introduced previously yields a succession of pointers that select slots based on attention scores, which allows the model to look back and forth over the entire slot sequence to model slot dependencies. Similarly, the self-attention mechanism is also capable of modeling relationships between all slots in the dialogue, regardless of their respective positions. To compute the representation of any given slot, the self-attention model compares it to every other slot in the dialogue; the result of these comparisons is a set of attention scores that determine how much each of the other slots should contribute to the representation of the given slot. We therefore also propose to use the self-attention mechanism with transformer networks (Vaswani et al., 2017) to model slot interdependencies for this task. One major component of the transformer is the multi-head self-attention unit. Rather than computing the attention only once, the multi-head mechanism runs the scaled dot-product attention multiple times in parallel and allows the model to jointly attend to information from different perspectives at different positions, which is empirically shown to be more powerful than a single attention head (Vaswani et al., 2017). In our configurations, we vary the number of heads, as described in Section 4. The independent attention head outputs are concatenated and linearly transformed into the expected output. Given the input slot encodings, we compute the self-attention as follows:
where the superscript denotes the head number. We model the query construction (Equation 12) and the attention score (Equation 14) in the same way as their counterparts (Equations 10 and 11) in the pointer network model. The self-attended representation of a slot is a representation of that slot with its relations to all other slots taken into account. We derive the final decision over whether to carry over a slot using a 2-layer feedforward neural network over the self-attended slot representation, the original slot encoding, the context vector c, and the intent vector i:
This creates a highway network connection (Srivastava et al., 2015) between the input and the self-attention transformed encodings.
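Multi-head self-attention with the residual (highway-style) connection can be sketched in a few lines. This is a toy illustration under stated simplifications: the per-head projections are fixed slices of the input dimensions instead of learned matrices, there is no final linear transform, and the combination is a plain additive residual rather than a gated highway layer.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def self_attention(X):
    """One head of scaled dot-product self-attention with Q = K = V = X."""
    d = len(X[0])
    out = []
    for q in X:
        w = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X])
        out.append([sum(wi * v[j] for wi, v in zip(w, X)) for j in range(d)])
    return out

def multi_head(X, heads=2):
    """Run each head on a slice of the dimensions, concatenate, add the input back."""
    d = len(X[0]) // heads
    parts = [self_attention([x[h * d:(h + 1) * d] for x in X]) for h in range(heads)]
    concat = [sum((p[i] for p in parts), []) for i in range(len(X))]
    return [[a + b for a, b in zip(x, c)] for x, c in zip(X, concat)]

# Two toy 4-d slot encodings:
X = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
G = multi_head(X, heads=2)
print(len(G), len(G[0]))  # 2 4
```

Each output row keeps the input's dimensionality, so the self-attended encodings can feed the 2-layer feedforward classifier directly.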
We evaluate our approaches on both internal and external datasets. The internal dataset contains dialogues collected specifically for reference resolution, while the external dataset was collected for dialogue state tracking.
This dataset is made up of a subset of user-initiated dialogue data collected from a commercial voice-based digital assistant. This dataset has K dialogues from domains – Music, Q&A, Video, Weather, Local Businesses and Home Automation. Each domain has its own schema. There are distinct slot keys per domain and only of these keys are reused in more than one domain. To handle dialogue data belonging to a diverse schema, slots in dialogue are converted into candidate slots in the schema associated with the current domain. We follow the same slot candidate generation recipe by leveraging slot key embedding similarities as in Naik et al. (2018). These candidates are then presented to the models for selecting a subset of relevant candidate slots. Statistics for the candidate slots in the train, development, and test sets broken down by slot distances are shown in Table 1.
The DSTC2 dataset (Henderson et al., 2014) contains system-initiated dialogues between humans and dialogue systems in the restaurant booking domain. We use the top ASR hypothesis as the user utterance and use all the slots from the n-best SLU output with score above a threshold as candidate slots. These candidates are then presented to the models for selecting the subset of candidate slots that represents the user goal. Statistics for the candidate slots in the train, development, and test sets, broken down by slot distance, are shown in Table 2. Since only user-mentioned slots contribute to the user goal, there are no candidates with odd-numbered slot distances.
[Table excerpt: decoder / slot encoder / slot ordering configurations vs. F1 by slot distance]
Baseline (Naik et al., 2018): 0.8818, 0.6551, 0.0000, 0.8506
[Table excerpt: decoder / slot encoder / slot ordering configurations vs. F1 by slot distance]
Baseline (Naik et al., 2018): 0.9242, 0.9111, 0.9134, 0.8799
4.2 Experimental setup
For all the models, we initialize the word embeddings using fastText embeddings (Lample et al., 2018). The models are trained using mini-batch SGD with the Adam optimizer (Kingma and Ba, 2015) to minimize the negative log-likelihood loss, with dropout applied during training. The pointer network uses LSTM hidden states in both its encoder and decoder; the transformer decoder uses multi-head self-attention over projected keys and values, and we do not use positional encoding for the transformer decoder. The pointer network and transformer models are each trained for a fixed number of epochs, and for evaluation on the test set we pick the best model based on performance on the dev set. We use the standard definitions of precision, recall, and F1, comparing the reference slots with the model's hypothesized slots.
4.3 Results and discussion
We compare our models against the baseline model: the encoder-decoder with word attention architecture described by Naik et al. (2018). Table 3 shows the performance of the models for slots at different distances on the internal dataset.
Impact of slot ordering
Using the pointer network model, we experiment with the following slot orderings to measure the impact of order on carryover performance: no order, where slots are ordered completely randomly; turn-only order, where slots are ordered by their slot distance, but slots with the same distance (i.e., candidates generated from the same contextual turn) are ordered randomly; and temporal order, where slots are ordered exactly as they occur in the dialogue. Partially ordering slots across turns (turn-only order) significantly improves carryover performance compared to using no order. Further, enforcing a within-turn order (temporal order) improves the overall performance slightly, but we see a drop in F1 of 7 points for slots at distance 3, indicating that a strict ordering might hurt model accuracy.
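The three orderings can be sketched concretely. Slots here are (distance, position-in-utterance, key) tuples; the tuple layout and the fixed random seed are illustrative choices, not details from the paper.

```python
import random

slots = [(3, 0, "City"), (1, 2, "Date"), (3, 4, "State"), (1, 0, "Time")]

def no_order(slots, seed=0):
    """Completely random ordering."""
    out = list(slots)
    random.Random(seed).shuffle(out)
    return out

def turn_only_order(slots, seed=0):
    """Group by slot distance; order within a turn stays random."""
    shuffled = no_order(slots, seed)
    # Python's sort is stable, so the within-turn shuffle survives.
    return sorted(shuffled, key=lambda s: -s[0])

def temporal_order(slots):
    """Earlier turns first, left-to-right within each turn."""
    return sorted(slots, key=lambda s: (-s[0], s[1]))

print(temporal_order(slots))
```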
Impact of slot encoding
Here, we compare slot value representations obtained by averaging pre-trained embeddings (CTXavg) with the contextualized slot value representations obtained from a BiLSTM over the complete dialogue (CTXLSTM). The results in Table 3 show that the contextualized slot value representation substantially improves model performance compared to the non-contextual representation. This is aligned with observations on other tasks using contextual word vectors (Peters et al., 2018a; Howard and Ruder, 2018; Devlin et al., 2019).
Impact of decoder
Compared to the baseline model, both the pointer network model and the transformer model are able to carry over slots from longer dialogue context because they can model slot interdependence. With the transformer network, we completely forgo ordering information: though the slot embedding includes the distance feature, the actual order in which the slots are arranged does not matter. We see improvements in carryover performance for slots at all distances. While the pointer network seems to deal with longer context better, the transformer architecture still gives the best overall performance. For completeness, Table 4 shows performance on the DSTC2 public dataset, where similar conclusions hold.
4.4 Error Analysis
To gain deeper insight into the ability of the models to learn and utilize slot co-occurrence patterns, we measure the models' performance on buckets obtained by slicing the data using SFinal, the total number of slots after resolution (i.e., after context carryover), and SCarry, the total number of slots carried from context. For example, if the current turn utterance has 2 slots and we carry 3 slots from context after reference resolution, then SFinal and SCarry are 5 and 3, respectively. Figure 4 shows the number of instances in each of these buckets and the performance of the baseline model and the best pointer network and transformer models on the internal dataset. We notice that the baseline model performs better than the proposed models on instances along the table diagonal (SFinal = SCarry): these are the instances where the current turn has no slots, and all the necessary slots for the turn have to be carried from the historical context. The proposed models perform better in the off-diagonal buckets. We hypothesize that the proposed models use anchor slots (slots in the current utterance, which have slot distance 0 and are always positive) and learn the co-occurrence of candidate context slots with these anchors to improve resolution (i.e., carryover) from longer distances.
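The bucketing above is simple to compute. A minimal sketch, assuming slots carried from context are exactly those with distance greater than 0:

```python
def bucket(current_slots, carried_slots):
    """Return (SFinal, SCarry): slots after carryover, and slots carried from context."""
    s_carry = len(carried_slots)
    s_final = len(current_slots) + s_carry
    return s_final, s_carry

# Example from the text: 2 slots in the current turn, 3 carried from context.
print(bucket(["City", "Date"], ["Time", "State", "Genre"]))  # (5, 3)
```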
5 Related Work
Figure 5 shows a typical pipelined approach to spoken dialogue (Tur and De Mori, 2011), and where the context carryover system fits into the overall architecture. The context carryover system takes as input, an interpretation output by NLU – typically represented as intents and slots (Wang et al., 2011) – and outputs another interpretation that contains slots from the dialogue context that are relevant to the current turn. The output from context carryover is then fed to the dialogue manager to take the next action. Resolving references to slots in the dialogue plays a vital role in tracking conversation states across turns (Çelikyilmaz et al., 2014). Previous work, e.g., Bhargava et al. (2013); Xu and Sarikaya (2014); Bapna et al. (2017), focus on better leveraging dialogue contexts to improve SLU performance. However, in commercial systems like Siri, Google Assistant, and Alexa, the NLU component is a diverse collection of services spanning rules and statistical models. Typical end-to-end approaches (Bapna et al., 2017) which require back-propagation through the NLU sub-systems are not feasible in this setting.
Dialogue state tracking
Dialogue state tracking (DST) also focuses on tracking conversational states. Traditional DST models rely on hand-crafted semantic delexicalization to achieve generalization (Henderson et al., 2014; Zilka and Jurcícek, 2015; Mrksic et al., 2015). Mrksic et al. (2017) utilize representation learning for states rather than hand-crafted features. These approaches operate only on a fixed ontology and do not generalize well to unknown slot key-value pairs. Rastogi et al. (2017) address this with a sophisticated candidate generation and scoring mechanism, while Xu and Hu (2018) use a pointer network to handle unknown slot values. Zhong et al. (2018) share global parameters between estimates for each slot to address the extraction of rare slot-value pairs and achieve state-of-the-art results on DST. In context carryover, our state tracking does not rely on a definition of user goals and is instead focused on resolving slot references across turns. This approach scales when dealing with multiple spoken language systems, as we do not track belief states explicitly.
Coreference resolution

Our problem is closely related to coreference resolution, where mentions in the current utterance are detected and linked to previously mentioned entities. Previous work on coreference resolution has relied on clustering (Bagga and Baldwin, 1998; Stoyanov and Eisner, 2012) or on comparing mention pairs (Durrett and Klein, 2013; Wiseman et al., 2015; Sankepally et al., 2018). This has two problems: (1) most traditional methods for coreference resolution follow a pipeline approach with rich linguistic features, making the system cumbersome and prone to cascading errors; and (2) zero pronouns, intent references, and other phenomena in spoken dialogue are hard to capture with this approach (Rao et al., 2015). Our approach to slot carryover circumvents these problems.
6 Conclusions

In this work, we proposed an improvement to the slot carryover task as defined in Naik et al. (2018). Instead of making independent decisions across slots, we proposed two architectures that leverage slot interdependence: a pointer network architecture and a self-attention, transformer-based architecture. Our experiments show that both proposed models are better at carrying over slots across longer dialogue context. The transformer model, with its self-attention mechanism, gives the best overall performance. Furthermore, our experiments show that the temporal ordering of slots in the dialogue matters, since recent slots are more likely to be referred to by users in a spoken dialogue system. Contextualized encoding of slots is also important, which follows the trend of contextualized word embeddings (Peters et al., 2018b). For future work, we plan to improve these models by encoding actual dialogue timing information into the contextualized slot embeddings as an additional signal. We also plan to explore pre-trained representations (Devlin et al., 2019) trained specifically on large-scale dialogues as another way to obtain improved contextualized slot embeddings.
- Bagga and Baldwin (1998) Amit Bagga and Breck Baldwin. 1998. Entity-based cross-document coreferencing using the vector space model. In 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, COLING-ACL ’98, August 10-14, 1998, Université de Montréal, Montréal, Quebec, Canada. Proceedings of the Conference., pages 79–85.
- Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
- Bapna et al. (2017) Ankur Bapna, Gökhan Tür, Dilek Z. Hakkani-Tür, and Larry P. Heck. 2017. Sequential dialogue context modeling for spoken language understanding. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, Saarbrücken, Germany, August 15-17, 2017, pages 103–114.
- Bhargava et al. (2013) A. Bhargava, Asli Çelikyilmaz, Dilek Hakkani-Tür, and Ruhi Sarikaya. 2013. Easy contextual intent prediction and slot detection. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2013, Vancouver, BC, Canada, May 26-31, 2013, pages 8337–8341.
- Çelikyilmaz et al. (2014) Asli Çelikyilmaz, Zhaleh Feizollahi, Dilek Z. Hakkani-Tür, and Ruhi Sarikaya. 2014. Resolving referring expressions in conversational dialogs for natural user interfaces. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, pages 2094–2104.
- Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
- Durrett and Klein (2013) Greg Durrett and Dan Klein. 2013. Easy victories and uphill battles in coreference resolution. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1971–1982, Seattle, Washington, USA. Association for Computational Linguistics.
- Henderson et al. (2014) Matthew Henderson, Blaise Thomson, and Steve J. Young. 2014. Word-based dialog state tracking with recurrent neural networks. In Proceedings of the SIGDIAL 2014 Conference, The 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue, 18-20 June 2014, Philadelphia, PA, USA, pages 292–299.
- Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
- Howard and Ruder (2018) Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, pages 328–339.
- Kingma and Ba (2015) Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
- Lample et al. (2018) Guillaume Lample, Alexis Conneau, Ludovic Denoyer, and Marc’Aurelio Ranzato. 2018. Unsupervised machine translation using monolingual corpora only. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings.
- Luong et al. (2015) Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421, Lisbon, Portugal. Association for Computational Linguistics.
- Mrksic et al. (2015) Nikola Mrksic, Diarmuid Ó Séaghdha, Blaise Thomson, Milica Gasic, Pei-hao Su, David Vandyke, Tsung-Hsien Wen, and Steve J. Young. 2015. Multi-domain dialog state tracking using recurrent neural networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, China, Volume 2: Short Papers, pages 794–799.
- Mrksic et al. (2017) Nikola Mrksic, Diarmuid Ó Séaghdha, Tsung-Hsien Wen, Blaise Thomson, and Steve J. Young. 2017. Neural belief tracker: Data-driven dialogue state tracking. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pages 1777–1788.
- Naik et al. (2018) Chetan Naik, Arpit Gupta, Hancheng Ge, Lambert Mathias, and Ruhi Sarikaya. 2018. Contextual slot carryover for disparate schemas. In Interspeech 2018, 19th Annual Conference of the International Speech Communication Association, Hyderabad, India, 2-6 September 2018., pages 596–600.
- Peters et al. (2018a) Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018a. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), pages 2227–2237.
- Peters et al. (2018b) Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018b. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.
- Rao et al. (2015) Sudha Rao, Allyson Ettinger, Hal Daumé III, and Philip Resnik. 2015. Dialogue focus tracking for zero pronoun resolution. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 494–503, Denver, Colorado. Association for Computational Linguistics.
- Rastogi et al. (2017) Abhinav Rastogi, Dilek Hakkani-Tür, and Larry P. Heck. 2017. Scalable multi-domain dialogue state tracking. In 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017, Okinawa, Japan, December 16-20, 2017, pages 561–568.
- Sankepally et al. (2018) Rashmi Sankepally, Tongfei Chen, Benjamin Van Durme, and Douglas W. Oard. 2018. A test collection for coreferent mention retrieval. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR 2018, Ann Arbor, MI, USA, July 08-12, 2018, pages 1209–1212.
- Srivastava et al. (2015) Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. 2015. Highway networks. Computing Research Repository, arXiv:1505.00387.
- Stoyanov and Eisner (2012) Veselin Stoyanov and Jason Eisner. 2012. Easy-first coreference resolution. In COLING 2012, 24th International Conference on Computational Linguistics, Proceedings of the Conference: Technical Papers, 8-15 December 2012, Mumbai, India, pages 2519–2534.
- Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 3104–3112.
- Tur and De Mori (2011) Gokhan Tur and Renato De Mori. 2011. Spoken Language Understanding: Systems for Extracting Semantic Information from Speech. John Wiley and Sons.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 6000–6010.
- Vinyals et al. (2015) Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer networks. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 2692–2700.
- Wang et al. (2011) Ye-Yi Wang, Li Deng, and Alex Acero. 2011. Semantic Frame-Based Spoken Language Understanding, pages 35–80. Wiley.
- Wiseman et al. (2015) Sam Wiseman, Alexander M. Rush, Stuart M. Shieber, and Jason Weston. 2015. Learning anaphoricity and antecedent ranking features for coreference resolution. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, China, Volume 1: Long Papers, pages 1416–1426.
- Xu and Hu (2018) Puyang Xu and Qi Hu. 2018. An end-to-end approach for handling unknown slot values in dialogue state tracking. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, pages 1448–1457.
- Xu and Sarikaya (2014) Puyang Xu and Ruhi Sarikaya. 2014. Contextual domain classification in spoken language understanding systems using recurrent neural network. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2014, Florence, Italy, May 4-9, 2014, pages 136–140.
- Zhong et al. (2018) Victor Zhong, Caiming Xiong, and Richard Socher. 2018. Global-locally self-attentive encoder for dialogue state tracking. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, pages 1458–1467.
- Zilka and Jurcícek (2015) Lukás Zilka and Filip Jurcícek. 2015. Incremental lstm-based dialog state tracker. In 2015 IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2015, Scottsdale, AZ, USA, December 13-17, 2015, pages 757–762.