Recent years have seen great success in applying deep learning approaches to enhance the capabilities of virtual assistants (VAs) such as Amazon Alexa, Google Home and Apple Siri. One of the challenges in building these systems is mapping the meaning of users’ utterances, which are expressed in natural language, to a machine-comprehensible language Allen (1995). An example is illustrated in Figure 1. For the utterance “Show the cheapest flight from Toronto to St. Louis”, the machine needs to map the utterance to an intent, Airfare (intent detection), and to slots such as Toronto: FromLocation (slot filling). In this work, we focus on intent detection and slot filling and refer to these as Natural Language Understanding (NLU) tasks.
Previous work shows that a simple deep neural architecture delivers better performance on NLU tasks than traditional models such as Conditional Random Fields Collobert et al. (2011). Since then, deep neural architectures, predominantly recurrent neural networks, have become an indispensable part of building NLU systems Zhang and Wang (2016); Goo et al. (2018); E et al. (2019). Transformer-based architectures, introduced more recently by Vaswani et al. (2017), have shown significant improvements over previous works on NLU tasks Chen et al. (2019); Qin et al. (2019).
Recent studies show that although the Transformer model can learn syntactic knowledge purely by seeing examples, explicitly feeding this knowledge to such models can significantly enhance their performance on tasks such as neural machine translation Sundararaman et al. (2019) and semantic role labeling Strubell et al. (2018). While incorporating syntactic knowledge has been shown to improve performance on NLU tasks Tur et al. (2011); Chen et al. (2016), both of these works assume syntactic knowledge is provided by external models at training and inference time.
In this paper, we introduce a novel Transformer encoder-based architecture for NLU tasks that encodes syntactic knowledge and does not require syntactic information to be available at inference time. This is accomplished, first, by training one attention head to predict the syntactic ancestors of each token. The dependency relationships between tokens are obtained from syntactic dependency trees, where each word in a sentence is assigned a syntactic head that is either another word in the sentence or an artificial root symbol Dozat and Manning (2016). Adding the objective of dependency relationship prediction allows a given token to attend more to its syntactically relevant parent and ancestors. In addition to dependency parsing knowledge, we encode part-of-speech (POS) information in the Transformer encoder, because previous research shows that POS information can help dependency parsing Nguyen and Verspoor (2018). The closest work to ours is Strubell et al. (2018); however, they focused on semantic role labeling and trained one attention head to predict the direct parent instead of all ancestors.
We compare our models with several state-of-the-art neural NLU models on two publicly available benchmarking datasets: the ATIS Hemphill et al. (1990) and SNIPS Coucke et al. (2018) datasets. The results show that our models outperform previous works. To examine the effects of adding syntactic information, we conduct an ablation study and visualize the self-attention weights in the Transformer encoder.
We define intent detection (ID) and slot filling (SF) as utterance-level and token-level multi-class classification tasks, respectively. Given an input utterance, we predict an intent and a sequence of slots, one per token, as outputs. We add an empty slot, denoted by “O”, to represent words carrying no labels. The goal is to maximize the likelihood of the correct intents and slots given the input utterances.
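As a concrete illustration of the joint task's input and output structure (the labels below are hypothetical, chosen in the style of the ATIS example above):

```python
# Hypothetical example of the joint NLU task: one intent per utterance,
# one slot label per token, with "O" marking tokens that carry no slot.
utterance = "show the cheapest flight from toronto to st. louis".split()

# Intent detection: a single utterance-level class.
intent = "Airfare"

# Slot filling: one label per token (BIO-style labels, illustrative only).
slots = ["O", "O", "B-CostRelative", "O", "O",
         "B-FromLocation", "O", "B-ToLocation", "I-ToLocation"]

assert len(slots) == len(utterance)  # one slot label per token
```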
We jointly train our model for the NLU tasks (i.e., ID and SF), syntactic dependency prediction and POS tagging via multi-task learning Caruana (1993), as shown in Figure 2. For dependency prediction, we insert a syntactically-informed Transformer encoder layer into the encoder stack. In this encoder layer, one attention head is trained to predict the full ancestry of each token in the dependency parse tree. For POS tagging, we add a POS tagging model that shares the lower Transformer encoder layers with the NLU model. We describe the details of our proposed architecture below.
The input embedding model maps a sequence of tokens into a sequence of continuous embeddings, the first of which is the embedding of a special start-of-sentence token, “[SOS]”. The embeddings are then fed into the NLU model.
Transformer Encoder Layer
The Transformer encoder layer was originally proposed in Vaswani et al. (2017). Each encoder layer consists of a multi-head self-attention layer and feed-forward layers with layer normalization and residual connections. We stack multiple encoder layers, each with $H$ attention heads, to learn contextual embeddings of each token. Let $X^{l-1}$ denote the output embeddings of the $(l-1)$-th encoder layer. Each attention head at layer $l$ first calculates self-attention weights by the scaled dot product:

$$A = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) \quad (1)$$

In (1), the query $Q$ and key $K$ are two different linear transformations of $X^{l-1}$, and $d_k$ is the dimension of the query and key embeddings. The output of the attention head is calculated by:

$$\mathrm{head} = AV \quad (2)$$

in which the value $V$ is also a linear transformation of $X^{l-1}$. The outputs of the $H$ attention heads are concatenated as the self-attended token representations, followed by another linear transformation:

$$M = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_H)\,W^{O} \quad (3)$$

which is fed into the next feed-forward layer. Residual connections and layer normalization are applied after the multi-head attention and feed-forward layers, respectively.
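The attention computation described above can be sketched as follows (a minimal NumPy illustration of one self-attention head with toy weight matrices; the function and variable names are ours, not the paper's):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, Wq, Wk, Wv):
    """One self-attention head: A = softmax(Q K^T / sqrt(d_k)), output = A V."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv     # linear transformations of X
    d_k = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d_k))  # (n, n) weights; each row sums to 1
    return A @ V, A

rng = np.random.default_rng(0)
n, d, d_k = 5, 8, 4                      # toy sizes: 5 tokens, model dim 8
X = rng.normal(size=(n, d))
out, A = attention_head(X, *(rng.normal(size=(d, d_k)) for _ in range(3)))
assert out.shape == (n, d_k) and np.allclose(A.sum(axis=1), 1.0)
```

In the full layer, several such heads run in parallel and their outputs are concatenated and projected, as in (3).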
Encoding Syntactic Dependency Knowledge
As shown in Figure 3, the syntactically-informed Transformer encoder layer differs from the standard Transformer encoder layer in that one of its attention heads is trained to predict the full ancestry of each token, i.e., its parent, grandparent, great-grandparent, etc. Different from Strubell et al. (2018), we use full ancestry prediction instead of just direct parent prediction; we demonstrate the benefits of this approach in the Results section.
Given an input sequence of length $n$, the output of a regular attention head is an $n \times n$ matrix, in which each row contains the attention weights that a token puts on all the tokens in the input sequence. The output of the syntactically-informed attention head is also an $n \times n$ matrix, but this attention head is trained to assign weights only to the syntactic governors (i.e., ancestors) of each token.
To train this attention head, we define a loss function as the difference between the attention weight matrix output by the syntactically-informed attention head and a predefined prior attention weight matrix. The prior attention weight matrix encodes the prior knowledge that each token should attend to its syntactic parse ancestors, with higher attention weights on ancestors closer to that token. During training, we obtain the prior attention weights from the outputs of a pre-trained dependency parser.
For example, in the utterance “list flights arriving in Toronto on March first”, the syntactic parse ancestors of the word “first” are “March”, “arriving” and “flights”, which are one, two and three hops away on the dependency tree, respectively, as shown in Figure 4. These ancestors are syntactically meaningful for determining the slot of “first”, which is “arrive date, day number” in this case.
To train the attention head to assign higher weights to the ancestors of each token, we define the prior attention weights of each token based on the distance between the token and its ancestors. Formally, the prior attention weight that token $i$ places on its ancestor $j$ is defined as:

$$P_{ij} = \sigma\left(-\,d(i,j)\,/\,T\right) \quad (4)$$

in which $d(i,j)$ is the distance between tokens $i$ and $j$, $\sigma$ is the softmax function taken over the ancestors of token $i$, and $T$ is the temperature of the softmax function controlling the variance of the attention weights over all the ancestors. The stack of prior attention weights of all tokens is an $n \times n$ matrix, denoted by $P$. We train our model to decrease the difference between $P$ and the attention matrix $A^{syn}$ output by the syntactically-informed attention head. The difference is measured by the mean of the row-wise KL divergence between these two matrices, which is used as an additional loss besides the NLU loss functions. We refer to this loss as the dependency loss, denoted by $L_{dep}$; formally:

$$L_{dep} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{KL}\left(P_i \,\|\, A^{syn}_i\right) \quad (5)$$

$$A^{syn} = \mathrm{softmax}\left(Q\,U\,K^{\top}\right) \quad (6)$$

in which $\mathrm{KL}$ denotes the KL divergence, $Q$ and $K$ are linear transformations of the layer input, $P_i$ is the $i$-th row of $P$, and $U$ is a parameter matrix. In (6) we use biaffine attention instead of the scaled dot-product attention, which has been shown to be effective for dependency parsing Dozat and Manning (2016).
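A minimal NumPy sketch of this construction, under our reading of the description: prior weight on an ancestor decays with tree distance via a temperature-scaled softmax, non-ancestors get zero weight, and the dependency loss is the mean row-wise KL divergence. All helper names are illustrative, and the handling of the root token (attending to itself) is our assumption:

```python
import numpy as np

def ancestor_distances(heads):
    """heads[i] = index of token i's dependency head (-1 for the root).
    Returns dist[i, j] = hops from i up to ancestor j (inf if j is not an ancestor)."""
    n = len(heads)
    dist = np.full((n, n), np.inf)
    for i in range(n):
        j, hops = heads[i], 1
        while j != -1:
            dist[i, j] = hops
            j, hops = heads[j], hops + 1
    return dist

def prior_attention(heads, T=1.0):
    """Softmax of -distance/T over each token's ancestors."""
    logits = -ancestor_distances(heads) / T   # non-ancestors stay at -inf -> weight 0
    for i in range(len(heads)):
        if not np.isfinite(logits[i]).any():
            logits[i, i] = 0.0                # root row: fall back to self-attention
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def dependency_loss(P, A, eps=1e-9):
    """Mean row-wise KL(P || A) between prior and predicted attention."""
    return float(np.mean(np.sum(P * (np.log(P + eps) - np.log(A + eps)), axis=1)))

# Toy chain tree: token 3's ancestors are tokens 2 (1 hop), 1 (2), 0 (3).
heads = [-1, 0, 1, 2]
P = prior_attention(heads, T=1.0)
assert np.allclose(P.sum(axis=1), 1.0)
assert P[3, 2] > P[3, 1] > P[3, 0]  # closer ancestors get higher prior weight
```

A perfectly matching attention head would drive the loss to zero: `dependency_loss(P, P)` is 0.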
Encoding Part-of-Speech Knowledge
Part-of-speech (POS) information is important for disambiguating words with multiple meanings Alva and Hegde (2016), because an ambiguous word carries a specific POS in a particular context Pal et al. (2015). For instance, the word “May” could be either a verb or a noun, and being aware of its POS tag is beneficial for downstream tasks, such as predicting the slots in the utterance “book a flight on May 1st”. Furthermore, previous studies have shown that while models trained for a sufficiently large number of steps can potentially learn the underlying patterns of POS, this knowledge is imperfect Jawahar et al. (2019); Sundararaman et al. (2019). For these reasons, we explicitly train our model to perform POS tagging using POS tags generated by a pretrained POS tagger.
Similar to slot filling, we cast POS tagging as a token-level classification problem. We apply an MLP-based classifier on the output embeddings of the shared Transformer encoder layers and use cross entropy as the loss function:

$$L_{pos} = -\sum_{i=1}^{n} \sum_{j=1}^{N_{p}} y^{p}_{ij} \log \hat{y}^{p}_{ij} \quad (7)$$

in which $\hat{y}^{p}_{ij}$ is the predicted probability of the $i$-th token’s POS label being the $j$-th label in the POS label space, $N_{p}$ is the total number of POS labels, and $y^{p}_{ij}$ is the one-hot representation of the ground-truth POS label.
Intent Detection and Slot Filling
We apply a linear classifier on the embedding of the “[SOS]” token output by the last Transformer encoder layer, and use cross-entropy loss for intent detection. The loss on one utterance is defined as:

$$L_{id} = -\sum_{j=1}^{N_{I}} y^{I}_{j} \log \hat{y}^{I}_{j}, \qquad \hat{y}^{I} = \mathrm{softmax}\left(W h_{[SOS]} + b\right) \quad (8)$$

in which $W$ and $b$ are the parameters of the linear classifier, $y^{I}$ is the one-hot representation of the ground-truth intent label, $N_{I}$ is the total number of intent labels, and $\hat{y}^{I}_{j}$ is the predicted probability of this utterance’s intent label being the $j$-th label in the intent label space.
We apply an MLP-based classifier on the token embeddings output by the last Transformer encoder layer, using cross entropy as the loss function. The loss on one utterance is defined as follows:

$$L_{sf} = -\sum_{i=1}^{n} \sum_{j=1}^{N_{S}} y^{S}_{ij} \log \hat{y}^{S}_{ij} \quad (9)$$

in which $\hat{y}^{S}_{ij}$ is the predicted probability of the $i$-th token’s slot being the $j$-th label in the slot space, $N_{S}$ is the total number of slots, and $y^{S}_{ij}$ is the one-hot representation of the ground-truth slot label.
We train our model via multi-task learning Caruana (1993). Our loss function is defined as:

$$L = L_{nlu} + \alpha L_{dep} + \beta L_{pos} \quad (10)$$

where $L_{nlu}$ equals $L_{sf}$ for slot filling, $L_{id}$ for intent detection, or $L_{sf} + L_{id}$ for joint training, and $\alpha$ and $\beta$ are the coefficients of the dependency prediction loss and the POS tagging loss, respectively. $\alpha$ and $\beta$ are treated as hyperparameters and selected based on validation performance.
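The combination of the task losses can be sketched as below (the function name, argument names, and coefficient values are illustrative, not the paper's):

```python
# A minimal sketch of the multi-task objective: the NLU loss is the slot-filling
# loss, the intent loss, or their sum, plus weighted auxiliary syntactic losses.
def total_loss(l_sf, l_id, l_dep, l_pos, alpha, beta, mode="joint"):
    if mode == "sf":
        l_nlu = l_sf
    elif mode == "id":
        l_nlu = l_id
    else:  # joint training of both NLU tasks
        l_nlu = l_sf + l_id
    return l_nlu + alpha * l_dep + beta * l_pos

# Toy values: joint NLU loss 3.0 plus the two weighted auxiliary losses.
assert abs(total_loss(1.0, 2.0, 0.5, 0.25, alpha=0.1, beta=0.1) - 3.075) < 1e-9
```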
| Model | SNIPS SF (F1) | SNIPS ID (Acc) | ATIS SF (F1) | ATIS ID-M (Acc) | ATIS ID-S (Acc) |
| --- | --- | --- | --- | --- | --- |
| Joint Seq Hakkani-Tür et al. (2016) | 87.30 | 96.90 | 94.30 | 92.60 | - |
| Attention-based RNN Liu and Lane (2016) | 87.80 | 96.70 | 95.78 | - | 97.98 |
| Slot-Gated Goo et al. (2018) | 89.27 | 96.86 | 95.42 | 95.41 | - |
| SF-ID, SF first E et al. (2019) | 91.43 | 97.43 | 95.75 | 97.76 | - |
| SF-ID, ID first E et al. (2019) | 92.23 | 97.29 | 95.80 | 97.09 | - |
| Stack-Propagation Qin et al. (2019) | 94.20 | 98.00 | 95.90 | 96.90 | - |
| Graph LSTM Zhang et al. (2020a) | 95.30 | 98.29 | 95.91 | 97.20 | - |
| JointBERT Chen et al. (2019) | 97.00 | 98.60 | 96.10 | 97.50 | - |
We conducted experiments on two benchmark datasets: the Airline Travel Information Systems (ATIS) Hemphill et al. (1990) and SNIPS Coucke et al. (2018) datasets. The ATIS dataset focuses on airline information and has been used as a benchmark for NLU tasks. We used the same version as Goo et al. (2018); E et al. (2019), which contains 4,478 utterances for training, 500 for validation and 893 for testing. The SNIPS dataset focuses on personal assistant commands, with a larger vocabulary size and more diverse intents and slots. It contains 13,084 utterances for training, 700 for validation and 700 for testing.
We use classification accuracy for intent detection and the F1 score, the harmonic mean of precision and recall, for slot filling. For the SNIPS dataset, we use the same version and evaluation method as previous works Zhang et al. (2020a). For the ATIS dataset, we find that previous works use two different evaluation methods for intent detection on utterances with multiple labels. The first method counts a prediction as correct if it equals one of the ground-truth labels of the utterance Liu and Lane (2016); we refer to this as the single-label matching method (ID-S). The second method counts a prediction as correct only if it matches all labels of the utterance Goo et al. (2018); E et al. (2019); we refer to this as the multiple-label matching method (ID-M). We report both in our results.
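The distinction between the two conventions can be sketched as follows. This is an illustrative helper of our own (in practice, multi-label gold intents are often encoded as a single combined label, which this sketch approximates by comparing against a gold label set):

```python
# Two intent-accuracy conventions for multi-label ATIS utterances.
def intent_accuracy(preds, golds, mode="ID-M"):
    """preds: one predicted label per utterance; golds: a set of gold labels each."""
    correct = 0
    for pred, gold in zip(preds, golds):
        if mode == "ID-S":               # single-label matching: any gold label
            correct += pred in gold
        else:                            # ID-M: must account for all gold labels
            correct += {pred} == set(gold)
    return correct / len(preds)

golds = [{"airfare"}, {"airfare", "flight"}]
preds = ["airfare", "airfare"]
assert intent_accuracy(preds, golds, "ID-S") == 1.0  # 2nd pred matches one gold label
assert intent_accuracy(preds, golds, "ID-M") == 0.5  # but not all of them
```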
Our experiments are implemented in PyTorch Paszke et al. (2017). The hyperparameters are selected based on performance on the validation set. We use the Adam optimizer Kingma and Ba (2015) with the weight decay fix described in Loshchilov and Hutter (2017). Our learning rate schedule first increases the learning rate linearly from 0 to 0.0005 (warm-up) and then decreases it to 0 following the values of the cosine function. The number of warm-up steps, a fraction of the total training steps, is determined by validation performance. We use the optimizer and learning rate scheduler implementations from the Transformers library Wolf et al. (2019).
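The schedule just described can be sketched as a function of the step count (a minimal illustration of linear warm-up followed by cosine decay; the function name is ours, and this mirrors the shape computed by `get_cosine_schedule_with_warmup` in the Transformers library):

```python
import math

def lr_at(step, total_steps, warmup_steps, peak_lr=0.0005):
    """Linear warm-up from 0 to peak_lr, then cosine decay back to 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total, warmup = 1000, 100
assert lr_at(0, total, warmup) == 0.0                        # starts at zero
assert abs(lr_at(50, total, warmup) - 0.00025) < 1e-12       # halfway through warm-up
assert abs(lr_at(warmup, total, warmup) - 0.0005) < 1e-12    # peak at end of warm-up
assert abs(lr_at(total, total, warmup)) < 1e-12              # decayed to ~0 at the end
```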
We use Stanza Qi et al. (2020) to generate training labels for POS tagging and dependency prediction. The loss coefficients $\alpha$ and $\beta$ are likewise selected on the validation set, both for the NLU model trained with dependency prediction and POS tagging and for the model trained with dependency prediction only. We use a weight decay of 0.1 and dropout rates Srivastava et al. (2014) of 0.1 and 0.3 for the SNIPS and ATIS datasets, respectively. We use a batch size of 32 and report the testing results of the checkpoints achieving the best validation performance.
We use the concatenation of GloVe embeddings Pennington et al. (2014) and character embeddings Hashimoto et al. (2017) as token embeddings and keep them frozen during training. The hidden dimension of the Transformer encoder layers is 768 and the size of the feed-forward layer is 3072. Considering the small size of the two datasets, we only use two Transformer encoder layers in total (as in Figure 2), each with multiple attention heads. For slot filling, we apply Viterbi decoding at test time over the BIO annotation schema, a standard format for slot filling shown in Figure 1. The transition probabilities are set manually to ensure that the output sequences of BIO labels are valid, by simply setting the probabilities of invalid transitions to zero and the probabilities of valid transitions to one.
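The hard-constrained decoding described above can be sketched as follows (the label names are hypothetical; invalid BIO transitions get probability zero, i.e., score minus infinity in log space):

```python
import numpy as np

def valid_transition(prev, cur):
    """BIO constraint: 'I-X' may only follow 'B-X' or 'I-X' of the same type X."""
    if cur.startswith("I-"):
        return prev[2:] == cur[2:] and prev[0] in ("B", "I")
    return True  # "O" and "B-X" may follow anything

def viterbi_bio(emissions, labels):
    """Viterbi decoding with 0/1 transition probabilities enforcing valid BIO.
    emissions: (n_tokens, n_labels) log-probabilities from the slot classifier."""
    n, k = emissions.shape
    trans = np.array([[0.0 if valid_transition(p, c) else -np.inf
                       for c in labels] for p in labels])
    score = emissions[0].copy()
    back = np.zeros((n, k), dtype=int)
    for t in range(1, n):
        cand = score[:, None] + trans + emissions[t][None, :]
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return [labels[i] for i in reversed(path)]

labels = ["O", "B-ToLoc", "I-ToLoc"]
# Token 2 locally prefers I-ToLoc; "O -> I-ToLoc" is invalid, so decoding must
# route through B-ToLoc at token 1 to produce a valid sequence.
em = np.log(np.array([[0.9, 0.05, 0.05],
                      [0.6, 0.3, 0.1],
                      [0.1, 0.2, 0.7]]))
seq = viterbi_bio(em, labels)
assert seq == ["O", "B-ToLoc", "I-ToLoc"]
```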
We compare our proposed model with the following baseline models:
Joint Seq Hakkani-Tür et al. (2016) is a joint model for intent detection and slot filling based on the bi-directional LSTM model.
Attention-based RNN Liu and Lane (2016) is a sequence-to-sequence model with the attention mechanism.
Slot-Gated Goo et al. (2018) utilizes intent information for slot filling through the gating mechanism.
SF-ID E et al. (2019) is an architecture that enables the interaction between intent detection and slot filling.
Stack-Propagation Qin et al. (2019) is a joint model based on the Stack-Propagation framework.
Graph LSTM Zhang et al. (2020a) is based on the Graph LSTM model.
TF is the Transformer encoder-based model trained without syntactic information.
Table 1 shows the performance of the baseline and proposed models for SF and ID on the SNIPS and ATIS datasets. Overall, our proposed models achieve the best performance on both benchmark datasets. On the SNIPS dataset, our proposed joint model achieves higher absolute F1 for SF and higher accuracy for ID than the best-performing baseline model without pre-training Zhang et al. (2020a). On the ATIS dataset, our proposed joint model likewise achieves higher absolute F1 for SF and higher accuracy for ID-S than the best-performing baseline models for SF Zhang et al. (2020a) and ID-S Liu and Lane (2016), respectively. In addition, our proposed independent model matches the performance of the best-performing baseline model on ID-M (E et al., 2019, SF-ID, SF first).
Moreover, the Transformer encoder-based model without syntactic knowledge already achieves state-of-the-art results on the SNIPS dataset and is only slightly worse than the state of the art on the ATIS dataset. This indicates the power of the Transformer encoder for SF and ID. The further improvement of our models over the baselines demonstrates the benefits of incorporating syntactic knowledge. Additionally, compared to previous works with heterogeneous model structures, our models are purely based on self-attention and feed-forward layers.
We also find that our proposed models can outperform the JointBERT model with pre-training Chen et al. (2019) on the intent detection tasks: our proposed joint model achieves higher absolute accuracy for ID on the SNIPS dataset, and our proposed independent model achieves higher absolute accuracy for ID-M on the ATIS dataset. While our proposed model does not outperform the JointBERT model for SF, the performance gap is relatively small on both SNIPS and ATIS. It should be noted that our model does not require pre-training and is only one seventh the size of the JointBERT model (16 million vs. 110 million parameters).
Previous works have shown that models like BERT can learn syntactic knowledge by self-supervision Clark et al. (2019); Manning et al. (2020). This can partially explain why JointBERT achieves very good results without being fed syntactic knowledge explicitly.
Table 2 shows the results of the ablation study on the effects of adding different syntactic information. A first observation is that the model trained with a single syntactic task, either dependency prediction or POS tagging, outperforms the baseline Transformer encoder-based model without syntactic information. This gives us confidence that syntactic information can help improve model performance. Moreover, training a Transformer model with both syntactic tasks achieves even better results than training with a single syntactic task. This could be because the POS tagging task improves the performance of the dependency prediction task Nguyen and Verspoor (2018), which in turn improves the performance of SF and ID.
| Model | SNIPS SF (F1) | SNIPS ID (Acc) | ATIS SF (F1) | ATIS ID-M (Acc) | ATIS ID-S (Acc) |
| --- | --- | --- | --- | --- | --- |
| TF + D | 96.31 | 98.43 | 95.99 | 96.53 | 98.76 |
| TF + P | 96.47 | 98.57 | 95.82 | 97.31 | 98.10 |
| TF + D + P | 96.56 | 98.71 | 95.94 | 97.76 | 98.10 |
Interestingly, we observe that adding dependency prediction slightly reduces slot filling performance on the SNIPS dataset compared to the baseline Transformer encoder-based model. There are several potential reasons. First, the sentences in the SNIPS dataset are overall shorter than in the ATIS dataset, so syntactic dependency information might be less helpful. Second, previous work has shown that syntactic parsing performance often suffers when a named entity span has crossing brackets with spans on the parse tree Finkel and Manning (2009). Thus, the dependency prediction performance of our model might decrease due to the presence of many named entities in the SNIPS dataset, such as song and movie names, which could introduce noisy dependency information into the attention weights and degrade performance on the NLU tasks.
We qualitatively examined the errors made by the Transformer encoder-based models with and without syntactic information to understand in what ways syntactic information helps improve the performance. Our major findings are:
ID errors related to prepositions with nouns: Prepositions, when appearing between nouns, describe the relationship between them. For example, in the utterance “kansas city to atlanta monday morning flights”, the preposition “to” denotes the direction from “kansas city” (departure location, noun) to “atlanta” (arrival location, noun). Without this knowledge, a model could misclassify the intent of this utterance as asking for city information rather than flight information. We found that a considerably larger share of the errors made by the model without syntactic information contains this pattern than of the misclassified utterances of the model with syntactic information (see Appendix A for the full list).
SF errors due to POS confusion: A word can have multiple meanings depending on context. For example, the word “may” can be a verb expressing possibility or a noun referring to the fifth month of the year. We found that correctly recognizing the POS of words can help reduce slot filling errors. For example, in the utterance “May I have the movie schedules for Speakeasy Theaters”, the slot for “May” should be empty, but the model without syntactic information predicts it as “Time Range”. By contrast, the model with syntactic information predicts this word correctly, probably because the noun vs. verb confusion for “May” is resolved by incorporating POS information. More examples are included in Appendix A.
| Model | SNIPS SF (F1) | SNIPS ID (Acc) | ATIS SF (F1) | ATIS ID (Acc) |
| --- | --- | --- | --- | --- |
| TF + Par. | 96.20 | 98.29 | 95.58 | 98.10 |
| TF + Anc. | 96.31 | 98.43 | 95.99 | 98.76 |
Parent Prediction vs. Ancestor Prediction
We compare our approach of predicting all ancestors of each token with the approach described in Strubell et al. (2018), which predicts only the direct dependency parent of each token. The results in Table 3 show that our approach achieves better results for both ID and SF on the two datasets, demonstrating that it is more beneficial to the NLU tasks. We hypothesize that syntactic ancestor prediction better captures long-distance syntactic relationships. As shown in Tur et al. (2010), long-distance dependencies are important for slot filling. For example, in the utterance “Find flights to LA arriving in no later than next Monday”, a 6-gram context is needed to figure out that “Monday” is the arrival date instead of the departure date.
Visualization of Attention Weights
We visualize the attention weights output by models trained with and without syntactic information to understand what the models have learned by incorporating syntactic information. We select the utterance “show me the flights on american airlines which go from st. petersburg to ontario california by way of st. louis” from the ATIS testing set. Only the model trained with syntactic information predicts the slot labels correctly. As shown in Figure 5, the model without syntactic information has simple attention patterns on both layers, such as looking backward and looking forward. Other attention heads seem to be random and less informative.
In contrast, the model with syntactic information shows more informative attention patterns. On the first layer, all the attention heads present simple but diverse patterns: besides looking forward and backward, the second attention head looks in both directions for each token. On the second layer, however, we observe more complex patterns and long-distance attention, which could account for more task-oriented operations. It is therefore possible that, with syntactic supervision, the Transformer encoder learns the attention weights better and can devote more capacity to the end task.
Early work on NLU primarily used traditional machine learning classifiers such as CRFs Haffner et al. (2003). Recently, neural models have been increasingly applied to NLU tasks. These approaches, primarily based on RNNs, have been shown to outperform traditional models Mesnil et al. (2014); Tur et al. (2012); Zhang and Wang (2016); Goo et al. (2018); E et al. (2019). For example, Mesnil et al. (2014) employed RNNs for slot filling and found a relative improvement in F1 compared to CRFs. Some works have also explored Transformer encoder and graph LSTM-based neural architectures Chen et al. (2019); Zhang et al. (2020a).
Syntactic information has been shown to be beneficial to many tasks, such as neural machine translation Akoury et al. (2019), semantic role labeling Strubell et al. (2018), and machine reading comprehension Zhang et al. (2020b). Research on NLU tasks has also shown that incorporating syntactic information into machine learning models can improve performance. Moschitti et al. (2007) used syntactic information for slot filling, employing a tree kernel function to encode the structural information acquired by a syntactic parser. An extensive analysis on the ATIS dataset revealed that most NLU errors are caused by complex syntactic characteristics, such as prepositional phrases and long-distance dependencies Tur et al. (2010). Tur et al. (2011) proposed a rule-based, dependency-parsing-based sentence simplification method to augment the input utterances based on their syntactic structure. Compared to previous works, ours is the first to encode syntactic knowledge into end-to-end neural models for intent detection and slot filling.
In this paper, we propose to encode syntactic knowledge into a Transformer encoder-based model for intent detection and slot filling. Experimental results indicate that a model with only two Transformer encoder layers can already match or even outperform the state-of-the-art performance on two benchmark datasets. Moreover, we show that the performance of this baseline model can be further improved by incorporating syntactic supervision. The visualization of the attention weights also reveals that syntactic supervision can help the model better learn syntactically-related patterns. For future work, we will evaluate our approach with larger model sizes on larger-scale datasets containing more syntactically complex utterances. Furthermore, we will investigate incorporating syntactic knowledge into models pretrained by self-supervision and applying those models to the NLU tasks.
We would like to thank Siegfried Kunzmann, Nathan Susanj, Ross McGowan, and anonymous reviewers for their insightful feedback that greatly improved our paper.
- Syntactically supervised transformers for faster neural machine translation. arXiv preprint arXiv:1906.02780.
- Natural language understanding (2nd ed.). Benjamin-Cummings Publishing Co., Inc., USA.
- Hidden Markov model for POS tagging in word sense disambiguation. In 2016 International Conference on Computation System and Information Technology for Sustainable Solutions (CSITSS), pp. 279–284.
- Multitask learning: a knowledge-based source of inductive bias. In Proceedings of the Tenth International Conference on Machine Learning (ICML ’93), pp. 41–48.
- BERT for joint intent classification and slot filling. arXiv preprint arXiv:1902.10909.
- Syntax or semantics? Knowledge-guided joint semantic frame parsing. In 2016 IEEE Spoken Language Technology Workshop (SLT), pp. 348–355.
- What does BERT look at? An analysis of BERT’s attention. arXiv preprint arXiv:1906.04341.
- Natural language processing (almost) from scratch. Journal of Machine Learning Research 12, pp. 2493–2537.
- Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces. arXiv preprint arXiv:1805.10190.
- BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Deep biaffine attention for neural dependency parsing. arXiv preprint arXiv:1611.01734.
- A novel bi-directional interrelated model for joint intent detection and slot filling. In ACL 2019: The 57th Annual Meeting of the Association for Computational Linguistics, pp. 5467–5471.
- Joint parsing and named entity recognition. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 326–334.
- Slot-gated modeling for joint slot filling and intent prediction. In NAACL HLT 2018: 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol. 2, pp. 753–757.
- How may I help you?. Speech Communication 23 (1–2), pp. 113–127.
- Optimizing SVMs for complex call classification. In 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’03), Vol. 1.
- Multi-domain joint semantic frame parsing using bi-directional RNN-LSTM. In Interspeech 2016, pp. 715–719.
- A joint many-task model: growing a neural network for multiple NLP tasks. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 1923–1933.
- The ATIS spoken language systems pilot corpus. In Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, June 24–27, 1990.
- What does BERT learn about the structure of language?.
- Adam: a method for stochastic optimization. In ICLR 2015: International Conference on Learning Representations.
- Attention-based recurrent neural network models for joint intent detection and slot filling. In Interspeech 2016, pp. 685–689.
- Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
- Emergent linguistic structure in artificial neural networks trained by self-supervision. Proceedings of the National Academy of Sciences.
- Using recurrent neural networks for slot filling in spoken language understanding. IEEE/ACM Transactions on Audio, Speech, and Language Processing 23 (3), pp. 530–539.
- Spoken language understanding with kernels for syntactic/semantic structures. In 2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU), pp. 183–188.
- An improved neural network model for joint POS tagging and dependency parsing. arXiv preprint arXiv:1807.03955.
- An approach to speed-up the word sense disambiguation procedure through sense filtering. arXiv preprint arXiv:1610.06601.
- Automatic differentiation in PyTorch. In NIPS-W.
- GloVe: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543.
- Evaluation of spoken language systems: the ATIS domain. In Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, June 24–27, 1990.
- Stanza: a Python natural language processing toolkit for many human languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations.
- A stack-propagation framework with token-level intent detection for spoken language understanding. In 2019 Conference on Empirical Methods in Natural Language Processing, pp. 2078–2087.
- Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15 (1), pp. 1929–1958.
- Linguistically-informed self-attention for semantic role labeling. In EMNLP 2018: 2018 Conference on Empirical Methods in Natural Language Processing, pp. 5027–5038.
- Syntax-infused transformer and BERT models for machine translation and natural language understanding. arXiv preprint arXiv:1911.06156.
- Towards deeper understanding: deep convex networks for semantic utterance classification. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5045–5048.
- Sentence simplification for spoken language understanding. In 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5628–5631.
- What is left to be understood in ATIS?. In 2010 IEEE Spoken Language Technology Workshop, pp. 19–24.
- Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 5998–6008.
- HuggingFace’s Transformers: state-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.
Graph lstm with context-gated mechanism for spoken language understanding.
AAAI 2020 : The Thirty-Fourth AAAI Conference on Artificial Intelligence34 (5), pp. 9539–9546. Cited by: Table 1, 6th item, Evaluation Metrics, Results, Related Work.
- A joint model of intent determination and slot filling for spoken language understanding.. In IJCAI, Vol. 16, pp. 2993–2999. Cited by: Introduction, Related Work.
- SG-net: syntax-guided machine reading comprehension.. In AAAI, pp. 9636–9643. Cited by: Related Work.
The following utterances are examples of intent detection errors made by the model without syntactic information; all involve one specific grammatical pattern between prepositions and nouns.
cleveland to kansas city arrive monday before 3 pm
kansas city to atlanta monday morning flights
new york city to las vegas and memphis to las vegas on Sunday
The following utterances are examples of slot filling errors made by the model without syntactic information; all involve POS confusion.
cleveland to kansas city arrive monday before 3 pm
new york city to las vegas and memphis to las vegas on Sunday
baltimore to kansas city economy
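The POS confusion above motivates feeding POS information to the encoder. As a minimal sketch of one common fusion strategy, the snippet below concatenates a POS-tag embedding to each word embedding before the encoder input; the tag set, dimensions, and concatenation choice are illustrative assumptions, not the paper's exact configuration.

```python
import random

random.seed(0)
POS_TAGS = ["NOUN", "PROPN", "ADJ", "ADP", "VERB", "NUM"]  # toy tag set
DIM_WORD, DIM_POS = 8, 4

# Toy embedding tables; a real model would learn these jointly.
word_emb = {}
pos_emb = {t: [random.uniform(-1, 1) for _ in range(DIM_POS)] for t in POS_TAGS}

def embed(tokens, tags):
    """Concatenate each word vector with its POS-tag vector."""
    out = []
    for tok, tag in zip(tokens, tags):
        wv = word_emb.setdefault(
            tok, [random.uniform(-1, 1) for _ in range(DIM_WORD)]
        )
        out.append(wv + pos_emb[tag])  # list concatenation -> 12-dim input
    return out

tokens = ["baltimore", "to", "kansas", "city", "economy"]
tags = ["PROPN", "ADP", "PROPN", "PROPN", "NOUN"]  # "economy" tagged as NOUN
vectors = embed(tokens, tags)
print(len(vectors), len(vectors[0]))  # one 12-dim vector per token
```

With the tag attached explicitly, tokens like "economy" carry their noun reading into the encoder instead of being resolved from context alone.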
The Transformer encoder-based model without syntactic information makes mistakes on all of the utterances below. The model trained with POS tagging and the model trained with both POS tagging and dependency prediction fail only on the last utterance, while the model trained with dependency prediction alone makes no mistakes on any of them. We underline the words that the model without syntactic information assigns to wrong slots.
book a reservation for velma an a and rebecca for an american pizzeria at (correct: ; prediction: ) Am in MA
Where is Belgium located (correct: ; prediction: )
May (correct: ; prediction: ) I have the movie schedules for Speakeasy Theaters
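The dependency-prediction objective supervises an attention head to attend to each token's syntactic ancestors, which are read off the dependency tree by following head pointers up to the root. A minimal sketch of that extraction step is below; the head indices for the example utterance are illustrative, not output from the paper's parser.

```python
def ancestors(heads):
    """Collect each token's syntactic ancestors from a dependency parse.

    `heads[i]` is the 1-based index of token i's syntactic head, with 0
    denoting the artificial root, as in standard dependency treebanks.
    Returns a list of 0-indexed ancestor sets (root excluded).
    """
    result = []
    for i in range(len(heads)):
        anc, j = set(), heads[i]
        while j != 0 and (j - 1) not in anc:  # cycle guard for malformed input
            anc.add(j - 1)
            j = heads[j - 1]
        result.append(anc)
    return result

# Hypothetical parse of "cleveland to kansas city arrive monday before 3 pm":
# "cleveland" -> "arrive", "to" -> "city", "kansas" -> "city",
# "city" -> "cleveland", "arrive" -> root, "monday" -> "arrive",
# "before" -> "pm", "3" -> "pm", "pm" -> "arrive".
heads = [5, 4, 4, 1, 0, 5, 9, 9, 5]
print(ancestors(heads)[3])  # ancestors of "city": {0, 4}, i.e. cleveland, arrive
```

These ancestor sets would form the supervision target for the syntactic attention head, letting "city" attend to both its parent "cleveland" and the verb "arrive" rather than to the parent alone.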