Encoding Syntactic Knowledge in Transformer Encoder for Intent Detection and Slot Filling

12/21/2020 · Jixuan Wang et al. · University of Toronto · Amazon

We propose a novel Transformer encoder-based architecture with syntactic knowledge encoded for intent detection and slot filling. Specifically, we encode syntactic knowledge into the Transformer encoder by jointly training it to predict syntactic parse ancestors and the part of speech of each token via multi-task learning. Our model is based on self-attention and feed-forward layers and does not require external syntactic information to be available at inference time. Experiments show that, on two benchmark datasets, our models with only two Transformer encoder layers achieve state-of-the-art results. Compared to the previously best-performing model without pre-training, our models achieve absolute improvements of 1.59 in slot-filling F1 score and 0.85 in intent-detection accuracy on the SNIPS dataset. On the ATIS dataset, our models achieve absolute improvements of 0.1 in F1 score and 0.34 in accuracy over the previously best-performing model. Furthermore, visualization of the self-attention weights illustrates the benefits of incorporating syntactic information during training.


Introduction

Recent years have seen great success in applying deep learning approaches to enhance the capabilities of virtual assistants (VAs) such as Amazon Alexa, Google Home and Apple Siri. One of the challenges in building these systems is mapping the meaning of users' utterances, which are expressed in natural language, to machine-comprehensible language Allen (1995). An example is illustrated in Figure 1. For the utterance "Show the cheapest flight from Toronto to St. Louis", the machine needs to map the utterance to an intent, Airfare (intent detection), and to slots such as Toronto: FromLocation (slot filling). In this work, we focus on intent detection and slot filling and refer to these as Natural Language Understanding (NLU) tasks.

Previous work shows that even a simple deep neural architecture delivers better performance on NLU tasks than traditional models such as Conditional Random Fields Collobert et al. (2011). Since then, deep neural architectures, predominantly recurrent neural networks, have become an indispensable part of building NLU systems Zhang and Wang (2016); Goo et al. (2018); E et al. (2019). Transformer-based architectures, introduced more recently by Vaswani et al. (2017), have shown significant improvement over previous work on NLU tasks Chen et al. (2019); Qin et al. (2019). Recent studies show that although the Transformer model can learn syntactic knowledge purely by seeing examples, explicitly feeding this knowledge to such models can significantly enhance their performance on tasks such as neural machine translation Sundararaman et al. (2019) and semantic role labeling Strubell et al. (2018). While incorporating syntactic knowledge has also been shown to improve performance on NLU tasks Tur et al. (2011); Chen et al. (2016), both of these works assume that syntactic knowledge is provided by external models at training and inference time.

Figure 1: An example of NLU tasks.

In this paper, we introduce a novel Transformer encoder-based architecture for NLU tasks with syntactic knowledge encoded that does not require syntactic information to be available at inference time. This is accomplished, first, by training one attention head to predict the syntactic ancestors of each token. The dependency relationships between tokens are obtained from syntactic dependency trees, where each word in a sentence is assigned a syntactic head that is either another word in the sentence or an artificial root symbol Dozat and Manning (2016). Adding the objective of dependency relationship prediction allows a given token to attend more to its syntactically relevant parent and ancestors. In addition to dependency parsing knowledge, we encode part-of-speech (POS) information in the Transformer encoder, because previous research shows that POS information can help dependency parsing Nguyen and Verspoor (2018). The closest work to ours is Strubell et al. (2018); however, they focused on semantic role labeling and trained one attention head to predict the direct parent instead of all ancestors.

We compare our models with several state-of-the-art neural NLU models on two publicly available benchmark datasets: ATIS Hemphill et al. (1990) and SNIPS Coucke et al. (2018). The results show that our models outperform previous works. To examine the effects of adding syntactic information, we conduct an ablation study and visualize the self-attention weights in the Transformer encoder.

Problem Definition

We define intent detection (ID) and slot filling (SF) as an utterance-level and a token-level multi-class classification task, respectively. Given an input utterance with $n$ tokens, we predict an intent and a sequence of slots, one per token, as outputs. We add an empty slot, denoted by "O", to represent words carrying no labels. The goal is to maximize the likelihood of the correct intents and slots given the input utterances.
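For concreteness, the utterance from Figure 1 could be represented roughly as below. The slot label strings and the BIO tagging shown here are our own illustrative choices, not the official ATIS annotations.

```python
# An illustrative (not dataset-exact) example of the ID/SF formulation for the
# utterance in Figure 1; the slot label strings are ours, not official ATIS labels.
example = {
    "tokens": ["Show", "the", "cheapest", "flight", "from", "Toronto", "to", "St.", "Louis"],
    "intent": "Airfare",                       # one label per utterance (intent detection)
    "slots":  ["O", "O", "B-CostRelative", "O", "O",
               "B-FromLocation", "O", "B-ToLocation", "I-ToLocation"],  # one label per token
}
```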

Proposed Model

We jointly train our model for the NLU tasks (i.e., ID and SF), syntactic dependency prediction and POS tagging via multi-task learning Caruana (1993), as shown in Figure 2. For dependency prediction, we insert a syntactically-informed Transformer encoder layer after a fixed number of standard encoder layers. In this encoder layer, one attention head is trained to predict the full ancestry of each token in the dependency parse tree. For POS tagging, we add a POS tagging model that shares the lower Transformer encoder layers with the NLU model. We describe the details of our proposed architecture below.

Input Embedding

The input embedding model maps a sequence of tokens into a sequence of continuous embeddings, with the first embedding being that of a special start-of-sentence token, "[SOS]". The embeddings are then fed into the NLU model.

Figure 2: A high-level overview of the proposed architecture. Note that the numbers of layers shown can vary depending on the implementation. "MLP" refers to a multi-layer perceptron.

Transformer Encoder Layer

The Transformer encoder layer was originally proposed in Vaswani et al. (2017). Each encoder layer consists of a multi-head self-attention layer and feed-forward layers with layer normalization and residual connections. We stack multiple encoder layers, each with $h$ attention heads, to learn contextual embeddings of each token. Let $H$ denote the output embeddings of the previous encoder layer; each attention head first calculates self-attention weights by the scaled dot product in (1):

$$A = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) \qquad (1)$$

In (1), the query $Q$ and key $K$ are two different linear transformations of $H$, and $d_k$ is the dimension of the query and key embeddings. The output of the attention head is calculated by:

$$\mathrm{head} = A\,V \qquad (2)$$

in which the value $V$ is also a linear transformation of $H$. The outputs of the $h$ attention heads are concatenated as the self-attended token representations, followed by another linear transformation:

$$H' = \left[\mathrm{head}_1; \dots; \mathrm{head}_h\right] W^{O} \qquad (3)$$

which is fed into the following feed-forward layer. Residual connections and layer normalization are applied after the multi-head attention and the feed-forward layer, respectively.
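For reference, a minimal PyTorch sketch of the multi-head self-attention of Eqs. (1)-(3) is given below; dimensions and naming are illustrative rather than taken from the paper.

```python
# A minimal sketch (not the authors' code) of multi-head self-attention, Eqs. (1)-(3).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_k = d_model // n_heads
        # Q, K, V are different linear transformations of the layer input H.
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)  # W^O in Eq. (3)

    def forward(self, H: torch.Tensor) -> torch.Tensor:
        # H: (batch, seq_len, d_model)
        B, N, _ = H.shape
        def split(x):  # (B, N, d_model) -> (B, heads, N, d_k)
            return x.view(B, N, self.n_heads, self.d_k).transpose(1, 2)
        Q, K, V = split(self.q_proj(H)), split(self.k_proj(H)), split(self.v_proj(H))
        # Eq. (1): scaled dot-product attention weights.
        A = F.softmax(Q @ K.transpose(-2, -1) / self.d_k ** 0.5, dim=-1)
        # Eq. (2): weighted sum of values, per head.
        heads = A @ V                                   # (B, heads, N, d_k)
        # Eq. (3): concatenate heads and apply the output transformation.
        concat = heads.transpose(1, 2).reshape(B, N, -1)
        return self.out_proj(concat)
```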

Figure 3: Overview of the syntactically-informed Transformer layer. One of the $h$ attention heads is trained to predict the syntactic parse ancestors of each token. For each token, this attention head outputs a distribution over all positions in the sentence, corresponding to the probability of each position being an ancestor of this token. The loss function is defined as the mean Kullback–Leibler (KL) divergence between the output distributions of all tokens and the corresponding prior distributions.

Encoding Syntactic Dependency Knowledge

As shown in Figure 3, the syntactically-informed Transformer encoder layer differs from the standard Transformer encoder layer by having one of its attention heads trained to predict the full ancestry of each token, i.e., parents, grandparents, great-grandparents, etc. Different from Strubell et al. (2018), we use full ancestry prediction instead of only direct parent prediction; we demonstrate the benefits of this choice in the Results section.

Given an input sequence of length $n$, the output of a regular attention head is an $n \times n$ matrix, in which each row contains the attention weights that one token puts on all tokens in the input sequence. The output of the syntactically-informed attention head is also an $n \times n$ matrix, but this attention head is trained to assign weight only to the syntactic governors (i.e., ancestors) of each token.

To train this attention head, we define a loss function based on the difference between the output attention weight matrix of the syntactically-informed attention head and a predefined prior attention weight matrix. The prior attention weight matrix encodes the prior knowledge that each token should attend to its syntactic parse ancestors, with attention weights being higher on ancestors that are closer to that token. During training, we obtain the prior attention weights from the outputs of a pre-trained dependency parser.

For example, in the utterance "list flights arriving in Toronto on March first", the syntactic parse ancestors of the word "first" are "March", "arriving" and "flights", which are one, two and three hops away on the dependency tree, respectively, as shown in Figure 4. The ancestors are syntactically meaningful for determining the slot of "first", which is "arrive date, day number" in this case.

Figure 4: Syntactic dependency tree of “list flights arriving in Toronto on March first” and prior attention weights of word “first”.

To train the attention head to assign higher weights to the ancestors of each token, we define the prior attention weights of each token based on the distance between the token and its ancestors. Formally, the prior attention weights of token $i$ are defined as:

$$P_{ij} = \begin{cases} \dfrac{\exp(-d_{ij}/T)}{\sum_{k \in \mathrm{Anc}(i)} \exp(-d_{ik}/T)}, & j \in \mathrm{Anc}(i) \\ 0, & \text{otherwise} \end{cases} \qquad (4)$$

in which $d_{ij}$ is the distance (number of hops) between token $i$ and token $j$, $\mathrm{Anc}(i)$ is the set of syntactic ancestors of token $i$, the normalization in (4) is the Softmax function $\sigma$ taken over $\mathrm{Anc}(i)$, and $T$ is the temperature of the Softmax controlling the variance of the attention weights over all the ancestors. The stack of prior attention weights of all tokens is an $n \times n$ matrix, denoted by $P$. We train our model to decrease the difference between $P$ and the attention matrix $A^{\mathrm{dep}}$ output by the attention head at the syntactically-informed layer. The difference is measured by the mean of the row-wise KL divergence between these two matrices, which is used as an additional loss besides the NLU loss functions. We refer to this loss as the dependency loss, denoted by $\mathcal{L}_{\mathrm{dep}}$; formally:

$$\mathcal{L}_{\mathrm{dep}} = \frac{1}{n}\sum_{i=1}^{n}\mathrm{KL}\!\left(P_i \,\Vert\, A^{\mathrm{dep}}_i\right) \qquad (5)$$
$$A^{\mathrm{dep}} = \mathrm{softmax}\!\left(Q_{\mathrm{dep}}\, U\, K_{\mathrm{dep}}^{\top}\right) \qquad (6)$$

in which $\mathrm{KL}(\cdot\Vert\cdot)$ denotes the KL divergence, $Q_{\mathrm{dep}}$ and $K_{\mathrm{dep}}$ are linear transformations of $H$, $P_i$ is the $i$-th row of $P$, and $U$ is a parameter matrix. In (6) we use biaffine attention instead of scaled dot-product attention, which has been shown to be effective for dependency parsing Dozat and Manning (2016).

We treat the temperature $T$ as a hyperparameter and tune it on the validation set. As $T$ approaches 0, the attention head is trained to pay attention only to the direct parent of each token, the special case used by Strubell et al. (2018). Thus, our method is a more general approach compared to Strubell et al. (2018).
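As an illustration, the sketch below shows one way the prior attention matrix of Eq. (4) and the dependency loss of Eq. (5) could be computed in PyTorch. The 0-indexed head convention, the root fallback, and the numerical epsilon are our own assumptions, not details from the paper.

```python
# A minimal sketch (not the authors' code) of the prior attention matrix, Eq. (4),
# and the dependency loss, Eq. (5).
import torch
import torch.nn.functional as F

def prior_attention(heads: list[int], T: float = 1.0) -> torch.Tensor:
    """heads[i] is the 0-indexed dependency parent of token i (-1 for the root)."""
    n = len(heads)
    P = torch.zeros(n, n)
    for i in range(n):
        # Walk up the (assumed well-formed) tree collecting (ancestor, distance) pairs.
        j, dist, scores = heads[i], 1, {}
        while j != -1:
            scores[j] = -dist / T
            j, dist = heads[j], dist + 1
        if scores:  # softmax over ancestors only; non-ancestors stay 0
            idx = torch.tensor(list(scores.keys()))
            P[i, idx] = F.softmax(torch.tensor(list(scores.values())), dim=0)
        else:       # the root has no ancestors; attending to itself is our own fallback
            P[i, i] = 1.0
    return P

def dependency_loss(A_dep: torch.Tensor, P: torch.Tensor, eps: float = 1e-9) -> torch.Tensor:
    """Mean row-wise KL(P || A_dep), Eq. (5)."""
    kl = (P * (torch.log(P + eps) - torch.log(A_dep + eps))).sum(dim=-1)
    return kl.mean()
```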

Encoding Part-of-Speech Knowledge

Part-of-Speech (POS) information is important for disambiguating words with multiple meanings Alva and Hegde (2016), because an ambiguous word carries a specific POS in a particular context Pal et al. (2015). For instance, the word "May" can be either a verb or a noun; being aware of its POS tag is beneficial for downstream tasks, such as predicting the slots in the utterance "book a flight on May 1st". Furthermore, previous studies have shown that while models trained for a sufficiently large number of steps can potentially learn the underlying patterns of POS, this knowledge is imperfect Jawahar et al. (2019); Sundararaman et al. (2019). For these reasons, we explicitly train our model to perform POS tagging using POS tags generated by a pretrained POS tagger.

Similar to slot filling, we treat POS tagging as a token-level classification problem. We apply an MLP-based classifier to the output embeddings of the shared Transformer encoder layers and use cross entropy as the loss function:

$$p^{\mathrm{POS}}_{ij} = \mathrm{softmax}\!\left(\mathrm{MLP}_{\mathrm{POS}}(h_i)\right)_j \qquad (7)$$
$$\mathcal{L}_{\mathrm{POS}} = -\sum_{i=1}^{n}\sum_{j=1}^{N_{\mathrm{POS}}} y^{\mathrm{POS}}_{ij} \log p^{\mathrm{POS}}_{ij} \qquad (8)$$

in which $p^{\mathrm{POS}}_{ij}$ is the predicted probability of the $i$-th token's POS label being the $j$-th label in the POS label space, $N_{\mathrm{POS}}$ is the total number of POS labels, and $y^{\mathrm{POS}}_{i}$ is the one-hot representation of the ground-truth POS label.

Intent Detection and Slot Filling

Intent detection:

We apply a linear classifier to the embedding of the "[SOS]" token, $h_{\mathrm{[SOS]}}$, output by the last Transformer encoder layer. Cross-entropy loss is used for intent detection. The loss on one utterance is defined as:

$$p^{\mathrm{ID}} = \mathrm{softmax}\!\left(W\, h_{\mathrm{[SOS]}} + b\right) \qquad (9)$$
$$\mathcal{L}_{\mathrm{ID}} = -\sum_{j=1}^{N_{\mathrm{ID}}} y^{\mathrm{ID}}_{j} \log p^{\mathrm{ID}}_{j} \qquad (10)$$

in which $W$ and $b$ are the parameters of the linear classifier, $y^{\mathrm{ID}}$ is the one-hot representation of the ground-truth intent label, $N_{\mathrm{ID}}$ is the total number of intent labels, and $p^{\mathrm{ID}}_{j}$ is the predicted probability of this utterance's intent being the $j$-th label in the intent label space.

Slot filling:

We apply an MLP-based classifier to the token embeddings output by the last Transformer encoder layer, using cross entropy as the loss function. The loss on one utterance is defined as follows:

$$p^{\mathrm{SF}}_{ij} = \mathrm{softmax}\!\left(\mathrm{MLP}_{\mathrm{SF}}(h_i)\right)_j \qquad (11)$$
$$\mathcal{L}_{\mathrm{SF}} = -\sum_{i=1}^{n}\sum_{j=1}^{N_{\mathrm{SF}}} y^{\mathrm{SF}}_{ij} \log p^{\mathrm{SF}}_{ij} \qquad (12)$$

in which $p^{\mathrm{SF}}_{ij}$ is the predicted probability of the $i$-th token's slot being the $j$-th label in the slot label space, $N_{\mathrm{SF}}$ is the total number of slot labels, and $y^{\mathrm{SF}}_{i}$ is the one-hot representation of the ground-truth slot label.
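The intent and slot heads above, as well as the POS head of the previous subsection, are simple linear or MLP classifiers trained with cross entropy. A minimal PyTorch sketch (with an illustrative hidden size, not the authors' implementation) could look as follows.

```python
# A minimal sketch (not the authors' code) of the intent and slot heads, Eqs. (9)-(12);
# the POS head of Eqs. (7)-(8) has the same form as the slot head.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NLUHeads(nn.Module):
    def __init__(self, d_model: int, n_intents: int, n_slots: int, d_hidden: int = 256):
        super().__init__()
        self.intent_head = nn.Linear(d_model, n_intents)            # Eq. (9)
        self.slot_head = nn.Sequential(                              # Eq. (11)
            nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, n_slots)
        )

    def forward(self, H: torch.Tensor):
        # H: (batch, seq_len, d_model); position 0 holds the "[SOS]" embedding.
        intent_logits = self.intent_head(H[:, 0])                    # (batch, n_intents)
        slot_logits = self.slot_head(H)                              # (batch, seq_len, n_slots)
        return intent_logits, slot_logits

def nlu_loss(intent_logits, slot_logits, intent_labels, slot_labels):
    # Cross entropy implements Eqs. (10) and (12); slot logits are flattened over tokens.
    l_id = F.cross_entropy(intent_logits, intent_labels)
    l_sf = F.cross_entropy(slot_logits.reshape(-1, slot_logits.size(-1)),
                           slot_labels.reshape(-1))
    return l_id, l_sf
```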

Multi-task Learning

We train our model via multi-task learning Caruana (1993). Our loss function is defined as:

$$\mathcal{L} = \mathcal{L}_{\mathrm{NLU}} + \lambda_{\mathrm{dep}}\,\mathcal{L}_{\mathrm{dep}} + \lambda_{\mathrm{POS}}\,\mathcal{L}_{\mathrm{POS}} \qquad (13)$$

where $\mathcal{L}_{\mathrm{NLU}}$ equals $\mathcal{L}_{\mathrm{SF}}$ for slot filling, $\mathcal{L}_{\mathrm{ID}}$ for intent detection, or $\mathcal{L}_{\mathrm{SF}} + \mathcal{L}_{\mathrm{ID}}$ for joint training, and $\lambda_{\mathrm{dep}}$ and $\lambda_{\mathrm{POS}}$ are the coefficients of the dependency prediction loss and the POS tagging loss, respectively. Both coefficients are treated as hyperparameters and selected based on validation performance.
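As a small illustration, the combined objective of Eq. (13) for the joint setting could be assembled as below; the coefficient values are placeholders, not those used in the experiments.

```python
# A minimal sketch (not the authors' code) of the multi-task objective, Eq. (13).
def total_loss(l_sf, l_id, l_dep, l_pos, lambda_dep=1.0, lambda_pos=1.0):
    # Joint training: L_NLU = L_SF + L_ID; single-task settings use only one of the terms.
    l_nlu = l_sf + l_id
    return l_nlu + lambda_dep * l_dep + lambda_pos * l_pos
```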

Model | SNIPS SF | SNIPS ID | ATIS SF | ATIS ID-M | ATIS ID-S
Joint Seq Hakkani-Tür et al. (2016) | 87.30 | 96.90 | 94.30 | 92.60 | -
Attention-based RNN Liu and Lane (2016) | 87.80 | 96.70 | 95.78 | - | 97.98
Slot-Gated Goo et al. (2018) | 89.27 | 96.86 | 95.42 | 95.41 | -
SF-ID, SF first E et al. (2019) | 91.43 | 97.43 | 95.75 | 97.76 | -
SF-ID, ID first E et al. (2019) | 92.23 | 97.29 | 95.80 | 97.09 | -
Stack-Propagation Qin et al. (2019) | 94.20 | 98.00 | 95.90 | 96.90 | -
Graph LSTM Zhang et al. (2020a) | 95.30 | 98.29 | 95.91 | 97.20 | -
TF | 96.37 | 98.29 | 95.31 | 96.42 | 97.65
SyntacticTF (Independent) | 96.56 | 98.71 | 95.94 | 97.76 | 98.10
SyntacticTF (Joint) | 96.89 | 99.14 | 96.01 | 97.31 | 98.32
JointBERT Chen et al. (2019) | 97.00 | 98.60 | 96.10 | 97.50 | -
Table 1: SF and ID results on the SNIPS and ATIS datasets (%). TF refers to the Transformer encoder-based model trained without syntactic information. SyntacticTF refers to our model. Independent and Joint refer to training independently and jointly for SF and ID, respectively. ID-M refers to multiple-label matching for intent detection evaluation and ID-S to single-label matching. JointBERT relies on pre-training, which is not required by the other models in the table.

Experiments

Datasets

We conducted experiments on two benchmark datasets: the Airline Travel Information Systems (ATIS) dataset Hemphill et al. (1990) and the SNIPS dataset Coucke et al. (2018). The ATIS dataset focuses on airline information and has long been used as a benchmark for NLU tasks. We used the same version as Goo et al. (2018); E et al. (2019), which contains 4,478 utterances for training, 500 for validation and 893 for testing. The SNIPS dataset focuses on personal assistant commands, with a larger vocabulary and more diverse intents and slots. It contains 13,084 utterances for training, 700 for validation and 700 for testing.

Evaluation Metrics

We use classification accuracy for intent detection and the F1 score, the harmonic mean of precision and recall, for slot filling. For the SNIPS dataset, we use the same version and evaluation method as previous works Zhang et al. (2020a). For the ATIS dataset, we find that previous works use two different evaluation methods for intent detection on utterances with multiple labels. The first method counts a prediction as correct if it is equal to one of the ground-truth labels of the utterance Liu and Lane (2016); we refer to this as the single-label matching method (ID-S). The second method counts a prediction as correct only if it matches all labels of the utterance Goo et al. (2018); E et al. (2019); we refer to this as the multiple-label matching method (ID-M). We report both in our results.

Implementation Details

Our experiments are implemented in PyTorch Paszke et al. (2017). The hyperparameters are selected based on performance on the validation set. We use the Adam optimizer Kingma and Ba (2015) with the decoupled weight decay fix described in Loshchilov and Hutter (2017). Our learning rate schedule first increases the learning rate linearly from 0 to 0.0005 (warm-up) and then decreases it to 0 following the cosine function. The number of warm-up steps is a fixed fraction of the total training steps and is determined by validation performance. We use the optimizer and learning rate scheduler implementations from the Transformers library Wolf et al. (2019).
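A sketch of how such a warm-up plus cosine-decay schedule could be set up with the Transformers library is shown below; the model, total step count and warm-up fraction are placeholders, and we use torch.optim.AdamW here as a stand-in for the library's optimizer.

```python
# A minimal sketch (not the authors' code) of the optimizer and LR schedule described above.
import torch
from transformers import get_cosine_schedule_with_warmup

model = torch.nn.Linear(10, 2)          # stand-in for the NLU model
total_steps = 10_000                    # placeholder
warmup_steps = int(0.1 * total_steps)   # the warm-up fraction is tuned on validation data

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.1)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_steps
)

for step in range(total_steps):
    # ... forward/backward pass would go here ...
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```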

We use Stanza Qi et al. (2020) to generate training labels for POS tagging and dependency prediction. For the NLU model trained with both dependency prediction and POS tagging, the loss coefficients $\lambda_{\mathrm{dep}}$ and $\lambda_{\mathrm{POS}}$ are set to the same fixed value; for the NLU model trained with only dependency prediction, only $\lambda_{\mathrm{dep}}$ is used. We use a weight decay of 0.1 and dropout rates Srivastava et al. (2014) of 0.1 and 0.3 for the SNIPS and ATIS datasets, respectively. We use a batch size of 32 and train each model for a fixed number of epochs, reporting the test results of the checkpoints achieving the best validation performance.
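For reference, the snippet below sketches one way POS tags and dependency heads could be obtained from Stanza for a pre-tokenized utterance; the processor list and pre-tokenization flag are our own assumptions, not details reported in the paper.

```python
# A minimal sketch (not the authors' code) of generating POS and dependency labels with Stanza.
import stanza

# stanza.download("en")  # run once to fetch the English models
nlp = stanza.Pipeline("en", processors="tokenize,pos,lemma,depparse",
                      tokenize_pretokenized=True)

doc = nlp([["list", "flights", "arriving", "in", "Toronto", "on", "March", "first"]])
for word in doc.sentences[0].words:
    # word.head is 1-indexed (0 means the root); convert to 0-indexed with -1 for the root.
    head = word.head - 1
    print(word.text, word.upos, head, word.deprel)
```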

We use the concatenation of GloVe embeddings Pennington et al. (2014) and character embeddings Hashimoto et al. (2017) as token embeddings and keep them frozen during training. The hidden dimension of the Transformer encoder layers is 768 and the size of the feed-forward layers is 3072. Considering the small size of the two datasets, we use only two Transformer encoder layers in total (configured as in Figure 2), each with multiple attention heads. For slot filling, we apply Viterbi decoding at test time. BIO is the standard annotation scheme for slot filling, as shown in Figure 1. The transition probabilities are set manually to ensure that the output sequences of BIO labels are valid, by simply setting the probabilities of invalid transitions to zero and the probabilities of valid transitions to one.
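The sketch below illustrates this kind of constrained decoding: a 0/1 BIO transition table (invalid transitions set to zero, i.e., log-probability minus infinity) and a standard Viterbi pass over per-token slot log-probabilities. It is a simplified illustration, not the authors' implementation.

```python
# A minimal sketch (not the authors' code) of Viterbi decoding with hard BIO constraints.
import math

def bio_transition_logp(labels):
    """labels: list of BIO strings, e.g. ["O", "B-city", "I-city"]; returns 0 / -inf log-transitions."""
    n = len(labels)
    trans = [[0.0] * n for _ in range(n)]
    for i, prev in enumerate(labels):
        for j, curr in enumerate(labels):
            # "I-X" may only follow "B-X" or "I-X" of the same slot type X.
            if curr.startswith("I-") and prev not in ("B-" + curr[2:], "I-" + curr[2:]):
                trans[i][j] = -math.inf
    return trans

def viterbi(emission_logp, labels):
    """emission_logp: seq_len x n_labels log-probabilities from the slot classifier."""
    trans = bio_transition_logp(labels)
    n = len(labels)
    # A valid BIO sequence cannot start with an "I-" label.
    score = [lp if not lab.startswith("I-") else -math.inf
             for lp, lab in zip(emission_logp[0], labels)]
    back = []
    for t in range(1, len(emission_logp)):
        ptr, new_score = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: score[i] + trans[i][j])
            new_score.append(score[best_i] + trans[best_i][j] + emission_logp[t][j])
            ptr.append(best_i)
        score, back = new_score, back + [ptr]
    path = [max(range(n), key=lambda j: score[j])]
    for ptr in reversed(back):          # backtrace the best label sequence
        path.append(ptr[path[-1]])
    return [labels[j] for j in reversed(path)]
```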

Baseline Models

We compare our proposed model with the following baseline models:

  • Joint Seq Hakkani-Tür et al. (2016) is a joint model for intent detection and slot filling based on the bi-directional LSTM model.

  • Attention-based RNN Liu and Lane (2016) is a sequence-to-sequence model with the attention mechanism.

  • Slot-Gated  Goo et al. (2018) utilizes intent information for slot filling through the gating mechanism.

  • SF-ID E et al. (2019) is an architecture that enables the interaction between intent detection and slot filling.

  • Stack-Propagation Qin et al. (2019) is a joint model based on the Stack-Propagation framework.

  • Graph LSTM Zhang et al. (2020a) is based on the Graph LSTM model.

  • JointBERT Chen et al. (2019) is a joint NLU model fine-tuned from the pretrained BERT model Devlin et al. (2018).

  • TF is the Transformer encoder-based model trained without syntactic information.

Results

Table 1 shows the performance of the baseline and proposed models for SF and ID on the SNIPS and ATIS datasets. Overall, our proposed models achieve the best performance on the two benchmark datasets. On the SNIPS dataset, our proposed joint model achieves absolute F1 score and accuracy improvements of 1.59 and 0.85 for SF and ID, respectively, compared to the best-performing baseline model without pre-training Zhang et al. (2020a). On the ATIS dataset, our proposed joint model also achieves absolute F1 score and accuracy improvements of 0.10 and 0.34 for SF and ID-S, compared to the best-performing baseline models for SF Zhang et al. (2020a) and ID-S Liu and Lane (2016), respectively. In addition, our proposed independent model achieves the same performance as the best-performing baseline model on ID-M (E et al., 2019, SF-ID, SF first).

Moreover, the Transformer encoder-based model without syntactic knowledge already achieves SOTA results on the SNIPS dataset and is only slightly worse than the SOTA results on the ATIS dataset, indicating the strength of the Transformer encoder for SF and ID. The further improvement of our models over the baseline models demonstrates the benefits of incorporating syntactic knowledge. Additionally, compared to previous works with heterogeneous model structures, our models are based purely on self-attention and feed-forward layers.

We also find that our proposed models can outperform the JointBERT model with pre-training Chen et al. (2019) on intent detection. Compared to the JointBERT model, our proposed joint model achieves an absolute accuracy improvement of 0.54 for ID on the SNIPS dataset, and our proposed independent model achieves an absolute accuracy improvement of 0.26 for ID-M on the ATIS dataset. While our proposed models do not outperform the JointBERT model for SF, the performance gap is relatively small (0.11 on SNIPS and 0.09 on ATIS). It should be noted that our model does not require pre-training and is only about one seventh the size of the JointBERT model (16 million vs. 110 million parameters).

Previous works have shown that models like BERT can learn syntactic knowledge by self-supervision Clark et al. (2019); Manning et al. (2020). This can partially explain why the JointBERT can achieve very good results without being fed with syntactic knowledge explicitly.

Ablation Study

Table 2 shows the results of an ablation study on the effects of adding different kinds of syntactic information. A first observation is that a model trained with a single syntactic task, either dependency prediction or POS tagging, outperforms the baseline Transformer encoder-based model without syntactic information. This gives us confidence that syntactic information can help improve model performance. Moreover, training the Transformer model with both syntactic tasks achieves even better results than training with a single syntactic task. This could be because the POS tagging task improves the performance of the dependency prediction task Nguyen and Verspoor (2018), which in turn improves the performance of SF and ID.

Model | SNIPS SF | SNIPS ID | ATIS SF | ATIS ID-M | ATIS ID-S
TF | 96.37 | 98.29 | 95.31 | 96.42 | 97.65
TF + D | 96.31 | 98.43 | 95.99 | 96.53 | 98.76
TF + P | 96.47 | 98.57 | 95.82 | 97.31 | 98.10
TF + D + P | 96.56 | 98.71 | 95.94 | 97.76 | 98.10
Table 2: Results of the ablation study. TF refers to the baseline model with two Transformer encoder layers. D and P refer to dependency prediction and POS tagging, respectively.

Interestingly, we observe that adding dependency prediction alone slightly reduces slot filling performance on the SNIPS dataset (96.31) compared to the baseline Transformer encoder-based model (96.37). There are several potential reasons. First, the sentences in the SNIPS dataset are overall shorter than in the ATIS dataset, so syntactic dependency information might be less helpful. Second, previous work has shown that syntactic parsing performance often suffers when a named entity span has crossing brackets with spans on the parse tree Finkel and Manning (2009). Thus, the dependency prediction performance of our model might decrease due to the many named entities in the SNIPS dataset, such as song and movie names, which could introduce noisy dependency information into the attention weights and degrade performance on the NLU tasks.

Qualitative Analysis

We qualitatively examined the errors made by the Transformer encoder-based models with and without syntactic information to understand in what ways syntactic information helps improve the performance. Our major findings are:

ID errors related to prepositions between nouns: prepositions, when appearing between nouns, describe the relationship between them. For example, in the utterance "kansas city to atlanta monday morning flights", the preposition "to" denotes the direction from "kansas city" (departure location, noun) to "atlanta" (arrival location, noun). Without this knowledge, a model could misclassify the intent of this utterance as asking for city information rather than flight information. We found that a noticeably larger fraction of the errors made by the model without syntactic information contain this pattern than of the misclassified utterances of the model with syntactic information (see Appendix A for the full list).

SF errors due to POS confusion: a word can have multiple meanings depending on context. For example, the word "may" can be a verb expressing possibility or a noun referring to the fifth month of the year. We found that correctly recognizing the POS of words could help reduce slot filling errors. For example, in the utterance "May I have the movie schedules for Speakeasy Theaters", the slot for "May" should be empty, but the model without syntactic information predicts it as "Time Range". By contrast, the model with syntactic information predicts this word correctly, probably because the noun vs. verb confusion for "May" is resolved by incorporating POS information. More examples are included in Appendix A.

Model | SNIPS SF | SNIPS ID | ATIS SF | ATIS ID-S
TF + Par. | 96.20 | 98.29 | 95.58 | 98.10
TF + Anc. | 96.31 | 98.43 | 95.99 | 98.76
Table 3: Intent detection and slot filling results of Transformer (TF) encoder-based models with dependency parent prediction (Par.) and dependency ancestor prediction (Anc.) on the SNIPS and ATIS datasets.

Parent Prediction vs. Ancestor Prediction

We compare our approach of predicting all ancestors of each token with the approach of Strubell et al. (2018), which predicts only the direct dependency parent of each token. The results in Table 3 show that our approach achieves better results for both ID and SF on both datasets, which suggests that it is more beneficial for the NLU tasks. We hypothesize that incorporating syntactic ancestor prediction better captures long-distance syntactic relationships. As shown in Tur et al. (2010), long-distance dependencies are important for slot filling. For example, in the utterance "Find flights to LA arriving in no later than next Monday", a 6-gram context is needed to determine that "Monday" is the arrival date rather than the departure date.

Figure 5: Visualization of the attention weights of the models with and without syntactic supervision for slot filling. L$i$ and H$j$ stand for the $i$-th Transformer layer and the $j$-th attention head, respectively. The attention head inside the red-dotted box is trained for dependency prediction.

Visualization of Attention Weights

We visualize the attention weights output by models trained with and without syntactic information to understand what the models have learned by incorporating syntactic information. We select the utterance "show me the flights on american airlines which go from st. petersburg to ontario california by way of st. louis" from the ATIS test set. Only the model trained with syntactic information predicts the slot labels correctly. As shown in Figure 5, the model without syntactic information has simple attention patterns on both layers, such as looking backward and looking forward; the other attention heads appear random and less informative.

In contrast, the model with syntactic information has more informative attention patterns. On the first layer, all the attention heads present simple but diverse patterns: besides looking forward and backward, the second attention head looks in both directions for each token. On the second layer, however, we observe more complex patterns and long-distance attention, which could account for more task-oriented operations. It is therefore possible that, with syntactic supervision, the Transformer encoder learns its attention weights more effectively and can devote more capacity to the end task.

Related Work

Research on intent detection and slot filling emerged from call classification systems Gorin et al. (1997) and the ATIS project Price (1990). Early work primarily focused on traditional machine learning classifiers such as CRFs Haffner et al. (2003). Recently, there has been increasing application of neural models, primarily RNN-based, to NLU tasks, and these approaches have been shown to outperform traditional models Mesnil et al. (2014); Tur et al. (2012); Zhang and Wang (2016); Goo et al. (2018); E et al. (2019). For example, Mesnil et al. (2014) employed RNNs for slot filling and reported a relative F1 improvement over CRFs. Some works have also explored Transformer encoder and graph LSTM-based neural architectures Chen et al. (2019); Zhang et al. (2020a).

Syntactic information has been shown to be beneficial for many tasks, such as neural machine translation Akoury et al. (2019), semantic role labeling Strubell et al. (2018), and machine reading comprehension Zhang et al. (2020b). Research on NLU tasks has also shown that incorporating syntactic information into machine learning models can improve performance. Moschitti et al. (2007) used syntactic information for slot filling, employing a tree kernel function to encode the structural information produced by a syntactic parser. An extensive analysis of the ATIS dataset revealed that most NLU errors are caused by complex syntactic characteristics, such as prepositional phrases and long-distance dependencies Tur et al. (2010). Tur et al. (2011) proposed a rule-based sentence simplification method that augments input utterances based on their dependency-parse structure. Compared to previous works, ours is the first to encode syntactic knowledge into end-to-end neural models for intent detection and slot filling.

Conclusion

In this paper, we propose encoding syntactic knowledge into a Transformer encoder-based model for intent detection and slot filling. Experimental results indicate that a model with only two Transformer encoder layers can already match or even outperform SOTA performance on two benchmark datasets. Moreover, we show that the performance of this baseline model can be further improved by incorporating syntactic supervision. The visualization of the attention weights also reveals that syntactic supervision helps the model learn syntactically related patterns. For future work, we will evaluate our approach with larger model sizes on larger-scale datasets containing more syntactically complex utterances. Furthermore, we will investigate incorporating syntactic knowledge into models pretrained by self-supervision and applying those models to NLU tasks.

Acknowledgement

We would like to thank Siegfried Kunzmann, Nathan Susanj, Ross McGowan, and anonymous reviewers for their insightful feedback that greatly improved our paper.

References

  • N. Akoury, K. Krishna, and M. Iyyer (2019) Syntactically supervised transformers for faster neural machine translation. arXiv preprint arXiv:1906.02780. Cited by: Related Work.
  • J. Allen (1995) Natural language understanding (2nd ed.). Benjamin-Cummings Publishing Co., Inc., USA. External Links: ISBN 0805303340 Cited by: Introduction.
  • P. Alva and V. Hegde (2016) Hidden markov model for pos tagging in word sense disambiguation. In 2016 International Conference on Computation System and Information Technology for Sustainable Solutions (CSITSS), pp. 279–284. Cited by: Encoding Part-of-Speech Knowledge.
  • R. Caruana (1993) Multitask learning: a knowledge-based source of inductive bias. In ICML’93 Proceedings of the Tenth International Conference on International Conference on Machine Learning, pp. 41–48. Cited by: Multi-task Learning, Proposed Model.
  • Q. Chen, Z. Zhuo, and W. Wang (2019) BERT for joint intent classification and slot filling. arXiv preprint arXiv:1902.10909. Cited by: Introduction, Table 1, 7th item, Results, Related Work.
  • Y. Chen, D. Hakanni-Tür, G. Tur, A. Celikyilmaz, J. Guo, and L. Deng (2016) Syntax or semantics? knowledge-guided joint semantic frame parsing. In 2016 IEEE Spoken Language Technology Workshop (SLT), pp. 348–355. Cited by: Introduction.
  • K. Clark, U. Khandelwal, O. Levy, and C. D. Manning (2019) What does bert look at? an analysis of bert’s attention. arXiv preprint arXiv:1906.04341. Cited by: Results.
  • R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa (2011) Natural language processing (almost) from scratch. Journal of machine learning research 12 (ARTICLE), pp. 2493–2537. Cited by: Introduction.
  • A. Coucke, A. Saade, A. Ball, T. Bluche, A. Caulier, D. Leroy, C. Doumouro, T. Gisselbrecht, F. Caltagirone, T. Lavril, et al. (2018) Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces. arXiv preprint arXiv:1805.10190. Cited by: Introduction, Datasets.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: 7th item.
  • T. Dozat and C. D. Manning (2016) Deep biaffine attention for neural dependency parsing. arXiv preprint arXiv:1611.01734. Cited by: Introduction, Encoding Syntactic Dependency Knowledge.
  • H. E, P. Niu, Z. Chen, and M. Song (2019) A novel bi-directional interrelated model for joint intent detection and slot filling. In ACL 2019 : The 57th Annual Meeting of the Association for Computational Linguistics, pp. 5467–5471. Cited by: Introduction, Table 1, 4th item, Datasets, Evaluation Metrics, Results, Related Work.
  • J. R. Finkel and C. D. Manning (2009) Joint parsing and named entity recognition. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 326–334. Cited by: Ablation Study.
  • C. Goo, G. Gao, Y. Hsu, C. Huo, T. Chen, K. Hsu, and Y. Chen (2018) Slot-gated modeling for joint slot filling and intent prediction. In NAACL HLT 2018: 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol. 2, pp. 753–757. Cited by: Introduction, Table 1, 3rd item, Datasets, Evaluation Metrics, Related Work.
  • A. L. Gorin, G. Riccardi, and J. H. Wright (1997) How may i help you?. Speech communication 23 (1-2), pp. 113–127. Cited by: Related Work.
  • P. Haffner, G. Tur, and J. H. Wright (2003) Optimizing svms for complex call classification. In 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings.(ICASSP’03)., Vol. 1, pp. I–I. Cited by: Related Work.
  • D. Hakkani-Tür, G. Tür, A. Çelikyilmaz, Y. Chen, J. Gao, L. Deng, and Y. Wang (2016) Multi-domain joint semantic frame parsing using bi-directional rnn-lstm.. In Interspeech 2016, pp. 715–719. Cited by: Table 1, 1st item.
  • K. Hashimoto, C. Xiong, Y. Tsuruoka, and R. Socher (2017) A joint many-task model: growing a neural network for multiple nlp tasks. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 1923–1933. Cited by: Implementation Details.
  • C. T. Hemphill, J. J. Godfrey, and G. R. Doddington (1990) The atis spoken language systems pilot corpus. In Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, June 24-27, 1990, Cited by: Introduction, Datasets.
  • G. Jawahar, B. Sagot, and D. Seddah (2019) What does bert learn about the structure of language?. Cited by: Encoding Part-of-Speech Knowledge.
  • D. P. Kingma and J. L. Ba (2015) Adam: a method for stochastic optimization. In ICLR 2015 : International Conference on Learning Representations 2015, Cited by: Implementation Details.
  • B. Liu and I. Lane (2016) Attention-based recurrent neural network models for joint intent detection and slot filling. In Interspeech 2016, pp. 685–689. Cited by: Table 1, 2nd item, Evaluation Metrics, Results.
  • I. Loshchilov and F. Hutter (2017) Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101. Cited by: Implementation Details.
  • C. D. Manning, K. Clark, J. Hewitt, U. Khandelwal, and O. Levy (2020) Emergent linguistic structure in artificial neural networks trained by self-supervision. Proceedings of the National Academy of Sciences. Cited by: Results.
  • G. Mesnil, Y. Dauphin, K. Yao, Y. Bengio, L. Deng, D. Hakkani-Tur, X. He, L. Heck, G. Tur, D. Yu, et al. (2014) Using recurrent neural networks for slot filling in spoken language understanding. IEEE/ACM Transactions on Audio, Speech, and Language Processing 23 (3), pp. 530–539. Cited by: Related Work.
  • A. Moschitti, G. Riccardi, and C. Raymond (2007) Spoken language understanding with kernels for syntactic/semantic structures. In 2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU), pp. 183–188. Cited by: Related Work.
  • D. Q. Nguyen and K. Verspoor (2018) An improved neural network model for joint pos tagging and dependency parsing. arXiv preprint arXiv:1807.03955. Cited by: Introduction, Ablation Study.
  • A. R. Pal, A. Munshi, and D. Saha (2015) An approach to speed-up the word sense disambiguation procedure through sense filtering. arXiv preprint arXiv:1610.06601. Cited by: Encoding Part-of-Speech Knowledge.
  • A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. In NIPS-W, Cited by: Implementation Details.
  • J. Pennington, R. Socher, and C. D. Manning (2014) Glove: global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543. Cited by: Implementation Details.
  • P. Price (1990) Evaluation of spoken language systems: the atis domain. In Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, June 24-27, 1990, Cited by: Related Work.
  • P. Qi, Y. Zhang, Y. Zhang, J. Bolton, and C. D. Manning (2020) Stanza: a Python natural language processing toolkit for many human languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Cited by: Implementation Details.
  • L. Qin, W. Che, Y. Li, H. Wen, and T. Liu (2019) A stack-propagation framework with token-level intent detection for spoken language understanding. In 2019 Conference on Empirical Methods in Natural Language Processing, pp. 2078–2087. Cited by: Introduction, Table 1, 5th item.
  • N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15 (1), pp. 1929–1958. Cited by: Implementation Details.
  • E. Strubell, P. Verga, D. Andor, D. Weiss, and A. McCallum (2018) Linguistically-informed self-attention for semantic role labeling. In EMNLP 2018: 2018 Conference on Empirical Methods in Natural Language Processing, pp. 5027–5038. Cited by: Introduction, Introduction, Encoding Syntactic Dependency Knowledge, Encoding Syntactic Dependency Knowledge, Parent Prediction vs. Ancestor Prediction, Related Work.
  • D. Sundararaman, V. Subramanian, G. Wang, S. Si, D. Shen, D. Wang, and L. Carin (2019) Syntax-infused transformer and bert models for machine translation and natural language understanding. arXiv preprint arXiv:1911.06156. Cited by: Introduction, Encoding Part-of-Speech Knowledge.
  • G. Tur, L. Deng, D. Hakkani-Tür, and X. He (2012) Towards deeper understanding: deep convex networks for semantic utterance classification. In 2012 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 5045–5048. Cited by: Related Work.
  • G. Tur, D. Hakkani-Tur, L. Heck, and S. Parthasarathy (2011) Sentence simplification for spoken language understanding. In 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5628–5631. Cited by: Introduction, Related Work.
  • G. Tur, D. Hakkani-Tür, and L. Heck (2010) What is left to be understood in atis?. In 2010 IEEE Spoken Language Technology Workshop, pp. 19–24. Cited by: Parent Prediction vs. Ancestor Prediction, Related Work.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 5998–6008. Cited by: Introduction, Transformer Encoder Layer.
  • T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, and A. M. Rush (2019) HuggingFace’s transformers: state-of-the-art natural language processing. ArXiv abs/1910.03771. Cited by: Implementation Details.
  • L. Zhang, D. Ma, X. Zhang, X. Yan, and H. Wang (2020a) Graph lstm with context-gated mechanism for spoken language understanding. AAAI 2020: The Thirty-Fourth AAAI Conference on Artificial Intelligence 34 (5), pp. 9539–9546. Cited by: Table 1, 6th item, Evaluation Metrics, Results, Related Work.
  • X. Zhang and H. Wang (2016) A joint model of intent determination and slot filling for spoken language understanding.. In IJCAI, Vol. 16, pp. 2993–2999. Cited by: Introduction, Related Work.
  • Z. Zhang, Y. Wu, J. Zhou, S. Duan, H. Zhao, and R. Wang (2020b) SG-net: syntax-guided machine reading comprehension.. In AAAI, pp. 9636–9643. Cited by: Related Work.

Appendix A

Below are examples of the intent detection errors made by the model without syntactic information that are related to a specific grammatical pattern between prepositions and nouns.

  • cleveland to kansas city arrive monday before 3 pm

  • kansas city to atlanta monday morning flights

  • new york city to las vegas and memphis to las vegas on Sunday

Below are examples of the slot filling errors made by the model without syntactic information that involve POS confusion.

  • cleveland to kansas city arrive monday before 3 pm

  • new york city to las vegas and memphis to las vegas on Sunday

  • baltimore to kansas city economy

The Transformer encoder-based model without syntactic information made mistakes on all of these utterances. The model trained with POS tagging and the model trained with both POS tagging and dependency prediction fail on the last utterance in the list below. The model trained with dependency prediction makes no mistakes on any of these utterances. We underline the words that are assigned wrong slots by the model without syntactic information.

  • book a reservation for velma an a and rebecca for an american pizzeria at (correct: ; prediction: ) Am in MA

  • Where is Belgium located (correct: ; prediction: )

  • May(correct: ; prediction: ) I have the movie schedules for Speakeasy Theaters