Adoption of intelligent voice assistants such as Amazon Alexa, Apple Siri, and Google Assistant has increased dramatically among consumers in the past few years: as of early 2019, it is estimated that 21% of U.S. adults own a smart speaker, a 78% year-over-year growth (npr2019smart). These systems are built to process user dialog and perform tasks such as media playback and online shopping.
A major part of any voice assistant is a semantic parsing component designed to understand the action requested by its users: given the transcription of an utterance, a voice assistant must identify the action requested by a user (play music, turn on lights, etc.), as well as parse any entities that further refine the action to perform (which song to play? which lights to turn on?). Despite huge advances in the field of Natural Language Processing (NLP), this task remains challenging due to the sheer number of ways in which a user can express a command.
Traditional approaches for task-oriented semantic dialog parsing frame the problem as a slot filling task. For example, given the query Play the song don’t stop believin by Journey, a traditional slot filling system parses it in two independent steps: (i) it first classifies the intent of the user utterance as PlaySongIntent, and then (ii) it identifies relevant named entities and tags those slots, such as don’t stop believin as a SongName and Journey as an ArtistName. Traditional semantic parsing can therefore be reduced to a text classification problem and a sequence tagging problem, which is a standard architecture for many approaches proposed in the literature (liu2016attention; mesnil2013investigation; lafferty2001conditional). This is shown in Figure 1.
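To make the two-step decomposition concrete, the following minimal sketch represents the output of each step for the example query. The BIO tagging scheme, the helper name `spans_to_bio`, and the zero-based token indices are assumptions made for illustration; the text above does not prescribe a particular tag encoding.

```python
from typing import Dict, List, Tuple


def spans_to_bio(tokens: List[str], slots: Dict[str, Tuple[int, int]]) -> List[str]:
    """Convert slot spans (start inclusive, end exclusive) into one BIO tag per token."""
    tags = ["O"] * len(tokens)
    for name, (start, end) in slots.items():
        tags[start] = f"B-{name}"
        for i in range(start + 1, end):
            tags[i] = f"I-{name}"
    return tags


tokens = "play the song don't stop believin by journey".split()

# Step (i): text classification yields a single intent label.
intent = "PlaySongIntent"

# Step (ii): sequence tagging yields one tag per token.
slots = {"SongName": (3, 6), "ArtistName": (7, 8)}
bio_tags = spans_to_bio(tokens, slots)

for token, tag in zip(tokens, bio_tags):
    print(f"{token:10s} {tag}")
```

Because every token receives exactly one tag, a representation of this kind cannot directly express nested or overlapping slots.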
As users expect more from virtual assistants, these systems need to handle more complex queries: ones that are composed of multiple intents and nested slots, or that contain conditional logic. For example, the query Are there any movie in the park events nearby? involves first finding the location of parks that are nearby and then finding relevant movie events in them. This is not straightforward in traditional slot filling systems. gupta2018semantic and einolghozati2019improving proposed multiple approaches for this using a Shift-reduce parser based on Recurrent Neural Network Grammars (dyer2016recurrent) that performs the tagging.
In this paper, we propose a unified approach to tackle semantic parsing for natural language understanding based on Transformer Sequence to Sequence models (vaswani2017attention) and a Pointer Generator Network (vinyals2015pointer; see2017get). Furthermore, we demonstrate how our approach can leverage pre-trained resources, such as neural language models, to achieve state of the art performance on several datasets. In particular, we obtain relative improvements between 3.3% and 7.7% over the best single systems on three public datasets (SNIPS (coucke2018snips), ATIS (price1990evaluation) and TOP (gupta2018semantic), the last consisting of complex queries); on two internal datasets, we show relative improvements of up to 4.9%.
Furthermore, our architecture can be easily used to parse queries that do not conform to the grammar of either the slot filling or RNNG systems. Some examples include semantic entities that correspond to overlapping spans in the query, and entities comprising non-consecutive spans. We do not report any results on these kinds of datasets, but we explain how to formulate the problems using our architecture.
In summary, our contributions are as follows.
We propose a new architecture based on Sequence to Sequence models and a Pointer Generator Network to solve the task of semantic parsing for understanding user queries.
We describe how to formulate different kinds of queries in our architecture. Our formulation is unified across queries with different kinds of tagging.
We achieve state-of-the-art results on three public datasets and two internal datasets.
We propose a unified architecture to solve the task of semantic parsing for both simple and complex queries. This architecture can also be adapted to handle queries containing slots with overlapping spans. It consists of a Sequence to Sequence model and a Pointer Generator Network. We choose a pretrained BERT (devlin2018bert) model as our encoder. Our decoder is modeled after the transformer decoder described in vaswani2017attention and is augmented with a Pointer Generator Network (vinyals2015pointer; jia2016data), which allows us to learn to generate pointers to the source sequence in our target sequence. Figure 3 shows this architecture parsing an example query. We train the model using a cross-entropy loss function with label smoothing.
In this section, we first describe how we formulate queries and their semantic parses as sequences with pointers for our architecture. We then describe our encoder and decoder components.
2.1. Query Formulation
A Sequence to Sequence architecture is trained on samples that each consist of a source sequence and a target sequence. When a word in the target sequence also appears in the source sequence, it can be replaced with a pointer token that points to that word's position in the source; this substitution is what allows us to apply the Pointer Generator Network.
Take the example query from Figure 1. In our architecture, we use the query as our source sequence. The target sequence is constructed by combining the intent with all the slots, in order, with each slot also containing its source words. The source and target sequences now look as follows.
Source: play the song don’t stop believin by journey
Target: PlaySongIntent SongName( @ptr3 @ptr4 @ptr5 )SongName ArtistName( @ptr7 )ArtistName
Here, each token @ptr_i is a pointer to the i-th word in the source sequence, counting from zero. So @ptr3, @ptr4, and @ptr5 point to the song words don’t stop believin, and @ptr7 points to the artist word journey. The slots have open and close tags since they enclose a consecutive span of source tokens. The intent is represented as a single tag at the beginning of the target sequence. We can do this for simple queries since they consist of just one intent. The target vocabulary hence consists of all the available intents, two tags (open and close) for each distinct slot, and the pointers.
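The construction of a target sequence for a simple query can be sketched as follows. This is an illustrative sketch, not the authors' code: the pointer spelling `@ptr{i}` with zero-based indices, the tag spellings `Name(` and `)Name`, and the function names are assumptions, as is the maximum source length used in the vocabulary count.

```python
from typing import List, Tuple

Slot = Tuple[str, int, int]  # (slot name, start index inclusive, end index exclusive)


def build_target(intent: str, slots: List[Slot]) -> List[str]:
    """Build a target sequence: the intent tag, then each slot wrapped around its pointers."""
    target = [intent]
    for name, start, end in sorted(slots, key=lambda s: s[1]):  # slots in source order
        target.append(f"{name}(")                               # open tag (assumed spelling)
        target.extend(f"@ptr{i}" for i in range(start, end))    # pointers replace source words
        target.append(f"){name}")                               # close tag (assumed spelling)
    return target


def target_vocab_size(num_intents: int, num_slots: int, max_source_len: int) -> int:
    """Intents + an open and a close tag per slot + one pointer per source position."""
    return num_intents + 2 * num_slots + max_source_len


source = "play the song don't stop believin by journey".split()
target = build_target("PlaySongIntent", [("ArtistName", 7, 8), ("SongName", 3, 6)])
print(" ".join(target))

# Hypothetical sizes: 23 intents, 100 slots, sources of at most 50 tokens.
print(target_vocab_size(23, 100, 50))
```

Note that the target vocabulary does not grow with the number of distinct words that users can say, since source words are only ever referenced through pointers.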
Complex queries with multiple intents and nested slots can also be transformed easily into this formulation. Figure 2 shows an example from the Facebook TOP dataset along with its parse tree. This query How far is the coffee shop can be converted into our formulation as follows.
We made a minor modification to the reference parses from the TOP dataset for our formulation. We replaced the end-brackets with custom end-brackets corresponding to the intent or slot they close. We found that this formulation helped our models perform better.
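The end-bracket replacement described above can be sketched with a stack that remembers which intent or slot is currently open. The spelling of the custom end-bracket (`LABEL]`) and the function name are assumptions made for illustration; the sketch also assumes that no ordinary source word starts with `[`.

```python
from typing import List


def add_custom_end_brackets(parse_tokens: List[str]) -> List[str]:
    """Replace each generic ']' with an end-bracket naming the intent or slot it closes."""
    open_labels: List[str] = []
    out: List[str] = []
    for tok in parse_tokens:
        if tok.startswith("["):
            open_labels.append(tok[1:])   # e.g. "[SL:CONTACT" opens label "SL:CONTACT"
            out.append(tok)
        elif tok == "]":
            if not open_labels:
                raise ValueError("unbalanced parse: unexpected ']'")
            out.append(f"{open_labels.pop()}]")
        else:
            out.append(tok)               # ordinary source word
    if open_labels:
        raise ValueError("unbalanced parse: missing ']'")
    return out


reference = "[SL:DESTINATION [IN:GET_LOCATION_HOME [SL:CONTACT Helen ] ] ]".split()
print(" ".join(add_custom_end_brackets(reference)))
```

With labeled end-brackets, the decoder has to state explicitly which node it is closing, which is one plausible reason the modified targets are easier to learn.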
Finally, we show how we can express queries from datasets that don’t conform to either the slot-filling or Shift-reduce systems. Take the following example from the healthcare domain, where the task is to extract a patient diagnosis and related information from a clinician’s notes.
A traditional slot filling system wouldn’t know which non-consecutive slots to combine, while a shift-reduce parser cannot split the middle word into a separate tag. In our architecture, we simply formulate the target sequence as follows.
2.2. BERT Encoder
Language model pretraining has been shown to improve the downstream performance on many NLP tasks (peters2018deep; radford2019language; devlin2018bert). The idea is to train a language model on a large amount of text using a next word prediction objective to learn good representations for each of the words. These representations can then be fine-tuned on a given NLP task to improve the performance of an existing model. Pretrained models improve the performance of task models since they already contain a lot of useful semantic information learned through the pretraining phase. This is even more significant when the task-specific dataset is fairly small. Some examples of pretrained models in the literature include word embeddings such as Word2Vec (mikolov2013distributed) and GloVe (pennington2014glove), and contextualized representations such as ELMo (peters2018deep), OpenAI-GPT (radford2019language), and BERT (devlin2018bert).
We choose BERT to encode the source sequence in our architecture. BERT (Bidirectional Encoder Representations from Transformers) is a language representation model architecture based on Transformers (vaswani2017attention). The original publicly available model was pretrained on millions of lines of text from BooksCorpus and English Wikipedia. Unlike other language models (ELMo, OpenAI-GPT), which are trained to predict the next token given the previous sequence of words, BERT uses a composite objective that combines masked word prediction and next sentence prediction.
BERT’s architecture is based on a multi-layer bidirectional Transformer, originally implemented in vaswani2017attention. The detailed implementation of this architecture can be found in devlin2018bert. For our experiments, we use three different variants of BERT.
For the three public datasets, we used the checkpoint released by devlin2018bert. Experiments on the two internal datasets were carried out using a model we pretrained over a large sample of queries from the live traffic of Amazon Alexa. We also experimented with a publicly available variant of BERT called RoBERTa (liu2019roberta). RoBERTa (A Robustly Optimized BERT Pretraining Approach) uses the same architecture as BERT but changes the pretraining process. The next sentence prediction objective is removed, and a dynamic masking scheme is used instead of the static one in the original BERT implementation. RoBERTa was also trained with longer sequences, larger batch sizes, and for a longer time, and was reported to match or exceed the performance of BERT on several NLP benchmarks. The detailed implementation can be found in liu2019roberta. Finally, as an ablation study, we experimented with an encoder with no pretrained weights.
2.3. Decoder with Pointer Generator Network
We use the transformer decoder proposed in vaswani2017attention in our architecture. Its masked self-attention learns to attend to target words before the current step, while its encoder-decoder attention attends to all the source words in the encoder output.
We set up the decoder with different numbers of units, layers, and attention-heads for different tasks based on the size and complexity of the queries. These details are provided in the experiments section.
In a traditional Sequence to Sequence model, the target words are generated from the decoder hidden states through a feed-forward layer that obtains unnormalized scores over a target vocabulary distribution. In our architecture, we use a Pointer Generator Network to generate two different kinds of target words: words from the target vocabulary consisting of parse symbols (the intent and slot delimiters), and words that are simply pointers to the source sequence. Our Pointer Generator Network is based on the models in vinyals2015pointer and see2017get, and is closest in implementation to jia2016data.
We now describe our decoding process. For each input source sequence $[x_1, \dots, x_n]$, we use the BERT encoder to encode it into a sequence of encoder hidden states $[e_1, \dots, e_n]$. Having generated the first $t-1$ output tokens, the transformer decoder generates the token at step $t$ as follows.
First, the decoder produces the decoder hidden state at time $t$, $d_t$, by building multi-layer multi-head self-attention on the encoded output as well as the embeddings of the previously generated output sequence, as described in vaswani2017attention. We feed $d_t$ through a dense layer to produce scores $[s_1, \dots, s_{|V|}]$ for each word in the vocabulary $V$. $V$ contains all symbols necessary for the parse (intents, slots) but not regular words appearing on the source side.
We also use $d_t$ as a query and compute unnormalized attention scores $[a_1, \dots, a_n]$ with the encoded sequence. Concatenating the unnormalized attention scores (size $n$) and the output of the dense layer (size $|V|$), we obtain an unnormalized distribution over $|V| + n$ tokens, the first $|V|$ of which are the output parsing vocabulary and the last $n$ of which are the pointer words pointing to the source tokens. We then feed this through a softmax layer to obtain the final probability distribution. This probability is used in the loss function during training and is used to choose the next token to generate during inference. Since the transformer decoder uses embeddings of previously generated tokens, we use a set of special embeddings to represent pointer tokens.
In the example in Figure 3, we are trying to predict a target word after the token 'MediaType(' at step 5. As shown in the figure, we compute the attention scores (blue, left) over each of the source tokens, and the vocabulary scores (green, right) over the parsing vocabulary. We expect the model to produce the highest score for the attention entry whose pointer represents the source word hits.
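A single decoding step of this kind can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions rather than the implementation used in the experiments: the dense layer has no bias, the attention score is a plain dot product, the decoder state and encoder states share one dimension, and all names and numbers are hypothetical.

```python
import math
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def softmax(scores: Sequence[float]) -> List[float]:
    m = max(scores)                                  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pointer_generator_step(decoder_state, vocab_weights, encoder_states) -> List[float]:
    """Return one distribution over |V| parse symbols followed by n source positions."""
    vocab_scores = [dot(w, decoder_state) for w in vocab_weights]      # dense layer, size |V|
    pointer_scores = [dot(e, decoder_state) for e in encoder_states]   # attention scores, size n
    return softmax(vocab_scores + pointer_scores)                      # concatenate, then normalize


decoder_state = [1.0, 0.0]
vocab_weights = [[0.0, 1.0], [0.5, 0.0], [0.0, 0.0]]   # |V| = 3 parse symbols
encoder_states = [[0.0, 0.0], [3.0, 0.0]]              # n = 2 source tokens

probs = pointer_generator_step(decoder_state, vocab_weights, encoder_states)
best = max(range(len(probs)), key=probs.__getitem__)
if best >= len(vocab_weights):
    print("pointer to source position", best - len(vocab_weights))
else:
    print("parse symbol index", best)
```

Because both kinds of scores pass through one softmax, parse symbols and pointers compete directly for probability mass at every step.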
We test our approach on five different datasets (three publicly available, two internal), which we describe in this section.
3.1. Facebook TOP
The Task Oriented Parsing (TOP) (gupta2018semantic) dataset contains complex hierarchical and nested queries that make the task of semantic parsing more challenging. It contains around 45k annotations with 25 intents and 36 slots, randomly split into 31k training, 5k validation and 9k test utterances. The dataset mainly consists of user queries about navigation and various public events. An example from this dataset can be seen in Figure 2. The IN: prefix stands for intent while SL: is for slot. We can see how there are multiple intents and nested slots in the semantic interpretation. This makes the query much harder to interpret and parse using a simple slot tagging model that tags each word with a single slot.
3.2. SNIPS
The SNIPS dataset (coucke2018snips) is a public dataset that is used for training and testing semantic parsing models for voice assistants. It consists of utterances that belong to seven different intents: SearchCreativeWork, GetWeather, BookRestaurant, PlayMusic, AddToPlaylist, RateBook, and SearchScreeningEvent. Each intent contains around 2000 examples to train and 100 to test.
This dataset contains only simple queries with single intents and flat slots. An example is Will there be fog in Tahquamenon Falls State Park, where the intent is GetWeather and the slots are condition_description for fog and geographic_poi for Tahquamenon Falls State Park.
The dataset was originally used to evaluate models in the Snips Voice Platform. It has since been a widely used dataset to benchmark the performance of various task-oriented parsing models.
3.3. ATIS
The Airline Travel Information System (ATIS) (price1990evaluation) corpus is a widely used dataset in spoken language understanding. It was built by collecting and transcribing audio recordings of people making flight reservations in the early 90s. It consists of simple queries.
There are seventeen different goals or intents, such as Flight or Aircraft capacity. This distribution is however skewed, with the Flight intent covering about 70% of the total queries. An example from this dataset consists of the query How much is the cheapest flight from Boston to New York tomorrow morning? The intent is Airfare, while the slots tag important information like the departure and arrival cities, and the departure times.
The ATIS corpus has supported research in the field of spoken language understanding for more than twenty years. Some researchers have performed extensive error analysis on the state of the art discriminative models for this dataset and reported that despite really low error rates, there exist many unseen categories and sequences in the dataset that can benefit from incorporating linguistically motivated features (tur2010left). This supports the continued utility of ATIS as a research corpus.
3.4. Internal Datasets
Our internal datasets consist of millions of user utterances that are used to train and test Amazon Alexa. For our experiments, we sampled two datasets of utterances, one from the music domain and the other from the video domain. Utterances in these domains naturally include a large number of entities (e.g. artist and album names, movie and video titles), and thus represent a good benchmark for the ability of any neural model to generalize over a diverse set of queries. The example in Figure 3 is from the music domain.
The sampled music domain dataset contains 6.2M training and 200k test utterances, with 23 intents and 100 slots. The video domain dataset contains 1M training and just 5k test utterances; parses in this dataset comprise 24 distinct intents and 59 slots.
4. Baseline Models
We benchmark our performance on the internal datasets by comparing it to a well-tuned RNN-based model. The model learns to perform joint intent and slot tagging using a bidirectional LSTM and a Conditional Random Field (CRF) (huang2015bidirectional). We further enhanced this baseline by replacing its embedding and encoder layers with a language model pretrained on a subset of Alexa’s live traffic. These components were fine-tuned on the two datasets described in Section 3.4.
For the ATIS and SNIPS datasets, we use the top four performing methods reported by zhang2019joint as baselines. All these models perform joint intent and slot tagging. There are two variants that use RNNs: a simple RNN-based model, and an RNN model augmented with attention. There is also a model that relies on attention alone, the slot gated full attention model. The final baseline, CapsuleNLU, uses Capsule Networks (sabour2017dynamic).
For the TOP dataset, we pick a model based on Recurrent Neural Network Grammars (RNNG) (dyer2016recurrent), the Shift Reduce Parser. We provide a brief overview of this model as described in gupta2018semantic: the parse tree is constructed using a sequence of transitions, or actions. The transitions are defined as a set of SHIFT, REDUCE, and the generation of intent and slot labels. The SHIFT action consumes an input token (that is, adds the token as a child of the rightmost open sub-tree node) and the REDUCE action closes a sub-tree. The third set of actions generates non-terminals: the slot and intent labels. The model learns to perform one of these actions at each step in time.
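The transition system described above can be illustrated with a small interpreter that applies a given action sequence to a token buffer. This sketch covers only the deterministic tree-building part, not the neural model that chooses the actions; the nested-list tree representation and the function name are assumptions.

```python
from typing import List


def run_transitions(tokens: List[str], actions: List[str]) -> list:
    """Apply SHIFT / REDUCE / label-generation actions and return the resulting trees."""
    buffer = list(tokens)
    root: list = ["ROOT"]
    open_nodes = [root]                       # stack of currently open sub-trees
    for action in actions:
        if action == "SHIFT":
            # Consume the next input token as a child of the rightmost open node.
            open_nodes[-1].append(buffer.pop(0))
        elif action == "REDUCE":
            if len(open_nodes) == 1:
                raise ValueError("REDUCE with no open sub-tree")
            open_nodes.pop()                  # close the rightmost open sub-tree
        else:
            # Any other action generates a non-terminal (an intent or slot label).
            node = [action]
            open_nodes[-1].append(node)
            open_nodes.append(node)
    if buffer or len(open_nodes) != 1:
        raise ValueError("incomplete derivation")
    return root[1:]


actions = ["SL:DESTINATION", "IN:GET_LOCATION_HOME", "SL:CONTACT",
           "SHIFT", "REDUCE", "REDUCE", "REDUCE"]
tree = run_transitions(["Helen"], actions)
print(tree)
```

Since every SHIFT consumes the next token in order, each node in such a derivation covers a contiguous span of the input.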
We report scores of three experimental setups with the shift reduce parser from einolghozati2019improving: a simple shift reduce parser, a shift reduce parser augmented with ELMo embeddings, and an ensemble of these models augmented with ELMo and an SVM language model reranker.
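Table 1: Exact match (EM) and intent classification accuracy (%) on the TOP dataset.
| Method | EM accuracy | Intent accuracy |
|---|---|---|
| Shift Reduce (SR) Parser (einolghozati2019improving) | 80.86 | – |
| SR with ELMo embeddings (einolghozati2019improving) | 83.93 | – |
| SR ensemble + ELMo + SVMRank (einolghozati2019improving) | 87.25 | – |
| Seq2Seq-Ptr (no pretraining) | 79.25 | 97.43 |
| Seq2Seq-Ptr (BERT encoder) | 83.13 | 97.91 |
| Seq2Seq-Ptr (RoBERTa encoder) | 86.67 | 98.13 |

Table 2: Exact match (EM) and intent classification accuracy (%) on the SNIPS dataset.
| Method | EM accuracy | Intent accuracy |
|---|---|---|
| Joint BiRNN (hakkani2016multi) | 73.20 | 96.90 |
| Attention BiRNN (liu2016attention) | 74.10 | 96.70 |
| Slot Gated Full Attention (goo2018slot) | 75.50 | 97.00 |
| Seq2Seq-Ptr (no pretraining) | 85.43 | 97.00 |
| Seq2Seq-Ptr (BERT encoder) | 86.29 | 98.29 |
| Seq2Seq-Ptr (RoBERTa encoder) | 87.14 | 98.00 |

Table 3: Exact match (EM) and intent classification accuracy (%) on the ATIS dataset.
| Method | EM accuracy | Intent accuracy |
|---|---|---|
| Joint BiRNN (hakkani2016multi) | 80.70 | 92.60 |
| Attention BiRNN (liu2016attention) | 78.90 | 91.10 |
| Slot Gated Full Attention (goo2018slot) | 82.20 | 93.60 |
| Seq2Seq-Ptr (no pretraining) | 81.08 | 95.18 |
| Seq2Seq-Ptr (BERT encoder) | 86.37 | 97.42 |
| Seq2Seq-Ptr (RoBERTa encoder) | 87.12 | 97.42 |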
5. Experimental Setup
All our models were trained on a machine with 8 NVIDIA Tesla V100 GPUs, each with 16GB of memory. When using pretrained encoders, we leveraged gradual unfreezing to effectively tune the language model layers on our datasets. We used the "Base" variants of the BERT and RoBERTa encoders, which use 768-dimensional embeddings, 12 layers, 12 heads, and 3072 hidden units. When training from scratch, we used a smaller encoder consisting of 512-dimensional embeddings, 6 layers, 8 heads, and 1024 hidden units.
Depending on the dataset, we used either a decoder with 128 units, 4 layers, 3 heads, and 512 hidden units (Facebook TOP, ATIS, SNIPS) or a larger decoder with 512 units, 6 layers, 8 heads, and 1024 hidden units (internal Music and Video datasets). We used bi-linear product attention to score the source words in the Pointer Network.
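A bi-linear product score between a decoder state $d$ and each encoder state $e_i$ takes the form $d^\top W e_i$ for a learned matrix $W$. The sketch below shows the computation in plain Python; the function name, the absence of a bias term, and the toy dimensions are assumptions made for illustration.

```python
from typing import List, Sequence


def bilinear_scores(decoder_state: Sequence[float],
                    weight: Sequence[Sequence[float]],
                    encoder_states: Sequence[Sequence[float]]) -> List[float]:
    """Score every encoder state e_i against the decoder state d as d^T W e_i."""
    encoder_dim = len(weight[0])
    # projected = d^T W, a vector in the encoder's dimension
    projected = [
        sum(d * weight[i][j] for i, d in enumerate(decoder_state))
        for j in range(encoder_dim)
    ]
    return [sum(p * e for p, e in zip(projected, enc)) for enc in encoder_states]


decoder_state = [1.0, 2.0]                       # decoder dimension 2
weight = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 0.0]]                       # W has shape (2, 3)
encoder_states = [[1.0, 1.0, 1.0],
                  [0.0, 3.0, 5.0]]               # encoder dimension 3

print(bilinear_scores(decoder_state, weight, encoder_states))
```

Since $W$ is rectangular, this form can score a decoder state against encoder states of a different size, which matters when a small decoder is paired with a larger encoder.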
While training, the cross-entropy loss function was modified with label smoothing. We used the Adam (kingma2014adam) optimizer with the Noam learning rate schedule (vaswani2017attention), each adjusted differently for different datasets. At inference time, we used beam search decoding with a beam size of 4.
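The two training ingredients named above can be sketched as follows. The label smoothing variant shown (mixing the one-hot target with a uniform distribution over all classes) is one common formulation and is an assumption here, as are the function names and every numeric value; the Noam schedule follows the formula given in vaswani2017attention.

```python
import math
from typing import Sequence


def label_smoothed_nll(log_probs: Sequence[float], target_index: int, epsilon: float) -> float:
    """Cross entropy against a target that puts (1 - epsilon) on the gold class
    and spreads epsilon uniformly over all classes."""
    nll = -log_probs[target_index]
    uniform = -sum(log_probs) / len(log_probs)
    return (1.0 - epsilon) * nll + epsilon * uniform


def noam_learning_rate(step: int, model_dim: int, warmup_steps: int, factor: float = 1.0) -> float:
    """Linear warmup followed by inverse square-root decay."""
    step = max(step, 1)
    return factor * model_dim ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


# Hypothetical values, for illustration only.
log_probs = [math.log(p) for p in (0.98, 0.01, 0.01)]
print(label_smoothed_nll(log_probs, target_index=0, epsilon=0.1))

for step in (100, 4000, 8000):
    print(step, noam_learning_rate(step, model_dim=512, warmup_steps=4000))
```

The learning rate rises until `step` reaches `warmup_steps` and decays afterwards, while label smoothing penalizes overly confident predictions even when they are correct.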
6. Results and Discussion
We use exact match (EM) accuracy as the main metric to measure the performance of our models across all datasets. Under this metric, the entire semantic parse for a query has to match the reference parse to be counted as correct. Because EM is generally more challenging than slot-level precision and recall or semantic error rate (thomson2012nbest), it is better suited to compare high-performing systems like the ones studied in this work. For completeness, we also report the intent classification accuracy for our models.
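Both metrics are simple to compute once parses are represented as token sequences. The sketch below assumes, as in the simple-query formulation, that the intent is the first token of each sequence; the function names and the toy parses are hypothetical.

```python
from typing import List, Sequence


def exact_match_accuracy(predictions: Sequence[List[str]], references: Sequence[List[str]]) -> float:
    """Fraction of queries whose entire predicted parse equals the reference parse."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    return sum(p == r for p, r in zip(predictions, references)) / len(references)


def intent_accuracy(predictions: Sequence[List[str]], references: Sequence[List[str]]) -> float:
    """Fraction of queries whose first token (assumed to be the intent) is correct."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    return sum(p[:1] == r[:1] for p, r in zip(predictions, references)) / len(references)


references = [["PlayIntent", "Song(", "@ptr1", ")Song"],
              ["WeatherIntent", "City(", "@ptr3", ")City"]]
predictions = [["PlayIntent", "Song(", "@ptr1", ")Song"],
               ["WeatherIntent", "City(", "@ptr2", ")City"]]

print(exact_match_accuracy(predictions, references))
print(intent_accuracy(predictions, references))
```

A single wrong pointer makes the whole parse count as an error under EM, even though its intent is still counted as correct.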
The results from our experiments are documented in Tables 1-5. Our models match or beat the baselines across all datasets on both exact match and intent classification accuracies. We see significant improvements on both simple and complex datasets.
6.1. Complex Queries
We achieve an improvement of 2.7 EM accuracy points (+3.3% relative) on the TOP dataset over the state-of-the-art single model on this dataset (Table 1). Our Seq2Seq-Ptr model with RoBERTa encoder is only surpassed by the ensemble model reported in einolghozati2019improving, which scores 0.6 EM accuracy points higher.
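The absolute and relative figures quoted for TOP follow directly from the EM accuracies reported for that dataset: 83.93 for the best single baseline, 86.67 for Seq2Seq-Ptr with the RoBERTa encoder, and 87.25 for the ensemble. A short check, with a hypothetical helper name:

```python
def relative_improvement(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


best_single_baseline = 83.93   # SR with ELMo embeddings
seq2seq_ptr_roberta = 86.67    # Seq2Seq-Ptr (RoBERTa encoder)
ensemble = 87.25               # SR ensemble + ELMo + SVMRank

print(round(seq2seq_ptr_roberta - best_single_baseline, 2))                        # absolute gain in EM points
print(round(relative_improvement(seq2seq_ptr_roberta, best_single_baseline), 1))   # relative gain in percent
print(round(ensemble - seq2seq_ptr_roberta, 2))                                    # remaining gap to the ensemble
```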
In addition, we find that even without specifying any hard requirements for the grammar of the parse trees in the complex queries, 98% of the generated parses are well-formed. For the simple query datasets, this figure was greater than 99%; the difference is expected since the grammar is easier to learn there.
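The text does not spell out the exact criterion used to decide whether a generated parse is well formed. One plausible check for bracketed TOP-style parses, given here purely as an assumption, is that brackets are balanced, every word sits inside some node, and there is exactly one top-level node:

```python
from typing import List


def is_well_formed(parse_tokens: List[str]) -> bool:
    """Check bracket balance and a single root for a parse such as '[IN:A x [SL:B y ] ]'."""
    depth = 0
    roots = 0
    for tok in parse_tokens:
        if tok.startswith("["):
            if depth == 0:
                roots += 1            # a new top-level node starts here
            depth += 1
        elif tok == "]":
            depth -= 1
            if depth < 0:
                return False          # closing bracket without a matching open node
        elif depth == 0:
            return False              # a word outside of any node
    return depth == 0 and roots == 1


generated = [
    "[IN:A x [SL:B y ] ]".split(),
    "[IN:A x [SL:B y ]".split(),
]
rate = sum(is_well_formed(p) for p in generated) / len(generated)
print(rate)
```

A stricter check could additionally require that intents and slots alternate along each path of the tree, or that every source word appears exactly once.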
During error analysis, we found an interesting example in the TOP dataset where we believe our model generates a valid, more meaningful parse than the reference annotation. For the query What time do I need to leave to get to Helen by 8pm, our model parses Helen as [SL:DESTINATION [IN:GET_LOCATION_HOME [SL:CONTACT Helen ] ] ], while it is annotated as [SL:DESTINATION Helen ]. Our parse resolves the query as finding the estimated departure time to get to a location that is the home location of a contact named Helen, while the reference annotation suggests that the correct interpretation is to find the estimated departure time to get to a destination named Helen. We believe our parse is more likely to be correct given that Helen is most likely the name of a person.
6.2. Simple Queries
On the SNIPS and ATIS datasets, we note that the best version of our method (Seq2Seq-Ptr with RoBERTa encoder) achieves a significant improvement in EM accuracy over existing baselines (+7.7% and +4.5%, respectively). Using a BERT encoder causes a slight decrease in performance, but still achieves a meaningful improvement over the previous state of the art (zhang2019joint); this is consistent with what has been observed on other NLP tasks (liu2019roberta). If no pretraining is used, performance is further reduced, but it is notable that this variant still beats all the baselines on the SNIPS dataset.
For our internal Alexa datasets, we note that the proposed Seq2Seq-Ptr method obtains comparable results to a BiLSTM-CRF tagger on the music domain, and slightly better EM accuracy (+1.9%) on the video domain. We believe our model wasn’t able to outperform the baseline on the music domain because the entities in this domain are very diverse, especially song or album names. Sequence tagging methods therefore benefit from having to solve a simpler task of having to tag each word in the sequence, as opposed to our unconstrained model. We would however like to note that our from-scratch variants beat the from-scratch baselines on both domains. Also curiously, the performance of Seq2Seq-Ptr with a BERT encoder fell behind that of a sequence to sequence model trained from scratch. Since the scratch model uses a smaller transformer encoder (6 layers with 8 heads per layer instead of 12/12), we believe it was able to converge more effectively than the BERT encoder.
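Table 4: Relative change in EM and intent classification accuracy on the internal music dataset, with respect to the BiLSTM-CRF baseline without pretraining.
| Method | EM accuracy | Intent accuracy |
|---|---|---|
| BiLSTM-CRF (no pretraining) | baseline | baseline |
| BiLSTM-CRF (pretrained LM) | +3.0% | +0.1% |
| Seq2Seq-Ptr (no pretraining) | -0.3% | -0.7% |
| Seq2Seq-Ptr (BERT encoder) | -2.2% | -0.8% |
| Seq2Seq-Ptr (RoBERTa encoder) | -3.5% | -0.7% |

Table 5: Relative change in EM and intent classification accuracy on the internal video dataset, with respect to the BiLSTM-CRF baseline without pretraining.
| Method | EM accuracy | Intent accuracy |
|---|---|---|
| BiLSTM-CRF (no pretraining) | baseline | baseline |
| BiLSTM-CRF (pretrained LM) | +3.0% | +0.1% |
| Seq2Seq-Ptr (no pretraining) | +2.9% | -0.1% |
| Seq2Seq-Ptr (BERT encoder) | +0.1% | -0.2% |
| Seq2Seq-Ptr (RoBERTa encoder) | +4.9% | -0.2% |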
7. Related Work
The task of semantic parsing for intent and slot detection is well established in the literature. Traditionally, this was done with slot filling systems that classify the query and then label each word in the query. Several approaches followed this design using Recurrent Neural Networks (liu2016attention; mesnil2013investigation). Researchers have also experimented with Convolutional Neural Networks and showed good results (kim2014convolutional) and, more recently, with Capsule Networks (sabour2017dynamic; zhang2019joint).
Prior to the advent of deep learning models, the task of sequence labeling was tackled with the use of Conditional Random Fields (CRF) (lafferty2001conditional; jiao2006semi; peng2004chinese). CRFs learn pairwise potentials over the labels of adjacent words, which allow models to find more probable label sequences for a given query.
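The role of pairwise potentials can be illustrated with Viterbi decoding for a linear-chain model, which finds the label sequence that maximizes the sum of per-word scores and label-to-label transition scores. This is a generic textbook sketch rather than code from any of the cited systems; the scores below are made up.

```python
from typing import List, Sequence


def viterbi(emissions: Sequence[Sequence[float]], transitions: Sequence[Sequence[float]]) -> List[int]:
    """Return the best label path given emissions[t][y] and transitions[y_prev][y]."""
    n_labels = len(emissions[0])
    scores = list(emissions[0])               # best score of any path ending in each label
    backpointers: List[List[int]] = []
    for t in range(1, len(emissions)):
        new_scores, pointers = [], []
        for y in range(n_labels):
            best_prev = max(range(n_labels), key=lambda p: scores[p] + transitions[p][y])
            new_scores.append(scores[best_prev] + transitions[best_prev][y] + emissions[t][y])
            pointers.append(best_prev)
        scores = new_scores
        backpointers.append(pointers)
    best = max(range(n_labels), key=scores.__getitem__)
    path = [best]
    for pointers in reversed(backpointers):   # follow back-pointers to recover the path
        best = pointers[best]
        path.append(best)
    path.reverse()
    return path


emissions = [[3.0, 0.0], [0.0, 1.0], [0.0, 1.0]]   # per-word scores for labels 0 and 1
transitions = [[0.0, -5.0], [-5.0, 0.0]]           # switching labels is heavily penalized

print(viterbi(emissions, transitions))
```

Labeling each word independently would pick label 1 for the last two words in this example, whereas the transition penalties make the consistent sequence of label 0 the best overall path.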
Most of this work is valid for semantic parsing for simple queries which boils down to a sequence labeling task. To handle more complex cases with hierarchical slots such as the example in Figure 2, researchers have experimented with Sequence to Sequence models and models based on Recurrent Neural Network Grammars (RNNG) (dyer2016recurrent). RNNGs were shown to perform better on complex queries than RNN or Transformer-based Sequence to Sequence models (gupta2018semantic). Researchers have also explored models involving logical forms and discourse for language representation (liang2016learning; zettlemoyer2012learning; van2018exploring).
The Pointer Generator Network in our architecture was introduced in vinyals2015pointer. It has been used in NLP applications where some words from the source sequence reappear in the target sequence, such as text summarization and style transfer (see2017get; paulus2017deep; prabhumoye2018style). It has also been used to copy out-of-vocabulary words from the source to the target in machine translation (klein2017opennmt). Our implementation of the Pointer Network is closest to the architecture in jia2016data. By using pointers to represent the source tokens and imposing no particular logical form over our target sequence, we can handle any kind of query for parsing. This makes our architecture as expressive as logical forms, while also being able to learn as easily as simple slot tagging systems.
We propose a unified architecture for the task of semantic parsing for different kinds of queries. We show that our architecture matches or outperforms existing approaches across multiple datasets: internal music and video datasets, SNIPS, ATIS, and Facebook TOP. We significantly outperform the current state-of-the-art models on the public datasets TOP (3.3%), SNIPS (7.7%), and ATIS (4.5%).
We describe how to apply this architecture to both simple queries and complex queries with hierarchical and nested slots. We also describe how to formulate any set of queries with non-conforming grammars to work with our architecture, making this model applicable to many different types of semantic parsing. We leave the non-conforming grammar task to future work.