1 Introduction
Spoken language understanding (SLU) is the task of inferring the semantics of user-spoken utterances. Typically, SLU is performed by populating an utterance interpretation with the results of three subtasks: domain classification (DC), intent classification (IC), and named entity recognition (NER) [tur2011spoken]. The conventional approach to SLU uses two distinct components to sequentially process a spoken utterance: an automatic speech recognition (ASR) model that transcribes the speech to a text transcript, followed by a natural language understanding (NLU) model that predicts the domain, intent and entities given the transcript [lee2010recent]. Recent applications of deep learning approaches to both ASR [hinton2012deep, graves2013speech, bahdanau2016end] and NLU [xu2014contextual, ravuri2015recurrent, sarikaya2014application] have improved the accuracy and efficiency of SLU systems and driven the commercial success of voice assistants such as Amazon Alexa and Google Assistant.
However, the modular design of traditional SLU systems has a major drawback. The two component models (ASR and NLU) are trained independently with separate objectives, so errors encountered in the training of either model do not inform the other. As a result, ASR errors produced during inference might lead to incorrect NLU predictions, since the ASR errors are absent from the clean transcripts on which NLU models are trained. An increasingly popular approach to address this problem is to employ models that predict SLU output directly from a speech signal input [lugosch2019speech, haghani2018audio, chen2018spoken, serdyuk2018towards]. We refer to this class of SLU models as speech-to-interpretation (STI). STI models can be trained with a single optimization objective to recognize the semantics of a spoken utterance; some studies report better performance by pretraining initial layers of the model on transcribed speech, followed by fine-tuning with semantic targets [lugosch2019speech, chen2018spoken]. The semantic target types predicted by previously published models vary from just the intent [lugosch2019speech, chen2018spoken] to the full SLU interpretation [haghani2018audio]. A majority of these works report near-perfect performance metrics for DC or IC on independently collected datasets. For example, IC accuracy values of [lugosch2019speech] and [chen2018spoken], DC accuracy of [serdyuk2018towards], and F1-scores of and for DC and IC respectively [haghani2018audio] have been reported. While these studies demonstrate successful applications of STI architectures, their results are often benchmarked on datasets with limited complexity.
Most publicly available SLU datasets used to train STI models contain only a moderate number of unique utterances and intent classes [lugosch2019speech, saade2018spoken, picovoice]. For example, the Fluent Speech dataset [lugosch2019speech] only contains 248 unique utterance phrasings from one domain. Similarly, transcriptions within some intent classes in the dataset used in [chen2018spoken] are keywords such as ‘true’ and ‘false’. As a result, the question of how well STI models generalize to broader use cases remains unexplored. In this work, we study the relationship between the performance of STI models and the inherent difficulty of their use cases. We pose the following research questions:
Q1. Can the semantic complexity of a dataset be quantified to understand the difficulty of an STI task?
Q2. How does varying the semantic complexity measure of a dataset affect the STI model performance?
We propose several empirical measures of the semantic complexity of an SLU dataset (Q1). These measures help quantify the difficulty of the SLU task underlying a particular use case. We contextualize the near-perfect performance metrics reported in the literature with complexity values for the corresponding training datasets and show that these datasets have low semantic complexity. We also perform experiments to study the relationship between our proposed complexity measures and the performance of STI models (Q2). We advocate for reporting the complexity values of SLU datasets along with the performance of STI models to help uncover the applicability of the proposed architectures. In the following sections, we describe our semantic complexity measures, report experimental results, and conclude with a discussion and suggestions for future work in this area.
2 Semantic Complexity
Our first research question (Q1) asks whether quantifying the semantic complexity of an SLU dataset can indicate the difficulty of the underlying SLU task. We propose several data-driven measures of semantic complexity that are computed from the SLU dataset transcriptions. These measures fall into two broad categories: lexical measures and geometric measures. The major difference between these groups is the representation used for the transcriptions. The lexical measures use common lexical features, such as n-grams, whereas the geometric measures use vector encodings to represent the transcription text. In this section we define each measure, and in Section 3 we show that STI model performance is correlated with semantic complexity computed using these measures.
2.1 Lexical Measures
The lexical measures that we consider are vocabulary size, number of unique transcripts, and n-gram entropy. The vocabulary size measures the number of unique words in the dataset transcripts.
n-Gram Entropy: The n-gram entropy measures the randomness of the dataset transcripts over its constituent n-grams, $x$. It is computed by the equation:
$$H(X) = -\sum_{x \in X} p(x) \log_2 p(x)$$
where $X$ is the set of unique n-grams in the dataset and $p(x)$ is the probability of n-gram $x$ occurring in $X$. A dataset that contains only a single unique n-gram has an entropy of 0, whereas a dataset with no repeated n-grams has an entropy of $\log_2 |X|$ (uniform distribution over $X$). Larger values of n-gram entropy represent higher randomness and variety in the utterance patterns, indicating larger semantic complexity.
2.2 Geometric Measures
For this class of measures, the dataset transcripts are first encoded in a
high-dimensional Euclidean space, then subsequent analyses are performed. We
propose two measures under this category: minimum spanning tree (MST)
complexity and adjusted Rand index (ARI) complexity. First, we explain
how to compute these two measures given any transcript encoding. Second, we
describe the encoding methods that we used. These measures are computed with the
assumption that each example in the dataset has a
semantic label, such as an intent label.
Minimum Spanning Tree Complexity (MST): We hypothesize that examples with similar transcript encodings yet different semantic labels will confuse an STI model. To capture this notion, we compute a minimum spanning tree (MST) of the transcript encodings. The MST is computed over a complete graph where the vertices correspond to the transcripts and the edges are weighted by the distance between their encodings. The edges of this MST connect pairs of examples with similar transcript encodings, though some pairs will have different semantic labels. We define the MST complexity measure as the cumulative weight of MST edges that connect examples with different labels, using the expression:
$$C_{\mathrm{MST}} = \frac{1}{W} \sum_{(i,j) \in \mathrm{MST}} w_{ij} \, \mathbb{1}[y_i \neq y_j]$$
where $y_i$ and $y_j$ are the labels of examples with transcript encodings $v_i$ and $v_j$, $w_{ij}$ is the weight of the edge between $v_i$ and $v_j$, and $W = \sum_{(i,j) \in \mathrm{MST}} w_{ij}$ is the total edge weight of the MST (see Figure 1). The MST complexity ranges from 0 to 1, with 1 indicating the highest semantic complexity.
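The MST complexity described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not necessarily the implementation used for the reported results: the function names `mst_edges` and `mst_complexity` are hypothetical, Prim's algorithm is one of several valid ways to build the tree, and the Euclidean distance is an assumption, since the text only states that edges are weighted by the distance between encodings.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def mst_edges(encodings, dist=euclidean):
    """Prim's algorithm on the complete graph; returns (i, j, weight) edges."""
    n = len(encodings)
    if n < 2:
        return []
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known edge weight linking each vertex to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # Add the vertex that is cheapest to attach to the current tree.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u, best[u]))
        for v in range(n):
            if not in_tree[v]:
                w = dist(encodings[u], encodings[v])
                if w < best[v]:
                    best[v], parent[v] = w, u
    return edges


def mst_complexity(encodings, labels, dist=euclidean):
    """Fraction of total MST edge weight on edges joining different labels."""
    edges = mst_edges(encodings, dist)
    total = sum(w for _, _, w in edges)
    if total == 0:
        return 0.0  # degenerate case: fewer than two points, or all identical
    cross = sum(w for i, j, w in edges if labels[i] != labels[j])
    return cross / total


if __name__ == "__main__":
    # Toy 2-D encodings: two tight groups that are far apart.
    points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
    print(mst_complexity(points, ["lights", "lights", "music", "music"]))
```

In the toy example only the long edge between the two groups joins different labels. Interleaving the labels along the line would make every MST edge a cross-label edge and push the value to its maximum of 1.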
Adjusted Rand Index Complexity (ARI): As a complementary view to the MST complexity, we hypothesize that semantic classes whose transcripts do not cluster together in the encoding space could be confusing to an STI model. To capture this notion, we first perform an agglomerative clustering on the transcript encodings (we used the complete-linkage criterion with the cosine distance). After performing the clustering, each transcript has a semantic class label assigned by the data annotator and a cluster label assigned by the clustering algorithm. The adjusted Rand index (ARI) [hubert1985comparing] quantifies the agreement between the two labelings. It is calculated with the equation:
$$\mathrm{ARI} = \frac{\mathrm{RI} - \mathbb{E}[\mathrm{RI}]}{\max(\mathrm{RI}) - \mathbb{E}[\mathrm{RI}]}, \qquad \mathrm{RI} = \frac{a + b}{\binom{n}{2}}$$
where $\mathrm{RI}$ is the Rand index, $a$ is the number of example pairs that have the same class and cluster labels, $b$ is the number of pairs that have different class and cluster labels, and $n$ is the number of examples. The ARI score ranges from $-1$ to $1$. We define the ARI complexity measure as $\frac{1}{2}(1 - \mathrm{ARI})$, which maps the scores to the range from 0 to 1, with 1 indicating the highest semantic complexity.
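The following sketch computes the ARI from a class labeling and a cluster labeling using the standard pair-counting (contingency table) form of [hubert1985comparing]. It is illustrative only. The function names are hypothetical, the cluster labels are assumed to come from a clustering step performed elsewhere, and the rescaling (1 - ARI) / 2 is an assumed reading of the linear map from ARI scores to a complexity between 0 and 1.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(class_labels, cluster_labels):
    """ARI between two labelings, via the pair-counting contingency form."""
    n = len(class_labels)
    pairs = comb(n, 2)
    if pairs == 0:
        return 1.0  # fewer than two examples: treat as perfect agreement
    # Pairs placed together under both labelings, under classes only,
    # and under clusters only.
    together_both = sum(
        comb(c, 2) for c in Counter(zip(class_labels, cluster_labels)).values()
    )
    together_class = sum(comb(c, 2) for c in Counter(class_labels).values())
    together_cluster = sum(comb(c, 2) for c in Counter(cluster_labels).values())
    expected = together_class * together_cluster / pairs
    maximum = (together_class + together_cluster) / 2
    if maximum == expected:
        return 1.0  # degenerate labelings, e.g. one class and one cluster
    return (together_both - expected) / (maximum - expected)


def ari_complexity(class_labels, cluster_labels):
    """Assumed rescaling: high agreement (ARI near 1) gives complexity near 0."""
    return (1 - adjusted_rand_index(class_labels, cluster_labels)) / 2


if __name__ == "__main__":
    intents = ["lights", "lights", "music", "music"]
    print(ari_complexity(intents, [0, 0, 1, 1]))  # clusters match the intents
    print(ari_complexity(intents, [0, 1, 0, 1]))  # clusters cut across intents
```

Cluster identifiers are arbitrary, so only the grouping matters: relabeling the clusters leaves the score unchanged.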
Transcript Encodings: We used two different methods to produce the transcript encodings: (1) Latent Dirichlet Allocation (LDA) [blei2003latent], the standard unsupervised method from the field of topic modeling, and (2) the contemporary sentence encoding method from [cer2018universal].
Latent Dirichlet Allocation (LDA) is an unsupervised method used to discover ‘latent’ or ‘hidden’ topics from a collection of documents (transcripts in this case) [blei2003latent]. LDA fits a hierarchical Bayesian model so that each transcript is represented as a distribution over topics. The topic-distribution vectors learned by LDA for each transcript in the dataset become the encoding used for further analysis.
The Universal Sentence Encoder (USE) is a pretrained transformer model [cer2018universal]. It outputs a 512-dimensional encoding vector for each transcript in the dataset, which we use for further analysis.
Given the aforementioned methods, we compute four geometric complexity measures for a dataset: MST-LDA, MST-USE, ARI-LDA and ARI-USE. Note that the proposed geometric complexity measures can be applied to any class of encodings beyond LDA and USE.
3 Experiments and Results
Motivated by the research questions Q1 and Q2 stated in Section 1, we perform our experiments in two stages.
3.1 Quantifying Semantic Complexity
In line with our first research question, and to draw a comparison between previously published work and our results, we compute the semantic complexity of public SLU datasets. We also address whether the IC accuracy reported for the STI models on these datasets is expected in light of their complexity values. We selected three public SLU datasets for this stage: Fluent Speech Commands (FSC) [lugosch2019speech], Picovoice (Pico.) [picovoice] and Snips Smart Lights (Snips) [saade2018spoken]. In the FSC and Snips datasets, the semantic labels correspond to intents, whereas in the Pico. dataset the semantic labels were taken to be the value of the ‘coffeeDrink’ slot, since all the examples had the same intent label (‘orderDrink’). Furthermore, no text transcriptions are available for the Pico. dataset, so we used a speech-to-text engine (https://aws.amazon.com/transcribe) to generate them. Table 1 shows the complexity values for these datasets. We computed three n-gram entropy values: unigram, bigram and trigram entropy. We do not train a new STI model on these datasets and instead note the IC accuracy reported in their original publication: FSC () and Picovoice (). IC accuracy with an STI model is not available for Snips; the IC accuracy (close field: 91.72% and far field: 83.56%) reported in [saade2018spoken] was obtained with a traditional, modularized SLU design. The trend of the average lexical and geometric measures across these datasets suggests the following ascending order of semantic complexity: (1) FSC, (2) Pico. and (3) Snips. This ordering is consistent with their respective model performances, reflected in the near-perfect IC accuracy numbers for FSC () and Picovoice () when compared to Snips (best non-STI performance 91.72%). We use these numbers as a basis of comparison for our next stage of experiments.
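Table 1: Semantic complexity values of the public SLU datasets.
| Category | Measure | FSC | Pico. | Snips |
| --- | --- | --- | --- | --- |
| Lexical | Vocabulary | 124 | 163 | 445 |
| Lexical | Unique transcripts | 248 | 592 | 1639 |
| Lexical | Entropy, unigram | 5.5 | 5.5 | 6.2 |
| Lexical | Entropy, bigram | 7.2 | 7.3 | 9.1 |
| Lexical | Entropy, trigram | 7.9 | 8.8 | 10.9 |
| Lexical | Entropy, average | 6.9 | 7.2 | 8.7 |
| Geometric | MST, LDA | 0.3 | 0.8 | 0.8 |
| Geometric | MST, USE | 0.0 | 0.5 | 0.3 |
| Geometric | MST, average | 0.2 | 0.6 | 0.6 |
| Geometric | ARI, LDA | 0.02 | 0.1 | 0.4 |
| Geometric | ARI, USE | 0.04 | 0.2 | 0.2 |
| Geometric | ARI, average | 0.03 | 0.1 | 0.3 |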
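The lexical rows of Table 1 can be computed along the following lines. This is a minimal sketch under stated assumptions rather than the exact procedure behind the reported values: tokens are obtained by lowercasing and splitting on whitespace, n-grams do not cross transcript boundaries, no sentence-boundary padding is used, and entropy is taken in base 2 over all n-gram occurrences. The function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_entropy(transcripts, n):
    """Base-2 Shannon entropy of the empirical n-gram distribution."""
    counts = Counter()
    for text in transcripts:
        counts.update(ngrams(text.lower().split(), n))  # assumed tokenization
    total = sum(counts.values())
    if total == 0:
        return 0.0  # no n-grams of this order exist in the data
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


def lexical_measures(transcripts):
    """Vocabulary size, unique transcripts, and 1- to 3-gram entropies."""
    vocabulary = {word for text in transcripts for word in text.lower().split()}
    entropies = {n: ngram_entropy(transcripts, n) for n in (1, 2, 3)}
    return {
        "vocabulary": len(vocabulary),
        "unique_transcripts": len(set(transcripts)),
        "unigram_entropy": entropies[1],
        "bigram_entropy": entropies[2],
        "trigram_entropy": entropies[3],
        "average_entropy": sum(entropies.values()) / len(entropies),
    }


if __name__ == "__main__":
    data = ["turn on the lights", "turn off the lights", "turn on the lights"]
    for name, value in lexical_measures(data).items():
        print(name, round(value, 3))
```

Whether repeated transcripts are counted once or once per utterance changes the n-gram probabilities, so that choice should be reported alongside the entropy values.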
3.2 Model Performance Versus Semantic Complexity
For our second research question, we want to determine if there is a correlation
between the semantic complexity of datasets and the performance of STI models
trained on these datasets.
In order to do this, we create a large, proprietary SLU dataset and apply a data
filtration scheme to generate a
sequence of SLU datasets of decreasing semantic complexity. We study the performance of two published STI models [lugosch2019speech, haghani2018audio] on these filtered datasets to analyze how strongly the model performance relates to the semantic complexity.
Dataset Filtration: We created a proprietary dataset by
extracting and annotating a window of production traffic of a commercial voice
assistant system. It contains a total of 1.6 million utterances with
unique transcriptions. From this original dataset, we created a sequence of
sub-datasets that have decreasing semantic complexity. The algorithm that
removes examples from a dataset to reduce the semantic complexity depends on the
particular complexity measure. For example, we reduce the n-gram entropy of a
dataset by discarding examples whose transcriptions contain an n-gram from the
least frequent set of n-grams in the dataset. We found that filtering the
dataset with this method not only decreased the n-gram entropy but
also decreased the other complexity measures. Therefore, we performed our experiments on the filtered datasets obtained using this filtering scheme.
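One step of the n-gram-based filtration described above might look like the sketch below. It is an illustration under assumptions, not the exact scheme used to build the filtered datasets: the size of the "least frequent set" is not specified in the text, so the `drop_fraction` parameter (the fraction of distinct n-grams treated as rare) is hypothetical, as are the function names and the whitespace tokenization.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def filter_rare_ngrams(transcripts, n=1, drop_fraction=0.1):
    """Discard transcripts that contain any of the least frequent n-grams.

    drop_fraction is a hypothetical knob: the share of distinct n-grams,
    taken from the rare end of the frequency ranking, that trigger removal.
    """
    counts = Counter(g for t in transcripts for g in ngrams(t.split(), n))
    # Rarest first; ties are broken lexically so the result is deterministic.
    ranked = sorted(counts, key=lambda g: (counts[g], g))
    rare = set(ranked[:int(len(ranked) * drop_fraction)])
    # Transcripts shorter than n tokens have no n-grams and are always kept.
    return [t for t in transcripts if not rare.intersection(ngrams(t.split(), n))]


if __name__ == "__main__":
    data = ["play music"] * 5 + ["stop music"] * 3 + ["play jazz"]
    step1 = filter_rare_ngrams(data, n=1, drop_fraction=0.25)
    step2 = filter_rare_ngrams(data, n=1, drop_fraction=0.5)
    print(len(data), len(step1), len(step2))
```

Applying the filter repeatedly, or with an increasing `drop_fraction`, yields a sequence of progressively smaller datasets whose complexity measures can then be recomputed at each step.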
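Table 2: Relative intent classification (IC) accuracy on four filtered datasets, listed in order of decreasing semantic complexity. Accuracy is relative to the dataset in the last row.
| Lexical: avg. n-gram entropy | Lexical: vocabulary | Lexical: unique transcripts | Geometric: avg. MST | Model in [lugosch2019speech] (pretrained), no unfreezing | Model in [lugosch2019speech] (pretrained), unfreezing word layer | Model in [lugosch2019speech] (pretrained), unfreezing all layers | Model in [haghani2018audio] |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 11.6 | 6744 | 211585 | 0.52 | 67.5% | 87.4% | 94.0% | 63.8% |
| 7.5 | 456 | 30504 | 0.39 | 72.1% | 91.4% | 96.4% | 89.0% |
| 5.8 | 70 | 907 | 0.29 | 80.9% | 94.8% | 98.3% | 94.9% |
| 3.1 | 10 | 16 | 0.15 | 100.0% | 100.0% | 100.0% | 100.0% |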
Model Performance: We experiment with two popular STI model architectures: a stacked neural model with pretrainable acoustic components from [lugosch2019speech] and a sequence-to-sequence multitask model proposed by [haghani2018audio]. The acoustic component of the first model [lugosch2019speech] consists of a ‘phoneme layer’ followed by a ‘word layer’, and is pretrained on the LibriSpeech corpus [panayotov2015librispeech]. This component can be fine-tuned following three unfreezing schedules: no unfreezing, unfreezing only the word layer, and unfreezing both the word and phoneme layers. The multitask model from [haghani2018audio] uses a shared encoder to make both transcript and semantic predictions. We use multiple models and training strategies in order to observe patterns that are independent of these choices. We train both architectures on our filtered datasets, following their proposed training strategies and experimental configurations. A grid search is performed over hyperparameter combinations, with early stopping on the validation set. For the stacked STI model in [lugosch2019speech], we use the pretrained acoustic component provided by the authors and fine-tune the model on our datasets. The multitask model from [haghani2018audio] was randomly initialized and trained from scratch. We choose the SLU task (IC) that is common to both of these models. Table 2 shows, for these models, the IC accuracy relative to that obtained on the filtered dataset in the last row of the table (across the different unfreezing schemes for [lugosch2019speech]). For the sake of brevity, we only report the results for 4 filtered datasets, and list the average n-gram entropy score (across unigram, bigram and trigram entropy) and the average MST score. However, all observations and trends in the table were also observed for the average ARI measure and across a total of 15 data filtration steps.
Regardless of the model architecture and unfreezing scheme, the IC accuracy of the models consistently increases as the semantic complexity of the dataset (both lexical and geometric) decreases. This trend holds for all the data filtration levels, even for the ones not shown in the table. Figure 2 depicts the relationship between relative IC accuracy and average entropy across all the filtered datasets. We notice that IC accuracy is correlated with average entropy, depicted by the following R^{2} values: model from [lugosch2019speech]: No Unfreezing: , Unfreezing Word Layer: , Unfreezing All Layers: ; model from [haghani2018audio]: . This answers our second research question: STI model performance increases as dataset semantic complexity decreases. This trend between the computed complexity value and the model performance supports the claim that our proposed complexity metrics capture the inherent semantic content of the dataset.
To determine whether the smaller sizes of the filtered datasets confounded the correlation between semantic complexity and IC accuracy, we performed a set of experiments in which we removed random subsets of the original dataset. In one experiment using the model [lugosch2019speech] with full unfreezing, removing a random sample of the dataset resulted in a relative decrease in average entropy and relative decrease in IC accuracy. This indicates that reducing the complexity of the dataset, and not just removing a random sample, accounts for the relationship depicted in Figure 2.
In line with the findings by the authors in [lugosch2019speech] on the FSC dataset, we observe that starting from a pretrained model leads to better overall performance than starting from a randomly initialized model (the multitask model [haghani2018audio] in this case). However, contrary to their results, we observe a significant jump in accuracy if the pretrained layers are fine-tuned. This difference is larger for datasets of higher complexity (first row  difference in accuracy) than for the ones with lower complexity (last row  difference in accuracy). We hypothesize that, for a dataset of low complexity, especially the public FSC dataset with its 248 unique transcriptions, the dense classification layer is enough to capture the dataset-specific patterns after the acoustic component is pretrained on an external dataset. However, this external knowledge alone is not enough to achieve comparable performance on a complex dataset, unless the model is fine-tuned to the distinct acoustic and linguistic patterns present in the dataset.
We found that the public datasets analyzed in Table 1 had, on average, lower semantic complexity than our initial dataset. In order to obtain average entropy values similar to the FSC, Pico, and Snips datasets, approximately , , and , respectively, of unique transcriptions had to be removed from our initial dataset. In our own experiments, the IC accuracies achieved on these three filtered datasets using the STI model from [lugosch2019speech] were comparable to the accuracies reported by their respective authors. Therefore, the three public SLU datasets that we analyzed fall on the lower end of the semantic complexity spectrum when compared to the data obtained from a commercial voice assistant. Moreover, our semantic complexity measures can indicate how well an STI architecture will perform on a given dataset. We expect that for broader use cases, such as recognizing a rich variety of utterance phrasings from multiple domains, the dataset semantic complexity will be similar to that of the most complex dataset that we consider, and the performance of previously published models will degrade. Therefore, when reporting STI model performance, it is important to quantify the complexity of the task. We have shown that this can be done using our proposed measures of semantic complexity.
4 Conclusion
We propose several measures that can be used to quantify the semantic complexity of the training dataset of an SLU system that employs an STI architecture. These measures signal the difficulty of the associated SLU task. We show that our complexity measures correlate well with STI model performance regardless of the model architecture: as the computed complexity value decreases, the classification accuracy increases. Our experiments reveal that the high classification accuracy () reported on public datasets for STI models is associated with a low complexity value for those datasets. Our current observations indicate that targeted use cases associated with datasets of low complexity are good candidates for STI architectures. However, as the use cases get more complex, the STI performance gains diminish. Therefore, it is important to contextualize any new STI model performance claims with the semantic complexity values of the underlying training dataset to understand the scope of its applicability. Notably, our complexity measures help us understand the performance of STI models even though they are computed from transcripts, which are not an input to the STI model. We expect that further analysis that takes acoustic features of the speech input into account may also help us understand the performance of STI models. In the future, we aim to use these computed semantic complexity measures and representations as an additional signal to STI models to test whether they improve SLU performance. It would also be interesting to combine different complexity measures into a more comprehensive metric that can generalize across all SLU tasks.