Spoken Language Understanding on the Edge

10/30/2018 ∙ by Alaa Saade, et al.

We consider the problem of performing Spoken Language Understanding (SLU) on small devices typical of IoT applications. Our contributions are twofold. First, we outline the design of an embedded, private-by-design SLU system and show that it has performance on par with cloud-based commercial solutions. Second, we release the datasets used in our experiments in the interest of reproducibility and in the hope that they can prove useful to the SLU community.




1 Introduction

Spoken Language Understanding (SLU) is the task of extracting meaning from a spoken utterance. Over the last few years, thanks in part to steady improvements brought by deep learning approaches to Automatic Speech Recognition (ASR) [1], voice interfaces implementing SLU have evolved from spotting a limited set of predetermined keywords to understanding arbitrary formulations of a given intention, and are becoming ubiquitous in connected devices. Most current solutions, however, offload their processing to the cloud, where computationally demanding engines can be deployed. As an example, the ASR engine achieving human parity in [1] is a combination of several neural networks, each containing several hundred million parameters, and large-vocabulary language models made of several million n-grams. The size of these models, along with the computational resources necessary to run them in real time, makes them unfit for deployment on small devices. Running SLU on the edge (i.e. embedding the engine directly on the device without resorting to the cloud) nevertheless offers several advantages. First, on-device processing removes the need to send speech or other personal data to third-party servers, therefore guaranteeing a high level of privacy. We show in section 3.2 how an embedded SLU model can be personalized on device using user data that stays private. Additional benefits include a reduction in latency and resilience to cloud outages [2]. In this paper, we describe the Snips Voice Platform, an SLU system that runs directly on device, therefore offering all the advantages of edge computing, and has performance on par with commercial, cloud-based solutions.

1.1 Outline and main results

A typical SLU system has three main components. First, an Acoustic Model (AM) maps a spoken utterance to a sequence of phones (units of speech). Second, a Language Model (LM) maps a sequence of phones to a likely sentence. Third, a Natural Language Understanding (NLU) engine extracts from the sentence the intent of the user (e.g. querying the weather forecast) and the slots qualifying her query (e.g. a city in the case of a weather forecast query). Our main contribution is to outline the design of an embedded SLU system that achieves performance on par with cloud-based solutions, and is efficient enough to run on IoT devices as small as the Raspberry Pi 3, with 1GB of RAM and a 1.4GHz CPU. This is achieved by optimizing a trade-off between accuracy and computational efficiency when training the AM, and by specializing the LM and NLU components in order both to reduce their size and to increase their in-domain accuracy. We are also publicly releasing the datasets used for the numerical evaluation of section 5 (https://research.snips.ai/datasets/spoken-language-understanding), in the hope that they can be useful to the SLU community. The NLU component of the Snips Voice Platform is already open source [3], while the other components will be open-sourced in the future. Our SLU models can be trained through a web console, at no cost for non-commercial use.
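As a rough illustration of the three-stage pipeline described above, the sketch below chains an AM, an LM and an NLU engine behind a single function. All names (`SluResult`, `run_slu`, the stub components and the `weather_forecast` intent) are hypothetical and are not part of the Snips Voice Platform API; the stubs return canned values only to show how data flows from one component to the next.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SluResult:
    """Output of the pipeline: one intent plus the slots qualifying it."""
    intent: str
    slots: Dict[str, str] = field(default_factory=dict)


def run_slu(
    audio: bytes,
    acoustic_model: Callable[[bytes], List[str]],
    language_model: Callable[[List[str]], str],
    nlu: Callable[[str], SluResult],
) -> SluResult:
    """Chain the three components: audio -> phones -> sentence -> intent/slots."""
    phones = acoustic_model(audio)
    sentence = language_model(phones)
    return nlu(sentence)


# Hypothetical stubs standing in for trained models.
def stub_am(audio: bytes) -> List[str]:
    return ["w", "eh", "dh", "er"]  # canned phone sequence


def stub_lm(phones: List[str]) -> str:
    return "what is the weather in paris"  # canned decoded sentence


def stub_nlu(sentence: str) -> SluResult:
    city = sentence.rsplit(" ", 1)[-1]  # naive: the last word is the city
    return SluResult(intent="weather_forecast", slots={"city": city})


result = run_slu(b"", stub_am, stub_lm, stub_nlu)
print(result)
```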

Layer type        nn256   nn512   nn768
TDNN              256     512     768
TDNN              256     512     768
TDNN              256     512     768
LSTMP (rec: -3)
TDNN              256     512     768
TDNN              256     512     768
LSTMP (rec: -3)
Num. params       2.6M    8.7M    15.4M

Table 1: Network architecture. TDNN refers to a Time-Delay layer with Batch Normalization and ReLU activation. LSTMP means Long Short-Term Memory with Projection layer. Each layer depends on a context: the recurrent connections skip 3 frames in LSTMP layers, and the TDNN layers consider inputs from the various time steps indicated in the table. A projection layer size of N is denoted pN.

1.2 Relation to previous work

Recent interest in mobile speech recognition has led to new work on ASR model compression [4]. In that work, personal data is incorporated dynamically in the language model using a class-based model similar to the one we introduce in the following. The authors, however, do not evaluate their system on SLU metrics but rather on a large-vocabulary speech recognition task. In contrast, we introduce specialized models assessed through end-to-end SLU metrics, which are arguably a better proxy for user experience [5]. Another line of work focuses on embedded speech commands, leveraging small models that can understand a small range of predefined commands, usually limited to one or two words [6]. These approaches, however, cannot handle the variety of natural language interactions addressed in the following.

Model   dev-clean   dev-other   test-clean   test-other
nn256   7.3         19.2        7.6          19.6
nn512   6.4         17.1        6.6          17.6
nn768   6.4         16.8        6.6          17.5
KALDI   3.87        10.22       4.17         10.57

Table 2: Decoding accuracy of neural networks of different sizes (Word Error Rate, %), on the splits of the LibriSpeech dataset. KALDI denotes the performance of the state-of-the-art Kaldi recipe on this dataset.

2 Acoustic modeling

Our AM is trained so as to optimize a trade-off between accuracy and computational efficiency. We use training datasets consisting of a few thousand hours of audio data with corresponding transcripts. Noisy, far-field conditions with reverberation are simulated by augmenting the data with thousands of virtual rooms with random microphone and speaker locations. We train deep neural AMs using the Kaldi toolkit [7]. Our typical architectures have 7 layers and are trained with the lattice-free Maximum Mutual Information criterion [8], using natural gradient descent with a learning rate of 0.0005. By varying the number of neurons in each layer of the AM, we obtain models of different sizes with different computational requirements (see table 1). The AM is chosen to offer near state-of-the-art performance while running in real time, with acceptable memory requirements that depend on the target hardware. We assess the accuracy of the various architectures on a standard large-vocabulary speech recognition task with the LibriSpeech dataset, using the accompanying LM [9] (see table 2). In the following, we consider the nn256 model, which is close to nn512 in accuracy but six times smaller, and runs in real time on a Raspberry Pi 3. We show in the next section how to compensate for this loss in accuracy by specializing the subsequent components of the SLU pipeline.
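The accuracy figures used to compare the acoustic models are Word Error Rates. As a reminder of how this metric is defined, the sketch below computes the WER of a single utterance as the word-level edit distance (substitutions, insertions and deletions) divided by the number of reference words. This is a minimal illustration and not the scoring tool used in the paper; corpus-level WER is normally obtained by summing errors and reference lengths over all utterances rather than averaging per-utterance values.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return the WER (in %) of `hypothesis` against `reference`."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# One deleted word out of four reference words.
print(word_error_rate("turn on the light", "turn on light"))
```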

3 Language modeling

The mapping from the output of the acoustic model to likely word sequences is done via a Viterbi search in a weighted Finite State Transducer (wFST) [10], called the ASR decoding graph in the following. Formally, the decoding graph may be written as the composition of four wFSTs,

$$H \circ C \circ L \circ G \tag{1}$$

where $\circ$ denotes transducer composition, $H$ represents Hidden Markov Models (HMMs) modeling context-dependent phones, $C$ represents the context-dependency, $L$ is the lexicon and $G$ is the LM, typically a bigram or a trigram model represented as a wFST. The compositions are carried out right to left, with determinization and minimization operations [10] applied at each step to optimize decoding. We refer the interested reader to [10, 7] and references therein for background on wFSTs and their use in speech recognition. In the following, we focus on the construction of the G transducer, encoding the LM, from a domain-specific dataset.
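To make the notion of transducer composition more concrete, here is a toy sketch of the operation on two small weighted transducers. It assumes epsilon-free transducers and tropical-semiring weights (weights add along a path), and it omits the determinization and minimization steps mentioned above. The `Fst` class and `compose` function are hypothetical helpers written for this illustration; they are not the API of any wFST library.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Dict, Hashable, List, Set, Tuple

# An arc is (input label, output label, weight, next state).
Arc = Tuple[str, str, float, Hashable]


@dataclass
class Fst:
    start: Hashable
    finals: Set[Hashable]
    arcs: Dict[Hashable, List[Arc]] = field(default_factory=dict)


def compose(a: Fst, b: Fst) -> Fst:
    """Compose two epsilon-free transducers: outputs of `a` feed inputs of `b`."""
    start = (a.start, b.start)
    result = Fst(start=start, finals=set())
    queue = deque([start])
    seen = {start}
    while queue:
        state = queue.popleft()
        state_a, state_b = state
        if state_a in a.finals and state_b in b.finals:
            result.finals.add(state)
        outgoing: List[Arc] = []
        for in_a, out_a, w_a, next_a in a.arcs.get(state_a, []):
            for in_b, out_b, w_b, next_b in b.arcs.get(state_b, []):
                if out_a != in_b:
                    continue  # labels must match for the arcs to pair up
                nxt = (next_a, next_b)
                outgoing.append((in_a, out_b, w_a + w_b, nxt))
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        if outgoing:
            result.arcs[state] = outgoing
    return result


# Toy example: t1 rewrites "a" into "b", t2 rewrites "b" into "c".
t1 = Fst(start=0, finals={1}, arcs={0: [("a", "b", 0.5, 1)]})
t2 = Fst(start=0, finals={1}, arcs={0: [("b", "c", 0.25, 1), ("z", "c", 1.0, 1)]})
composed = compose(t1, t2)
print(composed.arcs)
```

Only state pairs reachable from the start pair are created, which is also what makes an on-demand variant of this operation possible.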

3.1 Language model adaptation

Our ASR engine is required to understand arbitrary formulations of a finite set of intents described in a dataset. Generalization to unseen queries is enabled by using a statistical n-gram LM [11], which allows parts of the training queries to be mixed to create new ones, and by using class-based language modeling [12], so that slot values can be swapped. More precisely, we start by building patterns abstracting the queries of the dataset by replacing all occurrences of each slot by a symbol. For example, the query “Play some music by (The Rolling Stones)[artist]” is abstracted to “Play some music by ARTIST”. An n-gram model is then trained on the resulting set of patterns, and converted to a wFST called $G_p$ [10]. Next, for each slot $s_i$, where $i \in [1, n]$ and $n$ is the number of slots, an acceptor $G_{s_i}$ is defined to encode the values the slot can take. $G_{s_i}$ can either encode an n-gram model trained on a gazetteer (i.e. a list of possible values), or a generative grammar exhaustively describing the construction of any slot value (e.g. for numbers or dates). Denoting wFST replacement as “Replace”, we have as in [13]

$$G = \mathrm{Replace}\left(G_p, \{G_{s_i},\ \forall i \in [1, n]\}\right) \tag{2}$$
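The pattern abstraction step can be sketched in a few lines. The code below assumes that training queries are annotated with the “(value)[slot]” syntax used in the example above, replaces each annotated span by an upper-case slot symbol, and collects the slot values into per-slot gazetteers. The function names are hypothetical, and training the n-gram models and building the wFSTs are left out.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple

# Matches "(slot value)[slot_name]" annotations.
SLOT_ANNOTATION = re.compile(r"\(([^)]*)\)\[(\w+)\]")


def abstract_query(annotated: str) -> Tuple[str, List[Tuple[str, str]]]:
    """Return the abstracted pattern and the (slot, value) pairs it contained."""
    found: List[Tuple[str, str]] = []

    def to_symbol(match: "re.Match[str]") -> str:
        value, slot = match.group(1), match.group(2)
        found.append((slot, value))
        return slot.upper()

    pattern = SLOT_ANNOTATION.sub(to_symbol, annotated)
    return pattern, found


def build_gazetteers(queries: List[str]) -> Dict[str, List[str]]:
    """Collect, for each slot, the list of values seen in the dataset."""
    gazetteers: Dict[str, List[str]] = defaultdict(list)
    for query in queries:
        _, found = abstract_query(query)
        for slot, value in found:
            if value not in gazetteers[slot]:
                gazetteers[slot].append(value)
    return dict(gazetteers)


pattern, slots = abstract_query("Play some music by (The Rolling Stones)[artist]")
print(pattern)  # Play some music by ARTIST
print(slots)    # [('artist', 'The Rolling Stones')]
```

The patterns would then be used to train the pattern n-gram model, and each gazetteer to train the model for the corresponding slot.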


The resulting SLU system is specialized, and restricted to a domain-specific vocabulary. Indeed, while a sufficient amount of domain-specific training data may guarantee sampling the important words that discriminate between different intents, it will in general prove unable to correctly sample filler words from general spoken language. In order to detect and remove out-of-vocabulary (OOV) words, we use an approach based on so-called confusion networks [14] to represent decoded words along with their posterior probability. We tag decoded words as unknown if their posterior probability is lower than some threshold.
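The thresholding rule can be illustrated as follows. The sketch assumes a confusion network represented as a list of bins, each bin holding alternative words with their posterior probabilities. The best word of each bin is kept, and it is replaced by an unknown-word tag when its posterior falls below the threshold. The data layout, the `<unk>` tag and the threshold value of 0.5 are assumptions made for this example, not values taken from the paper.

```python
from typing import List, Tuple

# One bin of the confusion network: competing words with their posteriors.
ConfusionBin = List[Tuple[str, float]]

UNKNOWN = "<unk>"  # hypothetical tag for words judged out of vocabulary


def tag_oov(network: List[ConfusionBin], threshold: float) -> List[str]:
    """Keep the best word of each bin, or UNKNOWN if it is not confident enough."""
    decoded = []
    for alternatives in network:
        word, posterior = max(alternatives, key=lambda pair: pair[1])
        decoded.append(word if posterior >= threshold else UNKNOWN)
    return decoded


network = [
    [("turn", 0.90), ("burn", 0.10)],
    [("on", 0.80), ("off", 0.20)],
    [("please", 0.35), ("peas", 0.33), ("keys", 0.32)],  # no confident candidate
    [("lights", 0.95), ("light", 0.05)],
]
print(tag_oov(network, threshold=0.5))
```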

3.2 Dynamic language model

                             Close field        Far field
Quantity                     Snips    Google    Snips    Google
Intent classification (F1)   0.92     0.89      0.84     0.86
Perfect parsing (fraction)   0.84     0.79      0.72     0.73

Table 3: End-to-end generalization performance on the “SmartLights” assistant: comparison with Google’s Dialogflow cloud service on a 5-fold cross-validation experiment. For each quantity (leftmost column), we give the F1-score in intent classification and the fraction of perfectly parsed utterances, i.e. those for which both intent and slots match the ground-truth supervision.

On small devices, computing the decoding graph (1) can result in a prohibitively large wFST for larger assistants. For this reason, we build a dynamic language model by precomputing the composition $H \circ C \circ L$ of the HMM, context-dependency and lexicon transducers on the one hand, and the LM transducer $G$ on the other hand, and composing them lazily [15]. The states and transitions of the decoding graph are thus computed on demand during inference, notably speeding up the building of the LM. Additionally, lazy composition allows the decoding graph to be broken into two pieces, with sizes typically much smaller than that of the equivalent, statically-composed $H \circ C \circ L \circ G$. When using a dynamic LM, a better composition algorithm must be used in order to keep the decoding fast enough. We use composition filters [15] such as look-ahead filters followed by label-reachability filters with weight and label pushing, which allow inaccessible and costly decoding hypotheses to be discarded early in the decoding. Crucially, we ensure that the lexicon satisfies the so-called C1P property (i.e. each symbol has a unique pronunciation [16]) by associating a unique symbol with each pair (word, pronunciation). Finally, the Replace operation of equation (2) is performed upon loading the model from disk. This further breaks the decoding graph into smaller distinct pieces: the $H \circ C \circ L$ transducer mapping the output of the acoustic model to words, the query language model $G_p$ built from the query patterns, and the slots’ language models $G_{s_i}$, one per slot. Breaking down the LM into smaller, separate parts makes it possible to update it efficiently. In particular, performing on-device injection of new values in the LM becomes straightforward, enabling users to customize their embedded SLU engine. For instance, if we consider an assistant dedicated to making phone calls (“call (Jane Doe)[contact]”), the user’s list of contacts could be added to the values of the slot “contact” without this sensitive data ever leaving the device.
To do so, the new words and their pronunciations are first added to the $H \circ C \circ L$ transducer, which contains the lexicon, using an embedded Grapheme-to-Phoneme (G2P) engine to compute the missing pronunciations. The new slot values are then added to the corresponding slot wFST by updating the counts of the n-grams. The time required for the complete slot value injection procedure ranges from a few seconds for small assistants to a few dozen seconds for larger assistants supporting a vocabulary comprising tens of thousands of words.
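The count-update part of slot value injection can be illustrated with a small bigram model. The sketch below is a simplification under stated assumptions: it keeps raw bigram counts for a single slot, uses unsmoothed maximum-likelihood estimates, and only reports which injected words are new (those for which a G2P engine would have to produce pronunciations). The class `SlotBigramCounts` is hypothetical; converting the counts into a wFST and updating the lexicon are not shown.

```python
from collections import Counter
from typing import Iterable, Set

BOS, EOS = "<s>", "</s>"  # sentence boundary markers


class SlotBigramCounts:
    """Bigram counts over the values of one slot (e.g. "contact")."""

    def __init__(self) -> None:
        self.history_counts: Counter = Counter()
        self.bigram_counts: Counter = Counter()
        self.vocabulary: Set[str] = set()

    def inject(self, values: Iterable[str]) -> Set[str]:
        """Add slot values and return the words not seen before."""
        new_words: Set[str] = set()
        for value in values:
            tokens = value.lower().split()
            new_words.update(t for t in tokens if t not in self.vocabulary)
            self.vocabulary.update(tokens)
            padded = [BOS] + tokens + [EOS]
            self.history_counts.update(padded[:-1])
            self.bigram_counts.update(zip(padded, padded[1:]))
        return new_words

    def probability(self, previous: str, word: str) -> float:
        """Unsmoothed estimate of P(word | previous)."""
        total = self.history_counts[previous]
        return self.bigram_counts[(previous, word)] / total if total else 0.0


contacts = SlotBigramCounts()
first = contacts.inject(["Jane Doe"])     # both words are new
second = contacts.inject(["Jane Smith"])  # only "smith" is new
print(first, second, contacts.probability("jane", "doe"))
```

Because only the counts of the affected slot change, the pattern model and the other slot models are left untouched, which is what keeps the update cheap.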

                     Close field                        Far field
Language  Provider   Tier 1  Tier 2  Tier 3  Average    Tier 1  Tier 2  Tier 3  Average
English   Snips      71.27   67.73   67.21   68.73      42.08   39.36   35.58   39.01
          Google     68.78   37.90   36.74   47.81      58.82   28.85   27.21   38.29
French    Snips      78.20   74.14   73.06   75.13      57.49   53.56   53.89   54.98
          Google     61.04   33.51   32.38   42.31      36.24   15.83   13.47   21.85

Table 4: Music assistants: percentage of perfectly parsed utterances of the form “I want to listen to #ARTIST”. The tiers are created using a ranking of 10k artists according to their stream counts on Spotify. Tier 1 corresponds to artists ranked between 1 and 1,000, tier 2 to artists ranked between 4,500 and 5,500, and tier 3 to artists ranked between 9,000 and 10,000. The Snips SLU system is trained on a complete music assistant handling several interactions with a smart speaker (see text). The results labeled “Google” correspond to replacing the Snips ASR component by Google’s Speech Recognition API.
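The tier definition of Table 4 amounts to a simple bucketing of artist ranks, sketched below. Treating the rank bounds as inclusive is an assumption, and the function name is hypothetical; artists whose rank falls between two tiers are not part of the evaluation.

```python
from typing import Optional

# Rank ranges taken from the caption of Table 4 (bounds assumed inclusive).
TIERS = {
    "tier 1": (1, 1_000),
    "tier 2": (4_500, 5_500),
    "tier 3": (9_000, 10_000),
}


def tier_of(rank: int) -> Optional[str]:
    """Return the tier of an artist rank, or None if it is in no tier."""
    for name, (low, high) in TIERS.items():
        if low <= rank <= high:
            return name
    return None


print(tier_of(5_000), tier_of(2_000))
```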

4 Natural language understanding

The NLU component performs intent classification followed by slot filling. Intent classification is implemented with a logistic regression trained on the queries from every intent. The slot-filling step consists of several linear-chain Conditional Random Fields (CRFs), each of them trained for a specific intent. Once the intent is extracted by the intent classifier, the corresponding slot filler is used to extract slots from the query. While CRFs are a standard approach for slot filling [18], we note that more computationally demanding approaches based on deep learning models have recently been proposed [19]. Our experiments showed that these approaches do not yield any significant gain in accuracy in the typical training size regime of custom voice assistants (a few hundred queries). Data sparsity is addressed by integrating features based on precomputed word clusters, obtained by clustering word embeddings computed on a large independent corpus, effectively reducing the vocabulary size from typically 50K words to a few hundred clusters. Finally, gazetteer features are used, based on parsers built from the slot values provided in the training data. Consistent with the n-gram slot models in the LM (see section 3.1), these parsers can match partial slot values. When injecting personal user data (see section 3.2), these gazetteer parsers are augmented accordingly to cover the new slot values. The NLU component has been benchmarked and shown to be competitive with various commercial solutions [20].
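To give an idea of what word-cluster and gazetteer features look like, the sketch below builds per-token feature dictionaries of the kind a linear-chain CRF could consume. It is a simplified illustration, not the feature set of the Snips NLU library: the feature names, the cluster identifiers and the word-level gazetteer lookup (a crude stand-in for parsers that match partial slot values) are all assumptions.

```python
from typing import Dict, Iterable, List, Set


def build_gazetteer(values: Iterable[str]) -> Set[str]:
    """Index slot values word by word, so that partial values still match."""
    return {word for value in values for word in value.lower().split()}


def token_features(
    tokens: List[str],
    word_clusters: Dict[str, str],
    gazetteers: Dict[str, Set[str]],
) -> List[Dict[str, object]]:
    """Return one feature dictionary per token."""
    features = []
    for i, token in enumerate(tokens):
        word = token.lower()
        token_feats: Dict[str, object] = {
            "word": word,
            # Cluster ids replace rare words by a much smaller set of symbols.
            "cluster": word_clusters.get(word, "UNK_CLUSTER"),
            "prev_word": tokens[i - 1].lower() if i > 0 else "<s>",
            "next_word": tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>",
        }
        for slot, words in gazetteers.items():
            token_feats["in_gazetteer_" + slot] = word in words
        features.append(token_feats)
    return features


clusters = {"play": "c12", "the": "c3"}  # hypothetical cluster ids
gazetteers = {"artist": build_gazetteer(["The Rolling Stones"])}
feats = token_features("play the rolling stones".split(), clusters, gazetteers)
print(feats[2])
```

Injecting new slot values then only requires extending the relevant gazetteer set; the trained model weights stay the same.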

5 Numerical Results

In this section, we present an end-to-end evaluation of both our SLU system and a cloud-based commercial solution, on two domains of increasing complexity posing different challenges. We start by detailing our data collection protocol. In the interest of reproducibility, the datasets used in the following are made publicly available (see section 1.1). The trained SLU models can be obtained through the Snips web console at no cost for non-commercial use. In our comparison with Google’s cloud services, we used the service’s built-in slots and features whenever possible in the interest of fairness. For all experiments, we use a fixed threshold for OOV detection; the pattern transducer $G_p$ is a bigram model, while the acceptors $G_{s_i}$ corresponding to the gazetteer-based slots are trigrams (see section 3.1 for definitions of these quantities).

5.1 Datasets and experimental setting

Our datasets contain up to a few thousand text queries with their supervision, i.e. intent and slots, collected using an in-house data generation pipeline described in [20]. We then crowdsource the recording of these sentences and collect one spoken utterance for each text query in the dataset. Far-field datasets are created by playing these utterances through a neutral speaker and recording them with a microphone array positioned at a distance of 2 meters. The aim of an SLU system is then, given one such spoken utterance, to predict the ground-truth intent (intent classification) and slots. We measure the performance of our SLU system in terms of F1-score on intent classification, and percentage of perfectly parsed utterances, i.e. utterances for which both intent and slots are recovered.
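The two metrics can be computed as in the sketch below, where each parse is an (intent, slots) pair. It reports one F1-score per intent; how per-intent scores are aggregated into a single figure is not specified here, so any averaging would be an additional assumption. The intent and slot names in the example are made up for illustration.

```python
from typing import Dict, List, Tuple

Parse = Tuple[str, Dict[str, str]]  # (intent, slots)


def intent_f1_scores(gold: List[Parse], predicted: List[Parse]) -> Dict[str, float]:
    """Return the intent classification F1-score of each intent."""
    pairs = [(g[0], p[0]) for g, p in zip(gold, predicted)]
    intents = {g for g, _ in pairs} | {p for _, p in pairs}
    scores = {}
    for intent in intents:
        tp = sum(1 for g, p in pairs if g == intent and p == intent)
        fp = sum(1 for g, p in pairs if g != intent and p == intent)
        fn = sum(1 for g, p in pairs if g == intent and p != intent)
        denominator = 2 * tp + fp + fn
        scores[intent] = 2 * tp / denominator if denominator else 0.0
    return scores


def perfect_parse_rate(gold: List[Parse], predicted: List[Parse]) -> float:
    """Fraction of utterances whose intent and slots are all recovered."""
    if not gold:
        raise ValueError("gold must not be empty")
    return sum(1 for g, p in zip(gold, predicted) if g == p) / len(gold)


gold = [
    ("turn_on", {"room": "kitchen"}),
    ("turn_off", {"room": "bedroom"}),
    ("set_color", {"color": "blue"}),
    ("turn_on", {}),
]
predicted = [
    ("turn_on", {"room": "kitchen"}),
    ("turn_on", {"room": "bedroom"}),  # wrong intent
    ("set_color", {"color": "red"}),   # right intent, wrong slot value
    ("turn_on", {}),
]
scores = intent_f1_scores(gold, predicted)
rate = perfect_parse_rate(gold, predicted)
print(scores, rate)
```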

We first consider a small assistant typical of smart home use cases, the “SmartLights” assistant, comprising intents for turning the light on or off, or changing its brightness or color. It has a vocabulary size of approximately 400 words, and depends on three slots (room, brightness and color). Table 3 shows that we reach an accuracy similar to a commercial, cloud-based SLU solution. Our SLU system for this assistant has a total size of 15.1MB and runs in real time on a Raspberry Pi 3. We then turn to a large and complex assistant that controls a smart speaker through playback control (volume control, track navigation, etc.), but also plays music from large libraries of artists, tracks and albums. In order to allow some variability in the pronunciation of artist names or musical works, we generate up to 3 pronunciations per word using a statistical G2P. In addition to the English version of the assistant, we also consider a French version, which presents the additional difficulty of handling the pronunciations of many English words in French. We compute cross-language pronunciations for these words by generating pronunciation variations using a statistical English G2P, and then mapping their phonemes to the closest ones in the French phonology. The vocabulary of the resulting English music assistant contains more than 65k words, corresponding to 178k pronunciations, while the French assistant has more than 70k words, with 390k pronunciations. These assistants are the largest we consider, with a total size on disk of 80MB for the English version, and 112MB for the French version. They run in real time on a Raspberry Pi 3. We test these assistants on utterances of the form “play some music by #ARTIST”, where we sample “#ARTIST” from a publicly available list of the most streamed artists on Spotify. This experiment is representative of the difficulty of this SLU task, and additionally makes it possible to estimate the performance of ASR systems as a function of the popularity of artists. To this end, we consider two sets of experiments. In the first, we perform inference using a full Snips SLU engine and compute the fraction of correctly parsed utterances. In the second, we replace Snips ASR by Google’s Speech Recognition API. We find (see table 4) that the performance of cloud-based, general-purpose solutions such as Google’s ASR decays rapidly with the ranking of the artist. By contrast, our class-based approach outlined in section 3.1 assigns similar weights to all artists, resulting in more robust performance even for less popular artists. Additionally, in practice, our SLU system can incorporate user-specific tastes through value injection (see section 3.2), e.g. by connecting privately to a user’s favorite streaming service.

6 Conclusion

SLU on the edge can achieve the accuracy of cloud-based solutions without compromising on user privacy. Future work includes further reducing the size of our models to run them on even smaller devices, and leveraging private federated learning to improve performance using in-domain data.