Spoken Language Understanding (SLU) is the task of extracting meaning from a spoken utterance. Over the last few years, thanks in part to steady improvements brought by deep learning approaches to Automatic Speech Recognition (ASR), voice interfaces implementing SLU have evolved from spotting a limited set of predetermined keywords to understanding arbitrary formulations of a given intent, and are becoming ubiquitous in connected devices. Most current solutions, however, offload their processing to the cloud, where computationally demanding engines can be deployed. As an example, the ASR engine achieving human parity in conversational speech recognition combines several neural networks, each containing several hundred million parameters, with large-vocabulary language models made of several million n-grams. The size of these models, along with the computational resources needed to run them in real time, makes them unfit for deployment on small devices. Running SLU on the edge (i.e. embedding the engine directly on the device without resorting to the cloud) however offers several advantages. First, on-device processing removes the need to send speech or other personal data to third-party servers, guaranteeing a high level of privacy. We show in section 3.2 how an embedded SLU model can be personalized on device using user data that stays private. Additional benefits include reduced latency and resilience to cloud outages. In this paper, we describe the Snips Voice Platform, an SLU system that runs directly on device, thus offering all the advantages of edge computing, with performance on par with commercial, cloud-based solutions.
1.1 Outline and main results
A typical SLU system has three main components. First, an Acoustic Model (AM) maps a spoken utterance to a sequence of phones (units of speech). Second, a Language Model (LM) maps a sequence of phones to a likely sentence. Third, a Natural Language Understanding (NLU) engine extracts from the sentence the intent of the user (e.g. querying the weather forecast) and the slots qualifying her query (e.g. a city in the case of a weather forecast query). Our main contribution is to outline the design of an embedded SLU system that achieves performance on par with cloud-based solutions, and is efficient enough to run on IoT devices as small as the Raspberry Pi 3, with 1GB of RAM and a 1.4GHz CPU. This is achieved by optimizing a trade-off between accuracy and computational efficiency when training the AM, and by specializing the LM and NLU components both to reduce their size and to increase their in-domain accuracy. We are also publicly releasing the datasets used for the numerical evaluation of section 5 (https://research.snips.ai/datasets/spoken-language-understanding), in the hope that they will be useful to the SLU community. The NLU component of the Snips Voice Platform is already open source, while the other components will be open-sourced in the future. Our SLU models can be trained through a web console, at no cost for non-commercial use.
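The three-stage pipeline described above can be sketched as follows. This is an illustrative skeleton, not the actual Snips API: the component functions are stubs and all names are ours.

```python
# Minimal sketch of the AM -> LM -> NLU pipeline. Each stage is a stub
# standing in for the real component described in the text.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SluResult:
    intent: str
    slots: Dict[str, str] = field(default_factory=dict)


def acoustic_model(audio: bytes) -> List[str]:
    """Map audio to a phone sequence (stubbed)."""
    return ["w", "eh", "dh", "er"]


def language_model(phones: List[str]) -> str:
    """Map phones to the most likely sentence (stubbed)."""
    return "what is the weather in paris"


def nlu(sentence: str) -> SluResult:
    """Extract intent and slots from the sentence (stubbed)."""
    if "weather" in sentence:
        slots = {"city": "paris"} if "paris" in sentence else {}
        return SluResult("GetWeather", slots)
    return SluResult("Unknown")


def slu_pipeline(audio: bytes) -> SluResult:
    # The real system chains the same three stages on device.
    return nlu(language_model(acoustic_model(audio)))
```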
Network architecture. TDNN refers to a Time-Delay layer with Batch Normalization and ReLU activation. LSTMP means Long Short-Term Memory with Projection layer. Each layer depends on a context: the recurrent connections skip 3 frames in LSTMP layers, and the TDNN layers consider inputs from the various time steps indicated in the table. A projection layer of size N is denoted pN.
1.2 Relation to previous work
Recent interest in mobile speech recognition has led to new work on ASR model compression. In that work, personal data is incorporated dynamically into the language model using a class-based model similar to the one we introduce in the following. The authors, however, do not evaluate their system in terms of SLU performance but rather on a large-vocabulary speech recognition task. We, on the other hand, introduce specialized models assessed through end-to-end SLU metrics, which are arguably a better proxy for user experience. Another line of work is interested in embedded speech commands, leveraging small models that can understand a small range of predefined commands, usually limited to one or two words. These approaches, however, cannot handle the variety of natural language interactions addressed in the following.
2 Acoustic modeling
Our AM is trained to optimize a trade-off between accuracy and computational cost. We use training datasets consisting of a few thousand hours of audio data with corresponding transcripts. Noisy, far-field conditions with reverberation are simulated by augmenting the data with thousands of virtual rooms with random microphone and speaker locations. We train deep neural AMs using the Kaldi toolkit. Our typical architectures combine time-delay (TDNN) and LSTMP layers and are trained with the lattice-free Maximum Mutual Information criterion
, using natural gradient descent with a learning rate of 0.0005. By varying the number of neurons in each layer of the AM, we obtain models of different sizes with different computational requirements (see table 1). The AM is chosen to offer near state-of-the-art accuracy, while running in real time with memory requirements acceptable for the target hardware. We assess the accuracy of the various architectures on a standard large-vocabulary speech recognition task on the LibriSpeech dataset, using the accompanying LM (see table 2). In the following, we consider the nn256 model, which is close to nn512 in accuracy but six times smaller, and runs in real time on a Raspberry Pi 3. We show in the next section how to compensate for this loss in accuracy by specializing the subsequent components of the SLU pipeline.
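The far-field augmentation step described above can be sketched as follows. The room impulse response (RIR) generation itself (e.g. via an image-source simulator) is stubbed out; only the augmentation loop is shown, and all names are illustrative.

```python
import random

def convolve(signal, rir):
    """Naive convolution of a clean signal with a room impulse response."""
    out = [0.0] * (len(signal) + len(rir) - 1)
    for i, s in enumerate(signal):
        for j, r in enumerate(rir):
            out[i + j] += s * r
    return out

def augment(utterances, rir_bank, rng=random.Random(0)):
    """Simulate far-field conditions: convolve each utterance with a
    randomly drawn RIR from a bank of simulated virtual rooms."""
    return [convolve(u, rng.choice(rir_bank)) for u in utterances]
```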
3 Language modeling
The mapping from the output of the acoustic model to likely word sequences is done via a Viterbi search in a weighted Finite State Transducer (wFST), called the ASR decoding graph in the following. Formally, the decoding graph may be written as the composition of four wFSTs,

HCLG = H ∘ C ∘ L ∘ G,    (1)

where ∘ denotes transducer composition, H represents the Hidden Markov Models (HMMs) modeling context-dependent phones, C represents the context dependency, L is the lexicon, and G is the LM, typically a bigram or trigram model represented as a wFST. The compositions are carried out right to left, with determinization and minimization operations applied at each step to optimize decoding. We refer the interested reader to [10, 7] and references therein for background on wFSTs and their use in speech recognition. In the following, we focus on the construction of the G transducer, encoding the LM, from a domain-specific dataset.
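With determinization (det) and minimization (min) applied after each right-to-left composition, the construction is often written as the following cascade (a standard formulation from the wFST literature; the exact sequence of operations used in practice may include additional steps):

```latex
HCLG = \min(\det(H \circ \min(\det(C \circ \min(\det(L \circ G))))))
```

Optimizing at each intermediate step keeps the partial compositions small, so the final graph is far more compact than a naive four-way composition.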
3.1 Language model adaptation
Our ASR engine is required to understand arbitrary formulations of a finite set of intents described in a dataset. Generalization to unseen queries is enabled by using a statistical n-gram LM, which allows mixing parts of the training queries to create new ones, and by using class-based language modeling so that slot values can be swapped. More precisely, we start by building patterns abstracting the queries of the dataset, replacing all occurrences of each slot by a class symbol. For example, the query “Play some music by (The Rolling Stones)[artist]” is abstracted to “Play some music by ARTIST”. An n-gram model is then trained on the resulting set of patterns and converted to a wFST denoted G_pattern. Next, for each slot i, where 1 ≤ i ≤ n and n is the number of slots, an acceptor G_i is defined to encode the values the slot can take. G_i can either encode an n-gram model trained on a gazetteer (i.e. a list of possible values), or a generative grammar exhaustively describing the construction of any slot value (e.g. for numbers or dates). Denoting wFST replacement as “Replace”, we obtain

G = Replace(G_pattern, G_1, …, G_n).    (2)
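The query-abstraction step described above can be sketched in a few lines. The `(value)[slot]` annotation syntax matches the example in the text; the function name is ours.

```python
import re

def abstract_query(query: str) -> str:
    """Replace each '(value)[slot]' annotation with its upper-cased
    slot class symbol, producing a pattern for n-gram training."""
    return re.sub(r"\(([^)]*)\)\[(\w+)\]",
                  lambda m: m.group(2).upper(),
                  query)

pattern = abstract_query("Play some music by (The Rolling Stones)[artist]")
# pattern == "Play some music by ARTIST"
```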
The resulting SLU system is specialized and supported on a domain-specific vocabulary. Indeed, while a sufficient amount of specific training data may guarantee sampling the important words that discriminate between intents, it will in general fail to correctly sample filler words from general spoken language. In order to detect and remove out-of-vocabulary (OOV) words, we use an approach based on so-called confusion networks.
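One plausible reading of confusion-network-based OOV detection can be sketched as follows: represent the network as a sequence of bins of (word, posterior) candidates, and flag a bin as OOV when its in-vocabulary posterior mass falls below a threshold. The representation, function name, and threshold are our assumptions, not the exact algorithm used by the system.

```python
# Hedged sketch: flag low-confidence (likely OOV) regions of a
# confusion network. Each bin is a list of (word, posterior) pairs.

def flag_oov(bins, vocab, threshold=0.5):
    """Return one boolean per bin: True when the total posterior mass
    of in-vocabulary candidates is below the threshold."""
    flags = []
    for bin_ in bins:
        in_vocab_mass = sum(p for w, p in bin_ if w in vocab)
        flags.append(in_vocab_mass < threshold)
    return flags

cn = [[("play", 0.9), ("pray", 0.1)],
      [("xylo", 0.3), ("silo", 0.2), ("<eps>", 0.5)]]
vocab = {"play", "pray", "music"}
# flag_oov(cn, vocab) -> [False, True]
```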
3.2 Dynamic language model
| | Close field | | Far field | |
|---|---|---|---|---|
| Intent classification (F1) | 0.92 | 0.89 | 0.84 | 0.86 |
| Perfect parsing (fraction) | 0.84 | 0.79 | 0.72 | 0.73 |
On small devices, computing the decoding graph (1) can result in a prohibitively large wFST for larger assistants. For this reason, we build a dynamic language model by precomputing its components and composing them lazily. The states and transitions of the decoding graph are thus computed on demand during inference, notably speeding up the building of the LM. Additionally, lazy composition breaks the decoding graph into two pieces, with sizes typically much smaller than the equivalent, statically composed graph. When using a dynamic LM, a better composition algorithm must be used to keep decoding fast enough. We use composition filters such as look-ahead filters followed by label-reachability filters with weight and label pushing, which discard inaccessible and costly decoding hypotheses early in the decoding. Crucially, we ensure that the lexicon verifies the so-called C1P property (i.e. each symbol has a unique pronunciation) by associating a unique symbol with each (word, pronunciation) pair. Finally, the Replace operation of equation (2) is performed upon loading the model from disk. This further breaks the decoding graph into smaller distinct pieces: the transducer mapping the output of the acoustic model to words, the query language model, and the slots’ language models. Breaking the LM down into smaller, separate parts makes it possible to update it efficiently. In particular, on-device injection of new values into the LM becomes straightforward, enabling users to customize their embedded SLU engine. For instance, for an assistant dedicated to making phone calls (“call (Jane Doe)[contact]”), the user’s list of contacts can be added to the values of the slot “contact” without this sensitive data ever leaving the device. To do so, the new words and their pronunciations are first added to the lexicon transducer, using an embedded grapheme-to-phoneme (G2P) engine to compute the missing pronunciations.
The new slot values are then added to the corresponding slot wFST by updating the counts of its n-grams. The time required for the complete slot value injection procedure ranges from a few seconds for small assistants to a few dozen seconds for larger assistants whose vocabulary comprises tens of thousands of words.
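The injection flow just described can be sketched as follows. The data structures, function names, and the stubbed G2P are all illustrative, not the actual Snips internals.

```python
# Sketch of on-device slot value injection: new words receive
# pronunciations (G2P stubbed here), then the value is appended to the
# slot's value list, from which its n-gram counts would be updated.

def g2p(word: str) -> str:
    """Stub grapheme-to-phoneme engine: real engines return phone
    sequences; here we just space out the letters as a placeholder."""
    return " ".join(word)

def inject_slot_values(lexicon: dict, slot_values: dict, slot: str, values):
    for value in values:
        for word in value.lower().split():
            # Add missing pronunciations only; existing entries are kept.
            lexicon.setdefault(word, g2p(word))
        slot_values.setdefault(slot, []).append(value)

lexicon, slot_values = {}, {"contact": []}
inject_slot_values(lexicon, slot_values, "contact", ["Jane Doe"])
# slot_values["contact"] == ["Jane Doe"]; "jane" and "doe" now in lexicon
```

The key point, as in the text, is that the user's private data (here the contact list) only ever touches on-device structures.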
(Table 4 header: Language, Provider; results per Tier 1–3 and Average, for close-field and far-field conditions.)
4 Natural language understanding
The NLU component performs intent classification followed by slot filling. Intent classification is implemented with a logistic regression trained on the queries from every intent. The slot-filling step consists of several linear-chain Conditional Random Fields (CRFs)
, each trained for a specific intent. Once the intent is extracted by the intent classifier, the corresponding slot filler is used to extract slots from the query. While CRFs are a standard approach to slot filling, more computationally demanding approaches based on deep learning models have recently been proposed. Our experiments showed that these approaches do not yield any significant gain in accuracy in the typical training-size regime of custom voice assistants (a few hundred queries). Data sparsity is addressed by integrating features based on precomputed word clusters, obtained by clustering word embeddings computed on a large independent corpus, effectively reducing the vocabulary size from typically 50K words to a few hundred clusters. Finally, gazetteer features are used, based on parsers built from the slot values provided in the training data. Consistently with the n-gram slot models in the LM (see section 3.1), these parsers can match partial slot values. When injecting personal user data (see section 3.2), these gazetteer parsers are augmented accordingly to cover the new slot values. The NLU component has been benchmarked against various commercial solutions and proven competitive.
5 Numerical Results
In this section, we present an end-to-end evaluation of both our SLU system and a cloud-based commercial solution, on two domains of increasing complexity posing different challenges. We start by detailing our data collection protocol. In the interest of reproducibility, the datasets used in the following are made publicly available (see section 1.1). The trained SLU models can be obtained through the Snips web console at no cost for non-commercial use. In our comparison with Google’s cloud services, we used the service’s built-in slots and features whenever possible in the interest of fairness. For all experiments, we use a fixed threshold for OOV detection, the pattern transducer is a bigram model, and the acceptors corresponding to the gazetteer-based slots are trigram models (see section 3.1 for definitions of these quantities).
5.1 Datasets and experimental setting
Our datasets contain up to a few thousand text queries with their supervision, i.e. intent and slots, collected using an in-house data generation pipeline. We then crowdsource the recording of these sentences, collecting one spoken utterance for each text query in the dataset. Far-field datasets are created by playing these utterances through a neutral speaker and recording them with a microphone array positioned at a distance of 2 meters. The aim of an SLU system is then, given one such spoken utterance, to predict the ground-truth intent (intent classification) and slots. We measure the performance of our SLU system in terms of F1-score on intent classification, and the percentage of perfectly parsed utterances, i.e. utterances for which both intent and slots are recovered.
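The strictest of the two metrics, perfect parsing, can be sketched directly from its definition: an utterance counts only if both the intent and every slot match the ground truth exactly. The dictionary representation below is our assumption.

```python
# Sketch of the perfect-parsing metric: exact match on intent and slots.

def perfect_parsing_rate(predictions, references):
    """Fraction of utterances whose predicted intent and slots both
    match the reference exactly."""
    hits = sum(1 for p, r in zip(predictions, references)
               if p["intent"] == r["intent"] and p["slots"] == r["slots"])
    return hits / len(references)

preds = [{"intent": "GetWeather", "slots": {"city": "paris"}},
         {"intent": "GetWeather", "slots": {}}]
refs = [{"intent": "GetWeather", "slots": {"city": "paris"}},
        {"intent": "GetWeather", "slots": {"city": "lyon"}}]
# perfect_parsing_rate(preds, refs) -> 0.5 (second utterance misses a slot)
```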
We first consider a small assistant typical of smart home use cases, the “SmartLights” assistant, comprising intents for turning the lights on or off and changing their brightness or color. It has a vocabulary of approximately 400 words and depends on three slots (room, brightness, and color). Table 3 shows that we reach an accuracy similar to a commercial, cloud-based SLU solution. Our SLU system for this assistant has a total size of 15.1MB and runs in real time on a Raspberry Pi 3. We then turn to a large and complex assistant for controlling a smart speaker, covering playback control (volume control, track navigation, etc.) but also playing music from large libraries of artists, tracks, and albums. To allow for some variability in the pronunciation of artist names or musical works, we generate up to 3 pronunciations per word using a statistical G2P. In addition to the English version of the assistant, we also consider a French version, which presents the additional difficulty of handling the pronunciations of many English words in French. We compute cross-language pronunciations for these words by generating variations with a statistical English G2P, and then mapping their phonemes to the closest ones in the French phonology. The vocabulary of the resulting English music assistant contains more than 65k words, corresponding to 178k pronunciations, while the French assistant has more than 70k words, with 390k pronunciations. These assistants are the largest we consider, with a total size on disk of 80MB for the English version and 112MB for the French version. They run in real time on a Raspberry Pi 3. We test these assistants on utterances of the form “play some music by #ARTIST”, where “#ARTIST” is sampled from a publicly available list of the most streamed artists on Spotify. This experiment is representative of the difficulty of this SLU task, and additionally allows us to estimate the performance of ASR systems as a function of artist popularity. To this end, we consider two sets of experiments. In the first, we perform inference using the full Snips SLU engine and compute the fraction of correctly parsed utterances. In the second, we replace Snips ASR with Google’s Speech Recognition API. We find (see table 4) that the performance of cloud-based, general-purpose solutions such as Google’s ASR decays rapidly with the ranking of the artist. By contrast, our class-based approach outlined in section 3.1 assigns similar weights to all artists, resulting in more robust performance even for less popular artists. Additionally, in practice, our SLU system can incorporate user-specific tastes through value injection (see section 3.2), e.g. by connecting privately to a user’s favorite streaming service.
6 Conclusion

SLU on the edge can achieve the accuracy of cloud-based solutions without compromising user privacy. Future work includes further reducing the size of our models to run them on even smaller devices, and leveraging private federated learning to improve performance using in-domain data.
-  Wayne Xiong, Jasha Droppo, Xuedong Huang, Frank Seide, Mike Seltzer, Andreas Stolcke, Dong Yu, and Geoffrey Zweig, “Achieving human parity in conversational speech recognition,” arXiv preprint arXiv:1610.05256, 2016.
-  Mahadev Satyanarayanan, “The emergence of edge computing,” Computer, vol. 50, no. 1, pp. 30–39, 2017.
-  Snips Team, “Snips NLU, Snips Python library to extract meaning from text,” GitHub repository, https://github.com/snipsco/snips-nlu, 2018.
-  Ian McGraw, Rohit Prabhavalkar, Raziel Alvarez, Montse Gonzalez Arenas, Kanishka Rao, David Rybach, Ouais Alsharif, Hasim Sak, Alexander Gruenstein, Françoise Beaufays, et al., “Personalized speech recognition on mobile devices,” arXiv preprint arXiv:1603.03185, 2016.
-  Ye-Yi Wang, Alex Acero, and Ciprian Chelba, “Is word error rate a good indicator for spoken language understanding accuracy,” in Automatic Speech Recognition and Understanding, 2003. ASRU’03. 2003 IEEE Workshop on. IEEE, 2003, pp. 577–582.
-  Pete Warden, “Speech commands: A dataset for limited-vocabulary speech recognition,” arXiv preprint arXiv:1804.03209, 2018.
-  Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, et al., “The kaldi speech recognition toolkit,” in IEEE 2011 workshop on automatic speech recognition and understanding. IEEE Signal Processing Society, 2011, number EPFL-CONF-192584.
-  Daniel Povey, Vijayaditya Peddinti, Daniel Galvez, Pegah Ghahremani, Vimal Manohar, Xingyu Na, Yiming Wang, and Sanjeev Khudanpur, “Purely sequence-trained neural networks for asr based on lattice-free mmi.,” in Interspeech, 2016, pp. 2751–2755.
-  Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, “Librispeech: an asr corpus based on public domain audio books,” in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 5206–5210.
-  Mehryar Mohri, Fernando Pereira, and Michael Riley, “Weighted finite-state transducers in speech recognition,” Departmental Papers (CIS), p. 11, 2001.
-  Slava Katz, “Estimation of probabilities from sparse data for the language model component of a speech recognizer,” IEEE transactions on acoustics, speech, and signal processing, vol. 35, no. 3, pp. 400–401, 1987.
-  Peter F Brown, Peter V Desouza, Robert L Mercer, Vincent J Della Pietra, and Jenifer C Lai, “Class-based n-gram models of natural language,” Computational linguistics, vol. 18, no. 4, pp. 467–479, 1992.
-  Axel Horndasch, Caroline Kaufhold, and Elmar Nöth, “How to add word classes to the kaldi speech recognition toolkit,” in International Conference on Text, Speech, and Dialogue. Springer, 2016, pp. 486–494.
-  Haihua Xu, Daniel Povey, Lidia Mangu, and Jie Zhu, “Minimum bayes risk decoding and system combination based on a recursion for edit distance,” Computer Speech & Language, vol. 25, no. 4, pp. 802–828, 2011.
-  Cyril Allauzen, Michael Riley, and Johan Schalkwyk, “Filters for efficient composition of weighted finite-state transducers,” in International Conference on Implementation and Application of Automata. Springer, 2010, pp. 28–38.
-  Cyril Allauzen, Michael Riley, and Johan Schalkwyk, “A generalized composition algorithm for weighted finite-state transducers,” in Tenth Annual Conference of the International Speech Communication Association, 2009.
-  John Lafferty, Andrew McCallum, and Fernando CN Pereira, “Conditional random fields: Probabilistic models for segmenting and labeling sequence data,” in Proceedings of the 18th International Conference on Machine Learning (ICML), 2001.
-  Ye-Yi Wang and Alex Acero, “Discriminative models for spoken language understanding,” in Ninth International Conference on Spoken Language Processing, 2006.
-  Grégoire Mesnil, Yann Dauphin, Kaisheng Yao, Yoshua Bengio, Li Deng, Dilek Hakkani-Tur, Xiaodong He, Larry Heck, Gokhan Tur, Dong Yu, et al., “Using recurrent neural networks for slot filling in spoken language understanding,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 3, pp. 530–539, 2015.
-  Alice Coucke, Alaa Saade, Adrien Ball, Théodore Bluche, Alexandre Caulier, David Leroy, Clément Doumouro, Thibault Gisselbrecht, Francesco Caltagirone, Thibaut Lavril, et al., “Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces,” arXiv preprint arXiv:1805.10190, 2018.