Spoken language understanding (SLU) systems are fundamental building blocks when creating interactive technologies for new languages. A typical SLU system consists of an Automatic Speech Recognition (ASR) module followed by a Natural Language Understanding (NLU) module. The ASR module converts speech to textual transcriptions, and the NLU module performs downstream tasks such as intent recognition and slot filling on the transcripts obtained. However, building high-fidelity ASR systems requires a large amount of labelled data, which is usually not available for most languages. A language-specific ASR system thus forms a bottleneck for creating SLU systems for low-resourced languages. To circumvent this, phonetics-based SLU systems have been proposed, in which the need for language-specific ASR is bypassed, typically by using a universal phone decoder. This allows the creation of language- and task-specific, word-free NLU modules that perform intent recognition directly from phonetic transcriptions.
In this paper, we show that our proposed method, coupled with Allo embeddings, performs competitively with current state-of-the-art (SOTA) SLU systems on English, and we report new SOTA results on Sinhala and Tamil. We work with natural speech datasets in three languages: English, Sinhala, and Tamil. Our contributions are as follows: (i) we present a 1-D dilated CNN based method and show that, when combined with Allo embeddings, it outperforms previous approaches that employ phonetic transcriptions; (ii) we study the effect of three different embeddings (explained in Section 3) on the performance of the system, namely (a) Phone, (b) Panphone, and (c) Allo embeddings; and (iii) we study how performance with Allo embeddings scales with the number of training examples per intent.
2 Related Work
Intent recognition has traditionally been performed using textual transcripts generated by ASR systems. Since building ASR technologies is not viable for most languages, recent work has focused on creating such systems using alternate methods. In , the authors use spectral features of the input speech, such as MFCCs, for intent recognition. NLU modules have also been built for low-resourced languages using the outputs of an English ASR system, for example the softmax outputs of DeepSpeech . DeepSpeech is a character-level model, and the softmax outputs corresponding to its vocabulary were used as inputs to the intent classification model . Similarly, the softmax outputs of an English phoneme recognition system  have been used to build intent recognition systems for Sinhala and Tamil .
3 Proposed Method

In this section we define our proposed method and the different input embeddings used in all the experiments. We propose an end-to-end 1-D dilated convolutional neural network (CNN), as shown in Figure 1. The network consists of 4 CNN-BatchNorm-ReLU-Dropout layers. The CNN layers are dilated in linearly increasing order from the first to the fourth layer, i.e., from a dilation of 1 in the first layer to a dilation of 4 in the fourth layer. We apply dilation to increase the overall context. Furthermore, we pad the input to avoid any down-sampling in the time dimension. This setup is followed by an average pooling and a dropout layer, i.e., we map the variable number of input time steps to a fixed number of time steps. Lastly, we add a linear layer to obtain the output probability distribution over the intents.
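The architecture described above can be sketched in PyTorch. Layer width, kernel size, dropout rate, and the pooled length are illustrative assumptions, since the text does not fix them; only the 4-layer structure, the linear dilation schedule, length-preserving padding, and the pool-then-linear head come from the description.

```python
import torch
import torch.nn as nn

class DilatedCNNIntentClassifier(nn.Module):
    """4 x (Conv1d-BatchNorm-ReLU-Dropout) with dilations 1..4,
    'same' padding to keep the time dimension, average pooling to a
    fixed number of time steps, then a linear layer over the intents."""

    def __init__(self, embed_dim=640, hidden=128, num_intents=6,
                 kernel_size=3, pooled_steps=8, dropout=0.3):
        super().__init__()
        layers = []
        in_ch = embed_dim
        for dilation in (1, 2, 3, 4):  # linearly increasing dilation
            pad = (kernel_size - 1) * dilation // 2  # preserve length
            layers += [
                nn.Conv1d(in_ch, hidden, kernel_size,
                          padding=pad, dilation=dilation),
                nn.BatchNorm1d(hidden),
                nn.ReLU(),
                nn.Dropout(dropout),
            ]
            in_ch = hidden
        self.convs = nn.Sequential(*layers)
        # map variable-length input to a fixed number of time steps
        self.pool = nn.AdaptiveAvgPool1d(pooled_steps)
        self.drop = nn.Dropout(dropout)
        self.out = nn.Linear(hidden * pooled_steps, num_intents)

    def forward(self, x):  # x: (batch, embed_dim, time)
        h = self.drop(self.pool(self.convs(x)))
        return self.out(h.flatten(1))
```

Because the padding is `(kernel_size - 1) * dilation // 2` at every layer, the time dimension survives intact until the adaptive average pool collapses it to `pooled_steps`, so inputs of any length map to the same-sized feature vector.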
Let the raw audio signal be the input and the intent be the output. In the first step, we map the input to high-level features using the Allosaurus tool (https://github.com/xinjli/allosaurus). In particular, we extract the output sequence of phones and the last-layer weights corresponding to each sample. Given this information, we define three different embeddings for our proposed method.
Phone: Similar to previous work, an embedding layer is learnt during training that maps individual phones to 256-dimensional features.
Panphone: We map individual phone units to 26-dimensional articulatory feature vectors based on PanPhon .
Allo: To the best of our knowledge, this is the first work to use the pre-trained 640-dimensional last-layer weights of Allosaurus as embeddings for the intent classification task. We call these Allo embeddings.
From now on, we use the terms embedding and input interchangeably.
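The three variants can be illustrated with a toy lookup sketch. The phone inventory, the random placeholder tables, and the `embed` helper are all hypothetical; only the dimensionalities come from the text (256-d trainable Phone, fixed 26-d Panphone, fixed 640-d Allo).

```python
import numpy as np

rng = np.random.default_rng(0)
phone_inventory = ["a", "t", "k", "s", "i"]  # toy inventory (assumed)
idx = {p: i for i, p in enumerate(phone_inventory)}

# Phone: 256-d table, randomly initialised and *learnt* during training.
phone_table = rng.normal(size=(len(phone_inventory), 256))
# Panphone: fixed 26-d articulatory feature vector per phone.
panphone_table = rng.normal(size=(len(phone_inventory), 26))
# Allo: fixed 640-d vectors taken from Allosaurus's last layer.
allo_table = rng.normal(size=(len(phone_inventory), 640))

def embed(phones, table):
    """Map a decoded phone sequence to a (time, dim) feature matrix."""
    return table[[idx[p] for p in phones]]

seq = ["t", "a", "k", "i"]
assert embed(seq, phone_table).shape == (4, 256)
assert embed(seq, panphone_table).shape == (4, 26)
assert embed(seq, allo_table).shape == (4, 640)
```

In each case the model consumes a (time, dim) matrix; the only differences are the dimensionality and whether the table receives gradients during training.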
Table 1: Dataset statistics per language: number of utterances, number of speakers, and number of intents.
4 Datasets

In this study we experiment with three different languages, i.e., English, Sinhala, and Tamil, with varying training and test sizes. We treat the three languages as high-, medium-, and low-resource settings, respectively. The complete statistics are shown in Table 1. For English, we use the Fluent Speech Commands (FSC) dataset , the largest freely available dataset. It contains 248 unique sentences spoken by 97 speakers. The authors also divide the dataset into train, development, and test sets such that there is no overlap of speakers between train and test. Similar to , we use the 31-class intent classification formulation of this dataset.
The Sinhala and Tamil datasets are from the banking domain and were collected via crowd-sourcing. Both datasets have 6 intent classes. Similar to previous work , we evaluate our models using 5-fold cross-validation [8, 1], since no train, development, and test splits are provided by the authors.
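A minimal sketch of the fold construction used for such an evaluation, assuming a shuffled round-robin assignment (the paper does not specify how its folds were drawn):

```python
import random

def five_fold_splits(n_examples, seed=0):
    """Yield (train_idx, test_idx) pairs for 5-fold cross-validation."""
    idx = list(range(n_examples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::5] for i in range(5)]  # 5 near-equal folds
    for k in range(5):
        test = folds[k]
        train = [i for j, f in enumerate(folds) if j != k for i in f]
        yield train, test

# every example appears in exactly one test fold
splits = list(five_fold_splits(100))
assert sum(len(test) for _, test in splits) == 100
```

Each of the 5 rounds trains on 4 folds and tests on the held-out fifth, so every example is tested exactly once and the reported accuracy is averaged over the rounds.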
5 Experimental Setup
We train and evaluate our proposed model on three different languages of varying dataset sizes, as explained in Section 4. We fix the number of layers to 4 and experiment with 5 different configurations of kernel sizes and dilations, as shown in Table 2. Furthermore, we associate each experiment with its context size, since it is more intuitive to reason with context information; we therefore use context size, rather than kernel size and dilation, to differentiate the experiments. We experiment with the three different embeddings explained in Section 3. Note that of these three, only the Phone embedding is trainable.
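The context size of a configuration follows from the standard receptive-field formula for stacked stride-1 convolutions. The kernel sizes below are illustrative, since the values in Table 2 are not reproduced here; the dilation schedule 1..4 is the one described in Section 3.

```python
def context_size(kernel_sizes, dilations):
    """Receptive field (in time steps) of stacked stride-1 1-D
    convolutions: 1 + sum over layers of (kernel - 1) * dilation."""
    return 1 + sum((k - 1) * d for k, d in zip(kernel_sizes, dilations))

# e.g. kernel 3 in every layer with dilations 1..4 sees 21 time steps
assert context_size([3, 3, 3, 3], [1, 2, 3, 4]) == 21
```

Because dilation multiplies each layer's contribution, growing the dilation schedule widens the context far faster than growing the kernels alone would.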
In all experiments, we use L2 weight decay and dropout as regularization, and Adam as the optimizer with a learning rate of 0.0015. The evaluation metric we report is accuracy. All reported results are the average of 5 runs with different random seeds. Detailed training configurations and the code are available on GitHub (will be made available upon acceptance).
Table 2: Kernel sizes, dilation rates, and the resulting context size for each configuration.
6 Experimental Results
In this section we report results showing that a CNN is a better architecture choice than LSTMs and Transformers in low-resource settings. First, we compare our proposed method with previous work  using Phone embeddings. Secondly, we compare Phone, Panphone, and Allo embeddings with increasing context size. Lastly, we study the effect of the number of training examples when using these three embeddings.
Based on the results, we recommend using Allo embeddings for the intent classification task in almost all settings. We also observe that a larger context size is the better choice across all experiments.
6.1 Comparison With Previous Work Using Phone Embedding
Similar to , we train our proposed method on all three languages with the Phone embeddings. Our method outperforms the comparable LSTM training setup of  on all three languages. The accuracy gap widens as the dataset size decreases, as shown in Table 3. These results indicate that a CNN is the better choice for the intent classification task in low-resource settings. Using the Phone embeddings, we report a new SOTA accuracy for Tamil and near-SOTA accuracy for Sinhala .
6.2 Comparison Of Three Different Embeddings
Allo embeddings give the best accuracy on all three languages compared to Phone and Panphone. With Allo embeddings, our method achieves near-perfect accuracy on all three languages and SOTA on Sinhala and Tamil, as shown in Table 3 and Figure 2. It should be noted that, compared to some earlier works [4, 6], our proposed method works exceptionally well for the medium- and low-resource languages, i.e., Sinhala and Tamil.
Interestingly, we observe that for Tamil, the low-resource language in our setup, Allo embeddings do not provide significant accuracy gains compared to English and Sinhala. To test whether the cause of this behaviour is the small training dataset or the language itself, we sample a subset of Sinhala of similar size to Tamil and repeat the same experiments. As shown in Figure 3, we observe the same pattern across all three embeddings as we did for Tamil in Figure 2. Therefore, the training dataset size plays an important role when using Allo embeddings. In summary, Allo embeddings do not provide significant accuracy gains on small training datasets, but they remain the best choice compared to Phone and Panphone embeddings.
6.3 Allo Embedding Performance VS Number Of Training Examples
Given our previous observations on Tamil, we were interested in the correlation between the number of training examples and model accuracy. We therefore scale the training dataset size such that C × split is the number of training examples, where C is the number of intents and split is the number of training examples per intent. For example, Sinhala has C equal to 6, so a split value of 32 gives 192 training examples. We vary the split value from 32 to 512, so that the number of training examples ranges from 192 to 3072, as shown in Figure 3.
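The scaling rule above is simply a product, restated here as code with the Sinhala numbers from the text (6 intents, split values 32 to 512):

```python
def num_training_examples(num_intents, split):
    """Training-set size when each intent contributes `split` examples."""
    return num_intents * split

# Sinhala: 6 intents; split values 32..512 give 192..3072 examples
assert num_training_examples(6, 32) == 192
assert num_training_examples(6, 512) == 3072
```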
We experiment on Sinhala to test the trend of accuracy gains for the different embeddings as the number of training examples increases. As shown in Figure 4, the performance with Allo embeddings improves with the number of training examples and saturates after a certain point. Furthermore, with only 192 training examples (split = 32), Allo embeddings perform on par with Phone and Panphone embeddings, provided the model has a larger context size.
7 Conclusion and future work
In this work we proposed a new CNN-based method that achieves new SOTA accuracy in medium- and low-resource settings, i.e., for the Sinhala and Tamil languages, respectively. Furthermore, we propose using new Allo embeddings for the intent classification task, which achieve near-perfect accuracy across all three languages. We observe that Allo embeddings perform only on par with Phone and Panphone embeddings in very low-resource settings, i.e., at split = 32. Similar to , our proposed method can also be extended to the slot identification task. In future work, we would like to explore how to make Allo embeddings work better in very low-resource settings, i.e., work towards few-shot learning.
References

-  (2018) Domain specific intent classification of Sinhala speech data. In 2018 International Conference on Asian Language Processing (IALP), pp. 197–202.
-  (2021) Acoustics based intent recognition using discovered phonetic units for low resource languages. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7453–7457.
-  (2020) Mere account mein kitna balance hai? – on building voice enabled banking services for multilingual communities. arXiv preprint arXiv:2010.16411.
-  (2021) Intent recognition and unsupervised slot identification for low resourced spoken dialog systems. arXiv preprint arXiv:2104.01287.
-  (2014) Deep speech: scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567.
-  (2019) Sinhala and Tamil speech intent identification from English phoneme based ASR. In 2019 International Conference on Asian Language Processing (IALP), pp. 234–239.
-  (2019) Transfer learning based free-form speech command classification for low-resource languages. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, pp. 288–294.
-  (2020) Universal phone recognition with a multilingual allophone system. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8249–8253.
-  (2019) Speech model pre-training for end-to-end spoken language understanding. arXiv preprint arXiv:1904.03670.
-  (2014) PHOIBLE online.
-  (2016) PanPhon: a resource for mapping IPA segments to articulatory feature vectors. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, Osaka, Japan, pp. 3475–3484.
-  (2021) Speech-language pre-training for end-to-end spoken language understanding. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7458–7462.