
Intent Classification Using Pre-Trained Embeddings For Low Resource Languages

Building Spoken Language Understanding (SLU) systems that do not rely on language-specific Automatic Speech Recognition (ASR) is an important yet less explored problem in language processing. In this paper, we present a comparative study aimed at employing a pre-trained acoustic model to perform SLU in low resource scenarios. Specifically, we use three different embeddings extracted using Allosaurus, a pre-trained universal phone decoder: (1) Phone, (2) Panphone, and (3) Allo embeddings. These embeddings are then used in identifying the spoken intent. We perform experiments across three different languages: English, Sinhala, and Tamil, each with different data sizes to simulate high, medium, and low resource scenarios. Our system improves on the state-of-the-art (SOTA) intent classification accuracy by approximately 2.11% for Sinhala and 7.00% for Tamil. Furthermore, we present a quantitative analysis of how the performance scales with the number of training examples used per intent.



1 Introduction

Spoken language understanding (SLU) systems are fundamental building blocks when creating interactive technologies for new languages. A typical SLU system consists of an Automatic Speech Recognition (ASR) module followed by a Natural Language Understanding (NLU) module. ASR converts speech to textual transcriptions, and the NLU module performs downstream tasks such as intent recognition and slot filling on the transcripts obtained. However, building high-fidelity ASR systems requires a large amount of labelled data, which is usually not available for most languages. Language-specific ASR thus forms a bottleneck for creating SLU systems for low-resourced languages. To circumvent this, phonetics-based SLU systems have been proposed, in which the need for language-specific ASR is bypassed, typically by using a universal phone decoder. This allows the creation of language- and task-specific, word-free NLU modules that perform intent recognition directly from phonetic transcriptions.

In this paper, we show that our proposed method coupled with Allo embeddings performs competitively with current state-of-the-art SLU systems on English, and we report SOTA results on Sinhala and Tamil. We work with natural speech datasets in three languages: English, Sinhala, and Tamil. Our contributions are as follows: (i) we present a 1-D dilated CNN based method and show that, when combined with Allo embeddings, it outperforms previous approaches that employ phonetic transcriptions; (ii) we study the effects of the three different embeddings (explained in Section 3), namely (a) Phone, (b) Panphone, and (c) Allo embeddings, on the performance of the system; and (iii) we study how the performance with Allo embeddings scales with the number of training examples per intent.

2 Related Works

Intent recognition has traditionally been performed using textual transcripts generated by ASR systems. Since building ASR technologies is not viable for most languages, recent work has focused on creating such systems using alternate methods. In [1], the authors use spectral features of the input speech, such as MFCCs, for intent recognition. NLU modules have also been built for low-resourced languages using the outputs of an English ASR system, for example the softmax outputs of DeepSpeech [5], a character-level model, where the softmax outputs over the model vocabulary were used as inputs to the intent classification model [7]. Similarly, softmax outputs of an English phoneme recognition system [10] have also been used to build intent recognition systems for Sinhala and Tamil [6].

On the other hand, [3], [2], and [4] proposed building the NLU module using phones extracted from Allosaurus [9]. Allosaurus is a universal phone recognizer and is therefore language independent. A prototypical naive-Bayes intent classifier was built using Allosaurus phonetic transcriptions as inputs in [3]. [2] was the first extensive work on using phonetic transcriptions for intent classification on multiple low-resourced languages from two language families, Romance and Indic. [4] was the first study on building intent recognition systems for natural speech and achieved state-of-the-art results on Tamil. Yet their work was unable to achieve competitive results for languages like English or Sinhala with larger amounts of data. Building on the work of [4], we propose to use Allosaurus to extract a sequence of dense representations, instead of a sequence of discrete phones, from a given audio file, as explained in Section 3. Using this, we are able to achieve close to perfect performance on English and significantly push the SOTA on Sinhala and Tamil.

3 Methodology

Figure 1: Our proposed method with three different embeddings as input. A block with a black shadow indicates that its parameters are trainable.

In this section we define our proposed method and the different input embeddings used in all the experiments. We propose an end-to-end 1-D dilated convolutional neural network (CNN), as shown in Figure 1. The network consists of 4 CNN-BatchNorm-ReLU-Dropout layers. The dilation rate of the CNN layers increases linearly from the first to the fourth layer, i.e., from a dilation of 1 in the first layer to a dilation of 4 in the fourth layer. We apply dilation to increase the overall context. Furthermore, we pad the input to avoid any down-sampling in the time dimension. This setup is followed by an average pooling and a dropout layer, i.e., we map the variable number of input time steps to a fixed number of time steps. Lastly, we add a linear layer to obtain the output probability distribution over the number of intents.
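As a concrete illustration, the architecture described above can be sketched in plain numpy. The 4 layers, linearly increasing dilation rates, "same" padding, average pooling to a fixed number of time steps, and final linear layer follow the text; the hidden width, kernel size of 3, number of pooled steps, and random weights are illustrative assumptions, and BatchNorm and dropout are omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d_dilated(x, w, b, dilation):
    """'Same'-padded 1-D dilated convolution.
    x: (T, C_in), w: (k, C_in, C_out), b: (C_out,) -> (T, C_out)."""
    k = w.shape[0]
    pad = dilation * (k - 1) // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros((x.shape[0], w.shape[2]))
    for t in range(x.shape[0]):
        for j in range(k):
            out[t] += xp[t + j * dilation] @ w[j]
    return out + b

def forward(x, params, pooled_steps=4):
    # Four Conv-ReLU layers with dilation rates 1, 2, 3, 4
    # (BatchNorm and dropout omitted for brevity).
    for d, (w, b) in enumerate(params["convs"], start=1):
        x = np.maximum(conv1d_dilated(x, w, b, dilation=d), 0.0)
    # Average-pool the variable number of time steps down to a fixed count.
    pooled = np.concatenate(
        [c.mean(axis=0) for c in np.array_split(x, pooled_steps, axis=0)])
    # Final linear layer + softmax over the intents.
    logits = pooled @ params["w_out"] + params["b_out"]
    e = np.exp(logits - logits.max())
    return e / e.sum()

hidden, k, n_intents, pooled_steps = 64, 3, 6, 4
dims = [640] + [hidden] * 4  # Allo embeddings are 640-dimensional
params = {
    "convs": [(rng.normal(scale=0.1, size=(k, dims[i], dims[i + 1])),
               np.zeros(dims[i + 1])) for i in range(4)],
    "w_out": rng.normal(scale=0.1, size=(pooled_steps * hidden, n_intents)),
    "b_out": np.zeros(n_intents),
}
probs = forward(rng.normal(size=(50, 640)), params)  # 50 input time steps
print(probs.shape)
```

Because of the "same" padding, the output of each convolution keeps the input length, so the average pooling step alone controls the fixed output length.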

Let the raw audio signal be the input and the spoken intent be the output. In the first step, we map the input to high-level features using the Allosaurus tool [9]. In particular, we extract the output sequence of phones [11] and the last-layer weights corresponding to each sample. Given this information, we define three different embeddings for our proposed method.

  • Phone: Similar to previous work [4], an embedding layer is learnt during training that maps the individual phones to 256-dimensional features.

  • Panphone: We map the individual phone units to 26-dimensional features based on the work of [12].

  • Allo: To the best of our knowledge, this is the first work to use the pre-trained 640-dimensional last-layer weights of Allosaurus as embeddings for the intent classification task. We call these Allo embeddings.

From now on, we use the terms embedding and input interchangeably.
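A minimal sketch of how the three embedding types differ in shape, using random stand-ins for the learned Phone table, the PanPhon feature table, and the Allosaurus last-layer activations. The phone inventory and all values here are toy assumptions; only the dimensions 256, 26, and 640 come from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy phone inventory; the real one comes from the Allosaurus decoder.
phones = ["a", "i", "u", "k", "t", "n", "s"]
phone_to_id = {p: i for i, p in enumerate(phones)}

# Phone: a trainable 256-d lookup table (randomly initialised here).
phone_table = rng.normal(size=(len(phones), 256))

# Panphone: fixed 26-d articulatory feature vectors per phone (PanPhon);
# random values stand in for the real feature table.
panphone_table = rng.normal(size=(len(phones), 26))

def embed(sequence, table):
    """Map a phone sequence to a (T, dim) embedding matrix."""
    return table[[phone_to_id[p] for p in sequence]]

# Allo: the 640-d last-layer activations of Allosaurus, one vector per
# decoded phone; a random stand-in of matching shape.
def allo_embeddings(sequence):
    return rng.normal(size=(len(sequence), 640))

seq = ["k", "a", "t", "i"]
print(embed(seq, phone_table).shape)     # (4, 256)
print(embed(seq, panphone_table).shape)  # (4, 26)
print(allo_embeddings(seq).shape)        # (4, 640)
```

The key distinction is that only the Phone table is updated during training, while the Panphone and Allo representations are fixed, pre-computed features.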

4 Dataset

Figure 2: Comparing the accuracy of the different embeddings with increasing context size on English, Sinhala, and Tamil.
Language Number of Utterances Number of Speakers Number of Intents
English [10] 30,043 97 31
Sinhala [1] 7,624 215 6
Tamil [8] 400 40 6
Table 1: Dataset statistics for the English, Sinhala, and Tamil datasets.

In this study we experiment with three different languages, i.e., English, Sinhala, and Tamil, with varying training and test sizes. We classify the three languages as high, medium, and low resource settings respectively. The complete statistics are shown in Table 1. For English, we use the Fluent Speech Commands (FSC) dataset [10], the largest freely available such dataset. It contains 248 unique sentences spoken by 97 speakers. The authors also divide the dataset into train, development, and test sets such that there is no overlap of speakers between train and test. Similar to [4], we use the 31-class intent classification formulation of this dataset.

The Sinhala [1] and Tamil [8] datasets are from the banking domain and were collected via crowd-sourcing. Both datasets have 6 intent classes. Similar to previous work [4], we evaluate our models using 5-fold cross-validation [8, 1], since no train, development, and test splits are provided by the authors.
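The 5-fold evaluation protocol can be sketched as follows. Here `train_and_eval` stands in for any training routine, and the dummy majority-class classifier used in the usage example is purely illustrative:

```python
import numpy as np

def five_fold_accuracy(features, labels, train_and_eval, seed=0):
    """Shuffle, split into 5 folds, train on 4 and test on the held-out
    fold, and average the 5 test accuracies."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(labels))
    folds = np.array_split(idx, 5)
    accs = []
    for i in range(5):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(5) if j != i])
        accs.append(train_and_eval(features[train], labels[train],
                                   features[test], labels[test]))
    return float(np.mean(accs))

# Usage with a dummy classifier that predicts the majority training label:
X = np.zeros((100, 8))
y = np.array([0] * 60 + [1] * 40)

def majority(train_x, train_y, test_x, test_y):
    pred = np.bincount(train_y).argmax()
    return float((test_y == pred).mean())

print(five_fold_accuracy(X, y, majority))  # the majority-class rate, 0.6
```

With equal-sized folds, averaging per-fold accuracy of the majority classifier recovers exactly the majority-class rate, which makes this a convenient sanity check for the splitting logic.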

Figure 3: Results on Sinhala when the training size is similar to Tamil.

5 Experimental Setup

We train and evaluate our proposed model on three different languages of varying dataset sizes, as explained in Section 4. We fix the number of layers to 4 and experiment with 5 different configurations of kernel sizes and dilation rates, as shown in Table 2. Furthermore, we label each experiment with its context size, since it is more intuitive to reason with context information; we therefore use context size, instead of kernel size and dilation, to differentiate the experiments. We experiment with the three different embeddings explained in Section 3. Note that of these three, only the Phone embedding is trainable.

In all experiments, we use L2 weight decay and dropout as regularization, and Adam as the optimizer with a learning rate of 0.0015. The evaluation metric we report is accuracy. All reported results are the average of 5 runs with different random seeds. Detailed training configurations and the code will be made available on GitHub upon acceptance.

Configuration Kernel sizes Dilation rates Context size
C1 1-1-1-1 1-1-1-1 1
C2 3-3-3-3 1-1-1-1 9
C3 3-3-3-3 1-2-3-4 17
C4 3-5-7-9 1-1-1-1 21
C5 3-5-7-9 1-2-3-4 41
Table 2: The 5 training configurations. Hyphens separate the 4 CNN layers, so 3-5-7-9 means kernel sizes of 3, 5, 7, and 9 for layers 1 to 4 respectively. We also report the overall context size of each configuration for easier comparison between the experiments.

6 Experimental Results

In this section we report results showing why a CNN is a better architectural choice than LSTMs and Transformers in low resource settings. First, we compare our proposed method with previous work [4] using Phone embeddings. Secondly, we compare the Phone, Panphone, and Allo embeddings with increasing context size. Lastly, we study the effect of the number of training examples when using these three embeddings.

Based on the results, we recommend using Allo embeddings for the intent classification task in almost all settings. We also observe that a bigger context size is the better choice across all the experiments.

Language SOTA [4] Our Method
English 99.71% [13] 92.77% 92.99%
Sinhala 97.31% [6] 96.33% 97.05%
Tamil 81.7% [6] 91.50% 97.25%
Table 3: Results when using Phone embeddings. Our method's results are based on configuration C5 as shown in Table 2.

6.1 Comparison With Previous Work Using Phone Embedding

Similar to [4], we train our proposed method on all three languages with the Phone embeddings. Our proposed method performs better on all three languages compared to the similar training settings using an LSTM in [4]. The accuracy gap widens as the dataset size decreases, as shown in Table 3. These results indicate that a CNN is the better choice for the intent classification task in low resource settings. For Tamil we report a new SOTA accuracy, and for Sinhala we achieve near-SOTA accuracy when using the Phone embeddings.

Figure 4: Test accuracy across context sizes and embeddings as we increase the number of training examples per intent.

6.2 Comparison Of Three Different Embeddings

Allo embeddings give the best accuracy across all three languages compared to Phone and Panphone. With Allo embeddings, our method achieves near-perfect accuracy on all three languages and SOTA on Sinhala and Tamil, as shown in Table 4 and Figure 2. Notably, compared to earlier works [4, 6], our proposed method works exceptionally well for the medium and low resource languages, i.e., Sinhala and Tamil.

Interestingly, we observe that for Tamil, the low resource language in our current setup, Allo embeddings do not provide significant accuracy gains compared to English and Sinhala. We wanted to test whether the cause of this behaviour is the small training dataset or the language itself. Therefore, we sample data from Sinhala of a size similar to Tamil and repeat the same experiments. As shown in Figure 3, we observe the same pattern across all three embeddings as for Tamil in Figure 2. Therefore, the training dataset size plays an important role when using Allo embeddings: they do not provide significant accuracy gains on small training datasets, but they remain the best choice compared to Phone and Panphone embeddings.

6.3 Allo Embedding Performance VS Number Of Training Examples

Given our previous observations with Tamil, we were interested in the correlation between the number of training examples and model accuracy. Therefore, we scale the training dataset such that the number of training examples equals the number of intents multiplied by the value of split, i.e., split is the number of training examples per intent. For example, the Sinhala dataset has 6 intents, so a split of 32 gives us 192 training examples. We vary the value of split from 32 to 512, so that the number of training examples ranges from 192 to 3072, as shown in Figure 4.
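The scaling rule above amounts to: number of training examples = number of intents × split. A minimal check of the endpoints stated in the text (the intermediate split values are assumed to double; the text only states the 32 to 512 range):

```python
# Training-set size as a function of the split value (examples per intent).
def n_train(n_intents: int, split: int) -> int:
    return n_intents * split

# Sinhala has 6 intents; intermediate split values assumed to double.
for split in (32, 64, 128, 256, 512):
    print(split, n_train(6, split))  # 32 -> 192, ..., 512 -> 3072
```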

We experiment on Sinhala to examine the trend of accuracy gains for the different embeddings as the number of training examples increases. As shown in Figure 4, we observe that the performance of Allo embeddings is proportional to the number of training examples and saturates after a certain point. Furthermore, with only 192 training examples (split=32), Allo embeddings perform on par with Phone and Panphone embeddings, provided the model has a larger context size.

Language Phone Panphone Allo
English 92.99% 92.96% 99.08%
Sinhala 97.05% 97.36% 99.42%
Tamil 97.25% 97.75% 98.50%
Table 4: Comparison across the different embeddings for each language. Results are based on configuration C5 as shown in Table 2, i.e., the biggest context size.

7 Conclusion and future work

In this work we proposed a new CNN based method that achieves new SOTA accuracy in medium and low resource settings, i.e., for Sinhala and Tamil respectively. Furthermore, we proposed the use of new Allo embeddings for the intent classification task, which achieve near-perfect accuracy across all three languages. We observe that Allo embeddings perform on par with Phone and Panphone embeddings in very low resource settings, i.e., at split=32. Similar to [4], our proposed method can also be extended to the slot identification task. For future work, we would like to explore how to make Allo embeddings work better in very low resource settings, i.e., to work towards few-shot learning.


  • [1] D. Buddhika, R. Liyadipita, S. Nadeeshan, H. Witharana, S. Javasena, and U. Thayasivam (2018) Domain specific intent classification of sinhala speech data. In 2018 International Conference on Asian Language Processing (IALP), pp. 197–202. Cited by: §2, Table 1, §4.
  • [2] A. Gupta, X. Li, S. K. Rallabandi, and A. W. Black (2021) Acoustics based intent recognition using discovered phonetic units for low resource languages. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7453–7457. Cited by: §2.
  • [3] A. Gupta, S. K. Rallabandi, and A. W. Black (2020) Mere account mein kitna balance hai?–on building voice enabled banking services for multilingual communities. arXiv preprint arXiv:2010.16411. Cited by: §2.
  • [4] A. Gupta, S. K. Rallabandi, and A. W. Black (2021) Intent recognition and unsupervised slot identification for low resourced spoken dialog systems. arXiv preprint arXiv:2104.01287. Cited by: §2, 1st item, §4, §4, §6.1, §6.2, Table 3, §6, §7.
  • [5] A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, et al. (2014) Deep speech: scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567. Cited by: §2.
  • [6] Y. Karunanayake, U. Thayasivam, and S. Ranathunga (2019) Sinhala and tamil speech intent identification from english phoneme based asr. In 2019 International Conference on Asian Language Processing (IALP), pp. 234–239. Cited by: §2, §6.2, Table 3.
  • [7] Y. Karunanayake, U. Thayasivam, and S. Ranathunga (2019) Transfer learning based free-form speech command classification for low-resource languages. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, pp. 288–294. Cited by: §2.
  • [8] Y. Karunanayake, U. Thayasivam, and S. Ranathunga (2019) Transfer learning based free-form speech command classification for low-resource languages. In Proceedings of the 57th Annual Meeting of the ACL: Student Research Workshop, pp. 288–294. Cited by: Table 1, §4.
  • [9] X. Li, S. Dalmia, J. Li, M. Lee, P. Littell, J. Yao, A. Anastasopoulos, D. R. Mortensen, G. Neubig, A. W. Black, and F. Metze (2020) Universal phone recognition with a multilingual allophone system. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vol. , pp. 8249–8253. External Links: Document Cited by: §2, §3.
  • [10] L. Lugosch, M. Ravanelli, P. Ignoto, V. S. Tomar, and Y. Bengio (2019) Speech model pre-training for end-to-end spoken language understanding. arXiv preprint arXiv:1904.03670. Cited by: §2, Table 1, §4.
  • [11] S. Moran, D. McCloy, and R. Wright (2014) PHOIBLE online. Cited by: §3.
  • [12] D. R. Mortensen, P. Littell, A. Bharadwaj, K. Goyal, C. Dyer, and L. Levin (2016-12) PanPhon: a resource for mapping IPA segments to articulatory feature vectors. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, Osaka, Japan, pp. 3475–3484. External Links: Link Cited by: 2nd item.
  • [13] Y. Qian, X. Bian, Y. Shi, N. Kanda, L. Shen, Z. Xiao, and M. Zeng (2021) Speech-language pre-training for end-to-end spoken language understanding. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7458–7462. Cited by: Table 3.