Kurdish language processing requires sustained effort from interested researchers and scholars to overcome the large gap caused by resource scarcity. The areas that need attention and the efforts required are discussed in [Hassani2018].
Kurdish speech recognition is an area that has not been studied so far. We were not able to retrieve any resources on this subject from the literature.
In this paper, we present a dataset based on CMUSphinx [CMUSphinx2019] for Sorani Kurdish. We call it a Dataset for Sorani Kurdish Automatic Speech Recognition (BD-4SK-ASR). Although other technologies are emerging, CMUSphinx can still be used for experimental studies.
2 Related work
Work on Automatic Speech Recognition (ASR) has a long history, but we could not retrieve any literature on Kurdish ASR at the time of compiling this article. The literature on ASR for other languages, however, is extensive. Researchers have also widely used CMUSphinx for ASR, although other technologies have been emerging in recent years [CMUSphinx2019].
We decided to use CMUSphinx because we found it a suitable and well-established environment in which to start work on Kurdish ASR.
3 The BD-4SK-ASR Dataset
To develop the dataset, we extracted 200 sentences from the Sorani Kurdish primary-school books for grades one to three in the Kurdistan Region of Iraq. We then randomly created 2000 sentences from the extracted sentences.
In the following sections, we present the items available in the dataset. The dataset is available at https://github.com/KurdishBLARK/BD-4SK-ASR.
3.1 The Phoneset
The phoneset includes 34 phones for Sorani Kurdish. A sample of the file content is given below.
Figure 1 shows the Sorani letters in Persian-Arabic script, the suggested phoneme (capital English letters), and an example of the transformation of words in the developed corpus.
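The letter-to-phone transformation described above can be sketched as a simple table lookup. This is a minimal illustration, not the dataset's actual table: the eight pairs below are assumptions inferred from the sample transcriptions (for example, DAAYK and BAAXCHAKAY), and the function names are hypothetical. Letters without a one-to-one mapping, or vowels that are not written in the script, would need extra rules that this sketch does not cover.

```python
# Illustrative subset only. These pairs are ASSUMPTIONS inferred from the
# sample transcriptions, not the 34-phone table shipped with the dataset.
LETTER_TO_PHONE = {
    "\u0628": "B",   # ARABIC LETTER BEH
    "\u062f": "D",   # ARABIC LETTER DAL
    "\u0627": "AA",  # ARABIC LETTER ALEF
    "\u06cc": "Y",   # ARABIC LETTER FARSI YEH
    "\u06a9": "K",   # ARABIC LETTER KEHEH
    "\u062e": "X",   # ARABIC LETTER KHAH
    "\u0686": "CH",  # ARABIC LETTER TCHEH
    "\u06d5": "A",   # ARABIC LETTER AE
}


def to_phones(word):
    """Map each letter to its phone symbol; unknown letters raise KeyError."""
    return [LETTER_TO_PHONE[letter] for letter in word]


def to_corpus_form(word):
    """Concatenate the phone symbols into the capital-letter corpus form."""
    return "".join(to_phones(word))


if __name__ == "__main__":
    word = "\u062f\u0627\u06cc\u06a9"  # four letters: DAL, ALEF, FARSI YEH, KEHEH
    print(to_phones(word))
    print(to_corpus_form(word))
```

Raising an error on an unmapped letter, rather than skipping it, makes gaps in the table visible while the corpus is being converted.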
3.2 Filler phones
The filler phone file usually contains the fillers that occur in spoken sentences. In our basic sentences, we have considered only silence. Therefore, the file includes only three lines, which indicate the possible pauses at the beginning and end of a sentence and after each word.
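As an illustration, a three-line filler file of this kind can be generated as follows. The markers `<s>`, `</s>`, and `<sil>` and the phone name `SIL` follow the form commonly used in CMUSphinx training setups; that the dataset's file uses exactly these entries is an assumption, and the file name in the usage line is hypothetical.

```python
from pathlib import Path

# ASSUMPTION: sentence start, sentence end, and inter-word silence all map
# to one silence phone named SIL, as in common CMUSphinx setups.
FILLER_ENTRIES = [("<s>", "SIL"), ("</s>", "SIL"), ("<sil>", "SIL")]


def filler_lines():
    """Return the filler dictionary as 'marker phone' lines."""
    return [f"{marker} {phone}" for marker, phone in FILLER_ENTRIES]


def write_filler(path):
    """Write the filler lines to a file, one entry per line."""
    Path(path).write_text("\n".join(filler_lines()) + "\n", encoding="utf-8")


if __name__ == "__main__":
    write_filler("example.filler")  # hypothetical file name
    print("\n".join(filler_lines()))
```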
3.3 The File IDs
This file lists the files in which the narrated sentences have been recorded. The recordings are in WAV format; however, the extension is omitted in the file IDs. A sample of the file content is given below. The test directory is the directory in which the files are located.
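A file-ID list of this form could be derived from the recording paths roughly as follows. This is a sketch under assumptions: the `wav/test/...` layout and the function names are hypothetical, and only the rule stated above (the path without the `.wav` extension) is taken from the text.

```python
from pathlib import PurePosixPath


def to_file_id(wav_path, root):
    """Return the path relative to root with the .wav extension dropped."""
    relative = PurePosixPath(wav_path).relative_to(root)
    return str(relative.with_suffix(""))


def build_file_ids(paths, root):
    """Keep only .wav files and return their IDs in sorted order."""
    wav_paths = [p for p in paths if p.lower().endswith(".wav")]
    return sorted(to_file_id(p, root) for p in wav_paths)


if __name__ == "__main__":
    # Hypothetical directory layout, used only for illustration.
    paths = ["wav/test/T1-1-50-19.wav", "wav/test/T1-1-50-18.wav"]
    for file_id in build_file_ids(paths, "wav"):
        print(file_id)
```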
3.4 The Transcription
This file contains the transcription of each sentence based on the phoneset along with the file ID in which the equivalent narration has been saved. The following is a sample of the content of the file.
<s> BYR RRAAMAAN DAARISTAANA AMAANAY </s> (T1-1-50-18)
<s> DWWRA HAWLER CHIRAAYA SARDAAN NABWW </s> (T1-1-50-19)
<s> SAALL DYWAAR QWTAABXAANA NACHIN </s> (T1-1-50-20)
<s> XWENDIN ANDAAMAANY GASHA </s> (T1-1-50-21)
<s> NAMAAM WRYAA KIRD PSHWWDAA </s> (T1-1-50-22)
<s> DARCHWWY DAKAN DAKAWET </s> (T1-1-50-23)
<s> CHAND BIRAAT MAQAST </s> (T1-1-50-24)
<s> BAAXCHAKAY DAAYK DARCHWWY </s> (T1-1-50-25)
<s> RROZH JWAAN DAKAWET ZYAANYAAN </s> (T1-1-50-26)
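Lines in this format can be split into their words and file ID with a short parser. The sketch below assumes that every line follows the `<s> ... </s> (ID)` pattern shown in the sample; the function name is hypothetical.

```python
import re

# Pattern for: <s> WORD WORD ... </s> (FILE-ID)
LINE_RE = re.compile(r"^<s>\s+(.*?)\s+</s>\s+\((\S+)\)$")


def parse_transcription_line(line):
    """Return (list of words, file ID) for one transcription line."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unexpected transcription line: {line!r}")
    return match.group(1).split(), match.group(2)


if __name__ == "__main__":
    words, file_id = parse_transcription_line(
        "<s> CHAND BIRAAT MAQAST </s> (T1-1-50-24)"
    )
    print(file_id, words)
```

Parsing the file this way makes it easy to check that every file ID in the transcription also appears in the file-ID list.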
3.5 The Corpus
The corpus includes 2000 sentences. These sentences are random renderings of 200 sentences, which we have taken from the Sorani Kurdish primary-school books for grades one to three in the Kurdistan Region of Iraq. We took only 200 sentences in order to keep the dictionary small and to increase the repetition of each word in the narrated speech. We transformed the corpus sentences, which are in Persian-Arabic script, into the format that complies with the suggested phones for the related Sorani letters (see Section 3.4).
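The text does not specify how the random renderings were produced. One possible reading, sketched below purely as an assumption, is that new sentences are formed by sampling words from the vocabulary of the base sentences. The function name, the sentence-length bounds, and the placeholder input are all hypothetical.

```python
import random


def random_renderings(base_sentences, n, min_len=3, max_len=5, seed=0):
    """Build n word sequences by sampling from the base vocabulary.

    ASSUMPTION: this is one plausible procedure, not the authors' method.
    """
    rng = random.Random(seed)  # fixed seed keeps the output reproducible
    vocab = sorted({word for s in base_sentences for word in s.split()})
    renderings = []
    for _ in range(n):
        length = rng.randint(min_len, max_len)
        renderings.append(" ".join(rng.choice(vocab) for _ in range(length)))
    return renderings


if __name__ == "__main__":
    # Placeholder input, not actual base sentences from the dataset.
    base = ["DAAYK BAAXCHAKAY", "RROZH JWAAN DAKAWET"]
    for sentence in random_renderings(base, 5):
        print(sentence)
```

Sampling from a small, fixed vocabulary is consistent with the stated goal of repeating each word many times in the narrated speech.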
3.6 The Narration Files
Two thousand narration files were created. We used Audacity (https://www.audacityteam.org/) to record the narrations. We used an ordinary laptop in a quiet room and minimized the background noise; however, we could not avoid the noise of the laptop fan. A single speaker narrated the 2000 sentences, which took several days. We set Audacity to a sampling rate of 16 kHz, a 16-bit sample format, and a mono (single) channel. The noise reduction (dB) was set to 6, the sensitivity to 4.00, and the frequency smoothing to 0.
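The recording settings can be checked programmatically with Python's standard `wave` module. The sketch below assumes that the sampling rate of 16 means 16 kHz (16000 Hz) and that 16-bit means two bytes per sample; the function name is hypothetical. The demo builds a short silent recording in memory instead of reading a dataset file.

```python
import io
import wave


def wav_problems(source, rate=16000, sample_width=2, channels=1):
    """Return a list of mismatches between a WAV file and the expected format.

    source may be a file path or a binary file-like object.
    """
    problems = []
    with wave.open(source, "rb") as wav:
        if wav.getframerate() != rate:
            problems.append(f"sampling rate {wav.getframerate()} != {rate}")
        if wav.getsampwidth() != sample_width:
            problems.append(f"sample width {wav.getsampwidth()} != {sample_width}")
        if wav.getnchannels() != channels:
            problems.append(f"channels {wav.getnchannels()} != {channels}")
    return problems


def make_silent_wav(rate, sample_width, channels, frames=160):
    """Create an in-memory WAV of silence for demonstration purposes."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(rate)
        wav.writeframes(b"\x00" * (frames * sample_width * channels))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(wav_problems(make_silent_wav(16000, 2, 1)))  # matching format
    print(wav_problems(make_silent_wav(44100, 2, 2)))  # two mismatches
```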
3.7 The Language Model
We created the language model from the transcriptions. The model was created using CMUSphinx, with a (fixed) discount mass of 0.5 and backoffs computed using the ratio method. The model includes 283 unigrams, 5337 bigrams, and 6935 trigrams.
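The unigram, bigram, and trigram inventories of such a model come from counting n-grams over the transcriptions. The sketch below shows only that counting step, with sentence boundary markers added. It does not reproduce the discounting or backoff computation of the CMUSphinx tooling, and the function name is hypothetical.

```python
from collections import Counter


def ngram_counts(sentences, max_n=3):
    """Count n-grams of order 1..max_n over tokenized sentences.

    Each sentence is wrapped in <s> and </s> before counting.
    """
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for words in sentences:
        tokens = ["<s>"] + list(words) + ["</s>"]
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[n][tuple(tokens[i:i + n])] += 1
    return counts


if __name__ == "__main__":
    sentences = [
        "DARCHWWY DAKAN DAKAWET".split(),
        "BAAXCHAKAY DAAYK DARCHWWY".split(),
    ]
    counts = ngram_counts(sentences)
    for n in (1, 2, 3):
        print(n, len(counts[n]))  # number of distinct n-grams of order n
```

The number of distinct entries per order corresponds to the figures a language model file reports for its unigrams, bigrams, and trigrams.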
4 Conclusion
We presented a dataset, BD-4SK-ASR, that could be used in training and developing an acoustic model for Automatic Speech Recognition in the CMUSphinx environment for Sorani Kurdish. The Kurdish books of grades one to three of primary schools in the Kurdistan Region of Iraq were used to extract 200 sample sentences. The dataset includes the dictionary, the phoneset, the transcriptions of the corpus sentences using the suggested phones, the recorded narrations of the sentences, and the acoustic model. The dataset could be used to start experiments on Sorani Kurdish ASR.
As mentioned earlier, research and development on Kurdish ASR require a huge amount of effort. A variety of areas must be explored, and various resources must be collected and developed. The multi-dialect nature of Kurdish makes these tasks rather demanding. To contribute to these efforts, we are interested in expanding Kurdish ASR by developing a larger dataset based on larger Sorani corpora, working on the other Kurdish dialects, and using newer ASR environments such as Kaldi (https://kaldi-asr.org/).