ROBIN (http://aimas.cs.pub.ro/robin/en/) is a user-centred project aiming to develop software and services for human interaction with robots within a digital interconnected society. It focuses on several types of robots: assistive robots targeting users with special needs (people with medical problems or the elderly), robots for interaction with clients, and software robots that can be installed on vehicles for (semi)autonomous driving. One of the objectives of the ROBIN-Dialog component project (http://aimas.cs.pub.ro/robin/en/robin-dialog/) was the creation of the Romanian language resources and processing tools needed to make a robot able to communicate with users in tasks defined within several micro-worlds. One example of a micro-world is the interaction within the notebooks department of an electronics store. This micro-world is made up of the physical space occupied by the department, the notebooks sold by the store, the characteristics on the basis of which customers decide what to buy, the availability of the notebooks, the expected date on which they become available, the robot, and the customers who interact with the robot to find the notebook they want to purchase or the right configuration for their needs. Acting as a shop assistant, the robot must be aware of the products the department sells, their availability, their characteristics, and the usage scenarios they are suited for (e.g. notebooks for gaming, design, programming, etc.).
Earlier work within the project described the natural language processing pipeline being used, as well as the dialogue manager for micro-worlds. Furthermore, Avram et al. presented a low-latency automatic speech recognition (ASR) system developed and used within the ROBIN project. This paper introduces a new speech corpus recorded to improve the performance of the ASR system and, through it, of the entire pipeline. The paper is structured as follows: Section II presents related work, including other available Romanian speech resources; Section III describes the corpus acquisition process; Section IV contains relevant corpus statistics; and Section V considers the impact of the new corpus within the ROBIN project. We conclude the paper in Section VI.
II Related work
|Corpus||Speech Type||Domain||# Hours||# Utterances||# Speakers|
|RSS ||Spontaneous||Internet, TV||5.5||5.7k||3|
Compared to better-resourced languages, such as English, few speech resources are available for the Romanian language. The representative corpus of the contemporary Romanian language (CoRoLa) contains a spoken component that can be queried via the OCQP platform (http://corolaws.racai.ro/corola_sound_search/index.php). Currently, it contains professional recordings from various sources (radio stations, recording studios), broadcast news and extracts from Romanian Wikipedia read by non-professionals (recorded in non-professional environments). In the context of the ReTeRom project (https://www.racai.ro/p/reterom/), the CoBiLiRo platform was built to allow gathering of additional bimodal corpora, one of the final goals being to enrich the CoRoLa corpus.
The Read Speech Corpus (RSC) contains 100 hours collected from 164 native speakers, mainly students and faculty staff, with an average age of 24 years. The sentences were selected from novels, online news and from a list of words that covered all the possible syllables in Romanian.
The RoDigits corpus contains 37.5 hours of spoken connected digits from 154 speakers whose ages vary between 20 and 45. Each speaker recorded 100 clips of 12 randomly generated Romanian digits, and after the semi-automated validation, the final corpus contains 15,389 audio files.
SWARA is a corpus that comprises speech data collected from 17 speakers, manually segmented at the utterance level, resulting in a dataset of approximately 21 hours of transcribed speech, split into over 19,000 audio-text pairs.
The RO-GRID dataset was developed by reading sequences of six words chosen from a list of alternatives. The first three words were designated as “keywords” and the speaker had to utter all of their combinations, 400 in total. The last three words were designated as “fillers” and were randomly chosen while creating the sentence. The final corpus contained 6.6 hours of audio from 12 speakers.
The Romanian Speech Synthesis (RSS) corpus was designed for speech synthesis and contains 4 hours of speech from a single female speaker, recorded using multiple microphones. The speaker read 4,000 sentences chosen for diphone coverage, extracted from novels, newspapers and fairy tales. RSS was later extended with over 1,700 utterances from two new female speakers and now comprises 5.5 hours of speech.
The Romanian Anonymous Speech Corpus (RASC) is a dataset that applied the concept of crowd-sourcing to collect Romanian spoken data from the general population, by developing an open interactive platform. The corpus currently contains 4.8 hours of transcribed audio.
The Common Voice (CV) corpus is a massively multilingual dataset of transcribed speech. At the time of writing, the Romanian version contains 9 hours of transcribed audio (6 hours validated) recorded by 130 speakers, using sentences from the Romanian Wikipedia.
VoxPopuli is a large-scale multilingual corpus that contains 100,000 hours of raw audio in 23 languages and 1,800 hours of transcribed speech in 16 languages. One of the languages found in the corpus is Romanian, with 4,500 hours of unlabelled speech and 83 hours of transcribed audio.
The Multilingual corpus of Sentence-aligned Spoken utterances (MaSS) is a speech dataset based on readings of the Bible. The dataset contains 8,130 parallel spoken utterances in eight languages, thus also allowing the construction of end-to-end speech translation systems. The Romanian version contains 23 hours of spoken data.
Table I summarizes the statistics of the publicly available Romanian speech corpora presented above.
III Corpus acquisition
The ROBINTASC corpus was collected at RACAI during 2020, as part of the ROBIN project. It was recorded by 6 speakers of different genders (3 males and 3 females) and ages. For recording purposes, the RELATE platform was extended to allow audio files to be recorded, stored and listened to.
The audio processing component is activated if a corpus is created within the platform by specifying that it contains audio files. This enables all bimodal processing features. Since we start with text sentences for which we aim to provide recordings, the first step is to upload the associated texts. These can be uploaded either as separate text files or as a single CSV file containing each sentence on a different line. In the latter case, the platform allows specifying the column containing the text as well as CSV characteristics such as headers, column separators, enclosing characters and optional characters indicating comments (lines to be skipped).
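As an illustration of the CSV options listed above, the following minimal Python sketch extracts the sentences to be recorded from a CSV upload. It is not the RELATE code; the function name `read_sentences` and all of its parameter names are hypothetical and were chosen only to mirror the options mentioned in the text (text column, header, separator, enclosing character, comment character).

```python
import csv


def read_sentences(csv_text, text_column=0, has_header=False,
                   delimiter=",", quotechar='"', comment_char=None):
    """Return the sentence found in `text_column` of every data row.

    All names here are illustrative assumptions, not the RELATE API.
    """
    lines = csv_text.splitlines()
    if comment_char:
        # Lines starting with the comment character are skipped entirely.
        lines = [line for line in lines if not line.startswith(comment_char)]
    reader = csv.reader(lines, delimiter=delimiter, quotechar=quotechar)
    rows = [row for row in reader if row]   # ignore empty lines
    if has_header and rows:
        rows = rows[1:]                     # drop the header row
    return [row[text_column].strip() for row in rows]


sample = '# test upload\nid;text\n1;"Pa!"\n2;Care e cel mai scump leptop?\n'
sentences = read_sentences(sample, text_column=1, has_header=True,
                           delimiter=";", comment_char="#")
print(sentences)
```

Because each sentence occupies one line, the sketch splits the input on line boundaries before parsing; quoted fields spanning several lines would need a different approach.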
Recordings are stored as WAV files with a sample rate of 44.1 kHz using 16-bit signed integers. The recording component has a PHP back-end allowing it to store the files in the bimodal corpus, together with the associated text. In order to allow multiple speakers to record the same sentences, the file name incorporates the speaker pseudonym, thus creating unique file names for each of the speakers. Furthermore, for text uploaded as CSV files, the file name also contains the line number from the corresponding CSV file.
We did not use a “studio” environment for the recordings. Instead, each speaker used his/her own hardware (headphones or dedicated microphone). At any time after a sentence is recorded, the speaker (or another person given access to the corpus) can listen to the recording, download the associated WAV file and, if issues were detected (e.g., there was an unwanted noise or the speaker realises the pronunciation was not correct), delete it. Deleting a recording causes the associated sentence to re-appear in the recording component, so that the speaker can re-record it.
After all the sentences were recorded, as part of the packaging process, the text was annotated using UDPipe as integrated in the RELATE platform. This provides linguistic annotations such as part-of-speech (using both universal part-of-speech tags (https://universaldependencies.org/u/pos/) and language-dependent MSD tags), lemmatization and dependency relations. No phonetic transcription is made. The resulting annotations are stored in tab-separated CoNLL-U (https://universaldependencies.org/format.html) files.
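To make the annotation format concrete, here is a minimal Python sketch that reads CoNLL-U text into per-sentence token lists, relying only on the ten tab-separated columns defined by the format. The sample annotation of “Pa!” is hypothetical (in particular, placing the MSD tag in the XPOS column is an assumption), and `parse_conllu` is an illustrative helper, not part of UDPipe or RELATE.

```python
# The ten columns of a CoNLL-U token line, in order.
CONLLU_FIELDS = ("id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc")


def parse_conllu(text):
    """Return a list of sentences, each a list of token dictionaries."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():            # a blank line ends a sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):        # sentence-level comment / metadata
            continue
        columns = line.split("\t")
        if len(columns) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(columns)}")
        token = dict(zip(CONLLU_FIELDS, columns))
        # Skip multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if "-" in token["id"] or "." in token["id"]:
            continue
        current.append(token)
    if current:
        sentences.append(current)
    return sentences


# Hypothetical annotation, for illustration only.
SAMPLE = "\n".join([
    "# text = Pa!",
    "\t".join(["1", "Pa", "pa", "INTJ", "I", "_", "0", "root", "_", "SpaceAfter=No"]),
    "\t".join(["2", "!", "!", "PUNCT", "EXCL", "_", "1", "punct", "_", "_"]),
    "",
])

for token in parse_conllu(SAMPLE)[0]:
    print(token["form"], token["lemma"], token["upos"], token["deprel"])
```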
Finally, a script was created to gather all the generated files (raw text, text annotations, sound recordings), anonymize the speakers, add metadata and create a single archive with the corpus. Text file names use the pattern Sn.txt where n is the sentence number (starting with 0 and ending with 710). Corresponding annotation files use the pattern Sn.conllu. Sound files use the pattern Sn_s.wav, where n continues to represent the sentence number and s represents the speaker number (from 1 to 6).
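The naming scheme described above can be captured in a few lines of Python. This is only a sketch of the convention as stated in the text (sentence numbers 0 to 710, speaker numbers 1 to 6); the helper names are hypothetical, and the absence of zero-padding in the numbers is an assumption.

```python
import re

FIRST_SENTENCE, LAST_SENTENCE = 0, 710   # sentence numbers given in the text
FIRST_SPEAKER, LAST_SPEAKER = 1, 6       # speaker numbers given in the text

AUDIO_NAME = re.compile(r"S(\d+)_(\d+)\.wav")


def text_name(sentence):
    return f"S{sentence}.txt"


def annotation_name(sentence):
    return f"S{sentence}.conllu"


def audio_name(sentence, speaker):
    return f"S{sentence}_{speaker}.wav"


def parse_audio_name(name):
    """Return (sentence, speaker) for a valid audio file name, else None."""
    match = AUDIO_NAME.fullmatch(name)
    if match is None:
        return None
    sentence, speaker = int(match.group(1)), int(match.group(2))
    if not FIRST_SENTENCE <= sentence <= LAST_SENTENCE:
        return None
    if not FIRST_SPEAKER <= speaker <= LAST_SPEAKER:
        return None
    return sentence, speaker


print(text_name(0), annotation_name(0), audio_name(0, 1))
print(parse_audio_name("S710_6.wav"))
```

Parsing the audio file name recovers the sentence number, which is enough to locate the matching text and annotation files for any recording.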
A metadata file was generated with corpus and speaker characteristics, including the number of sentences, the total duration, each speaker’s gender and age, the number of files recorded by each speaker, and information about the recording device used. In order to anonymize the corpus, a speaker’s age is given only as an interval (for example “40-50” years).
IV Corpus statistics
Statistics were computed at all levels: audio files, raw text and annotated text. For text-related statistics, the RELATE platform was used, while audio information was extracted using the soxi utility from the SoX (Sound eXchange, http://sox.sourceforge.net/) software package. Audio statistics are given in Table II and text-related statistics are given in Table III.
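The audio statistics were obtained with soxi; as a rough stand-in, the sketch below derives the same kind of figures (file count, total, minimum and maximum duration) with Python's standard `wave` module. The helper names are hypothetical, and the in-memory silent files merely replace real corpus recordings so that the example is self-contained.

```python
import io
import wave


def make_silent_wav(seconds, rate=44100):
    """Build an in-memory mono 16-bit WAV file (placeholder for a recording)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)           # 2 bytes per sample = 16-bit PCM
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(rate * seconds))
    buffer.seek(0)
    return buffer


def duration_seconds(source):
    """Duration of one WAV file (path or file-like object), from its header."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def audio_statistics(sources):
    """Aggregate duration statistics over a collection of WAV files."""
    durations = [duration_seconds(source) for source in sources]
    return {
        "files": len(durations),
        "total_s": sum(durations),
        "min_s": min(durations),
        "max_s": max(durations),
        "mean_s": sum(durations) / len(durations),
    }


stats = audio_statistics([make_silent_wav(1.0), make_silent_wav(0.5)])
print(stats)
```

For the real corpus, the list of in-memory buffers would be replaced by the paths of the WAV files in the archive.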
|Number of WAV files||3786|
|Encoding||Signed Int16 PCM|
|Number of text files||711|
|Total text size||57 KB|
|Maximum file size||122 bytes|
|Minimum file size||3 bytes|
|Average file size||81.8 bytes|
|Number of tokens||11,927|
The smallest text and the corresponding smallest audio recording, as indicated in the statistics tables, are associated with the simple interaction “Pa!” (“Bye!”). An example of an average-sized text file is: “Care e cel mai scump leptop acer, cu placă grafică dedicată tesla pe o sută și opt gigabaiți ram?” (“Which is the most expensive ACER laptop, with a dedicated Tesla P100 graphics card and 8 gigabytes of RAM?”). In the Romanian text, English words are spelled according to their pronunciation rather than their written form (e.g. “leptop” for “laptop”). Furthermore, numbers are written out in words.
The most frequent lemmas (given in Table IV) show that most of the sentences concern the purchase of laptops. Notice that the word for computer memory (“ram”) appears in over half of the sentences. Even though the text corpus is rather small, the number of hapax legomena (words appearing only once) is low (only 58 words, as indicated in Table III).
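For readers unfamiliar with the term, the following short Python sketch shows one way hapax legomena can be counted. Lower-casing the words and whitespace tokenization are assumptions of this sketch; the tokenization used for Table III may differ.

```python
from collections import Counter


def hapax_legomena(words):
    """Return, in alphabetical order, the words that occur exactly once."""
    counts = Counter(word.lower() for word in words)
    return sorted(word for word, count in counts.items() if count == 1)


tokens = "care e cel mai scump leptop care e cel mai ieftin leptop".split()
print(hapax_legomena(tokens))
```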
|Tag||Occurrences||# Unq. Lemmas|
The most frequent part-of-speech tags are presented in Table V. Nouns and adjectives are the most frequent, reflecting the need for the corpus to cover computer parts with different characteristics. Numerals are the fifth most frequent tag, corresponding to the different quantities associated with the computer parts mentioned in the text.
The lexical diversity of the corpus is given by the number of unique lemmas and their proportion in the whole number of occurrences for each part of speech. As one can notice in the last column of Table V, the corpus is not very lexically diverse, as our aim was to capture a variety of ways in which relevant terms in this micro-world are pronounced.
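The per-tag diversity measure described above (unique lemmas divided by occurrences) can be sketched as follows. The input is assumed to be a list of (lemma, part-of-speech tag) pairs, for example taken from the LEMMA and UPOS columns of the annotation files; the toy data and the function name are hypothetical.

```python
from collections import defaultdict


def lexical_diversity(tokens):
    """Map each tag to its occurrences, unique lemmas and their ratio."""
    occurrences = defaultdict(int)
    lemmas = defaultdict(set)
    for lemma, tag in tokens:
        occurrences[tag] += 1
        lemmas[tag].add(lemma)
    return {
        tag: {
            "occurrences": occurrences[tag],
            "unique_lemmas": len(lemmas[tag]),
            "ratio": len(lemmas[tag]) / occurrences[tag],
        }
        for tag in occurrences
    }


# Toy data, for illustration only.
toy = [("laptop", "NOUN"), ("laptop", "NOUN"), ("ram", "NOUN"), ("scump", "ADJ")]
diversity = lexical_diversity(toy)
print(diversity)
```

A ratio close to 1 means that almost every occurrence of a tag introduces a new lemma, while a low ratio indicates that the same few lemmas are repeated many times.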
Speaker related statistics are presented in Table VI. This includes the gender, age group and the number of recordings.
V Corpus usage within the ROBIN project
The primary reason behind the construction of the ROBINTASC corpus was the improvement of the ROBIN project’s components involved in the micro-world scenario associated with a human-robot interaction in a computer store. The following sub-sections present an overview of the influence of this corpus on the software components: ASR and dialogue manager.
V-A Automatic Speech Recognition
The baseline ASR system, based on the Deep Speech 2 architecture, includes 4 Long Short-Term Memory (LSTM) cells of 768 neurons, 1 look-ahead layer and 1 dense layer on top of which the softmax function is applied to create the output distribution over the possible characters. The ROBINTASC fine-tuned version of the ASR system started from the baseline weights and was fine-tuned on the training part of the ROBINTASC corpus. The KenLM language model used to correct the transcriptions was also modified in the fine-tuned version to better mimic the ROBINTASC word distribution, by replicating each sentence 10 times in the text part of the training portion of ROBINTASC. The text replication step was performed in order to reuse an existing automatic processing pipeline. This is not a limitation of the model itself, which could have been adjusted using the model’s weights instead of replicating the text.
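The effect of the replication step can be illustrated with a small Python sketch. It assumes that the replicated ROBINTASC sentences are appended to a general text collection before the language model is estimated, which is one possible reading of the description above; the function names and the toy sentences are hypothetical, and the KenLM training itself is not shown.

```python
from collections import Counter


def build_lm_text(general_sentences, domain_sentences, factor=10):
    """Append each in-domain sentence `factor` times to the general text."""
    replicated = [s for s in domain_sentences for _ in range(factor)]
    return list(general_sentences) + replicated


def relative_frequency(sentences, word):
    """Share of all tokens in `sentences` that are equal to `word`."""
    counts = Counter(token for s in sentences for token in s.split())
    return counts[word] / sum(counts.values())


general = ["bună ziua", "ce mai faci"]   # toy general-domain text
domain = ["vreau un leptop"]             # toy in-domain sentence

before = relative_frequency(build_lm_text(general, domain, factor=1), "leptop")
after = relative_frequency(build_lm_text(general, domain, factor=10), "leptop")
print(before, after)
```

Replication raises the relative frequency of in-domain words and word sequences, so a count-based n-gram model estimated on the resulting text assigns them higher probability.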
The transcription performance was assessed on a test corpus containing new sentences pronounced by one female and one male speaker who also recorded samples in the ROBINTASC training part, with speaker ids 5 (F5-test) and 1 (M1-test) respectively, together with a new male voice (M-new). Word error rate (WER) and character error rate (CER) are known to be better on test data sampled from the same corpus as the training data than on unseen data sets containing voices that did not participate in the recording of the training data. Thus, we wanted to compare the fine-tuned ASR system with the baseline version under conditions close to real-world usage.
The test corpus contains 50 questions that were designed to stress-test the ability of the fine-tuned ASR to adapt to the computer store domain. These sentences contain computer hardware-related companies that were found in ROBINTASC (e.g. Intel, CUDA, NVIDIA, etc.), but also new company names (e.g. Nokia, Siriux, etc.) or device names (e.g. ”smart phone”). All English words have been phonetically transcribed to Romanian, following the design principles of ROBINTASC, in order to see if the ASR system can learn English pronunciations (e.g. ”smart făun/fon” for ”smart phone”).
We evaluated both the baseline ASR system and the ROBINTASC fine-tuned ASR system on the test corpus. The results of the two versions are outlined in Table VII. It can be observed that the fine-tuning process improved the performance of the model for all three speakers, improving the average WER by 16.3% and the average CER by 7.8%. The highest and the lowest improvements are obtained on the female voice from train (F5-test), 24.16% WER, and on the male voice from train (M1-test), 10.34% WER respectively. The performance on the new male voice (M-new) was enhanced by 14.33% WER with fine-tuning.
Looking at the generated transcriptions, we can explain some of the errors in the following way:
- some clitics were not properly transcribed: “haș pe -ul” vs. “haș pe ul”, “care-l” vs. “care -l” or “care îl” (see also the discussion on the treatment of clitics in );
- one word is sometimes recognized as two consecutive words: “ultra portabil” instead of “ultraportabil”;
- some of the English terminology in the test corpus has more than one possible phonetic transcription, and in ROBINTASC all of these have been used: “uindous”, “uindos” or “uindăus” for the English “Windows”;
- in general, new English phonetically transcribed terminology is not properly recognized: “ol in oane” (“All in one”) vs. “oli man” or “linăx cent ău es” (“Linux CentOS”) vs. “linăx centes”.
Other reasons for the high WER values, for both the baseline and fine-tuned models, can be attributed to the different recording conditions and the amount of data used to train the models.
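For reference, WER and CER are conventionally computed as the Levenshtein (edit) distance between the reference and the recognized text, at word and character level respectively, divided by the length of the reference. The Python sketch below follows this standard definition; it is not the evaluation script used for Table VII, and any text normalization applied there is not reproduced.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (words or characters)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution / match
        previous = current
    return previous[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference length."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Error examples taken from the list above.
print(wer("linăx cent ău es", "linăx centes"))
print(wer("ultraportabil", "ultra portabil"))
print(cer("ultraportabil", "ultra portabil"))
```

The “ultraportabil” example shows why split words are costly at word level: a single missing word boundary counts as one substitution plus one insertion, while the CER is affected by only one character.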
V-B Dialogue manager
The ROBIN Dialogue Manager (RDM) is a Java-based dialogue manager that works with micro-worlds. A micro-world is a set of definitions of spoken-about concepts, predicates that hold among them, ASR and TTS systems that work well in the micro-world and any other piece of information that would make an autonomous system (e.g. a robot) handle specific tasks in the micro-world. In the case of the notebook department of an electronics store micro-world, the robot should be able to give technical details and pricing for the existing stock of laptops.
RDM has been designed to work on the Pepper robot (https://www.softbankrobotics.com/emea/en/pepper), enabling it to listen and respond to users’ questions in Romanian. When enough information has been gathered through the conversation, RDM can supply predefined action items to the robot’s planning algorithm, e.g. “Let me find out if your laptop is in stock.”
We have empirically evaluated RDM with the baseline and fine-tuned ASR systems by asking it different questions appropriate to the electronics store micro-world. We do not have a quantitative evaluation of how much better the fine-tuned ASR system is in this setting, but it was noticeably better than the baseline, mainly because English terminology was not handled at all by the baseline model, whereas the fine-tuned system handled it acceptably well as long as the English terms were present in the ROBINTASC corpus, e.g. “leptop” (English “laptop”), “haș pe” (“HP”), “gigabaiți” (“GB”), “epăl” (“Apple”), etc. This indicates that the fine-tuned ASR can be further improved with new English terms, should the need arise.
VI Conclusion
This paper introduced ROBINTASC, a new Romanian language speech corpus from the ROBIN project. We have shown that it had a positive influence on two components developed within the ROBIN project, namely an ASR system based on the Deep Speech 2 architecture and a dialogue manager developed for micro-world scenarios. The corpus is openly available under a Creative Commons Attribution-NonCommercial-NoDerivatives (CC BY-NC-ND) 4.0 license (https://creativecommons.org/licenses/by-nc-nd/4.0/) and can be downloaded from the Zenodo platform (https://doi.org/10.5281/zenodo.4626540).
The research described in this article was supported by a grant of the Romanian National Authority for Scientific Research and Innovation, CNCS – UEFISCDI, project number PN-III 72PCCDI ⁄ 2018, ROBIN – “Roboții și Societatea: Sisteme Cognitive pentru Roboți Personali și Vehicule Autonome”.
-  D. Tufiș, V. Barbu Mititelu, E. Irimia, M. Mitrofan, R. Ion and G. Cioroiu, ”Making Pepper Understand and Respond in Romanian”, 2019 22nd International Conference on Control Systems and Computer Science (CSCS), Bucharest, Romania, 2019, pp. 682-688, doi: 10.1109/CSCS.2019.00122.
-  R. Ion, V. Badea, G. Cioroiu, V. Barbu Mititelu, E. Irimia, M. Mitrofan, and D. Tufiș, ”A Dialog Manager for Micro-Worlds”, Studies in Informatics and Control, 2020, 29(4), pp. 411-420.
-  A.M. Avram, V. Păiș, D. Tufiș, ”Towards a Romanian end-to-end automatic speech recognition based on DeepSpeech2”, 2020, Proc. Ro. Acad., Series A, Volume 21, No. 4, pp. 395-402.
-  A.M. Avram, V. Păiș, D. Tufiș, ”Romanian speech recognition experiments from the ROBIN project”, Proceedings of the 15th International Conference Linguistic Resources and Tools for Natural Language Processing (CONSILR), 2020, pp. 103-114.
-  D. Tufiș, V. Barbu Mititelu, E. Irimia, V. Păiș, R. Ion, N. Diewald, M. Mitrofan, M. Onofrei, ”Little strokes fell great oaks. Creating CoRoLa, the reference corpus of contemporary Romanian”, Revue Roumaine de linguistique, 2019, LXIV (3).
-  T. Boroș, Ș. Dumitrescu, and V. Păiș, ”Tools and resources for Romanian text-to-speech and speech-to-text applications”, Proceedings of the International Conference on Human-Computer Interaction – RoCHI 2018, pp 46-53.
-  D. Cristea, I. Pistol, Ș. Boghiu, A.D. Bibiri, D. Gîfu, A. Scutelnicu, M. Onofrei, D. Trandabăț, G. Buceag, ”CoBiLiRo: A Research Platform for Bimodal Corpora”, Proceedings of the 1st International Workshop on Language Technology Platforms (IWLTP 2020), pp. 22–27, Language Resources and Evaluation Conference (LREC 2020), Marseille, 11–16 May 2020.
-  A. Stan, J. Yamagishi, S. King, M. Aylett, ”The Romanian speech synthesis (RSS) corpus: Building a high quality HMM-based speech synthesis system using a high sampling rate”, Speech Communication, 2011, pp. 442-450.
-  A.L. Georgescu, H. Cucu, A. Buzo and C. Burileanu, ”RSC: A Romanian Read Speech Corpus for Automatic Speech Recognition”, Proceedings of the 12th Language Resources and Evaluation Conference, Marseille, France, 2020, pp. 6606-6612.
-  A.L. Georgescu, A. Caranica, H. Cucu and C. Burileanu, ”Rodigits-A Romanian Connected-Digits Speech Corpus For Automatic Speech And Speaker Recognition”, University Politehnica of Bucharest Scientific Bulletin, Series C, 2018, Vol. 80, Iss. 3, pp. 45-62.
-  R. Ardila, M. Branson, K. Davis, M. Henretty, M. Kohler, J. Meyer, R. Morais, L. Saunders, F.M. Tyers, and G. Weber, ”Common voice: A massively-multilingual speech corpus”, arXiv:1912.06670, 2019.
-  V. Păiș, R. Ion, D. Tufiș, ”A Processing Platform Relating Data and Tools for Romanian Language”, Proceedings of the 1st International Workshop on Language Technology Platforms, European Language Resources Association, 2020, pp. 81-88.
-  M. Straka, J. Hajic, J. Straková, ”UDPipe: Trainable Pipeline for Processing CoNLL-U Files Performing Tokenization, Morphological Analysis, POS Tagging and Parsing”, Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC), Portorož, Slovenia, 2016.
-  V. Păiș, ”Multiple annotation pipelines within the RELATE platform”, Proceedings of the 15th International Conference Linguistic Resources and Tools for Natural Language Processing (CONSILR), 2020, pp. 65-75.
-  C. Wang, M. Rivière, A. Lee, A. Wu, C. Talnikar, D. Haziza, M. Williamson, J. Pino, E. Dupoux, ”VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation”, arXiv preprint arXiv:2101.00390.
-  D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen, J. Chen, et al., ”Deep speech 2: End-to-end speech recognition in english and mandarin”. Proceedings of the 33rd International Conference on Machine Learning (PMLR), 2016, pp. 173-182.
-  Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, ”Gradient-based learning applied to document recognition”, Proceedings of the IEEE, 1998, pp. 2278-2324.
-  S. Hochreiter and J. Schmidhuber, ”Long short-term memory”. Neural Computation, 1997, pp. 1735-1780.
-  A. Stan, F. Dinescu, C. Ţiple, Ș. Meza, B. Orza, M. Chirilă, M. Giurgiu, ”The SWARA speech corpus: A large parallel Romanian read speech dataset”. Proceedings of the 9th International Conference on Speech Technology and Human-Computer Dialogue (SpeD), 2017, pp. 1-6.
-  R. Ion, V. G. Badea, G. Cioroiu, V. Barbu Mititelu, E. Irimia, M. Mitrofan, D. Tufiș, ”A Dialog Manager for Micro-Worlds”. Studies in Informatics and Control, ISSN 1220-1766, vol. 29(4), pp. 411-420, 2020. https://doi.org/10.24846/v29i4y202003
-  T. Likhomanenko, Q. Xu, V. Pratap, P. Tomasello, J. Kahn, G. Avidov, R. Collobert, G. Synnaeve, ”Rethinking Evaluation in ASR: Are Our Models Robust Enough?”. arXiv:2010.11745v3 [cs.LG]
-  M. Z. Boito, W. N. Havard, M. Garnerin, E. L. Ferrand, L. Besacier, ”Mass: A large and clean multilingual corpus of sentence-aligned spoken utterances extracted from the bible”, Proceedings of The 12th Language Resources and Evaluation Conference (LREC), 2020, pp. 6486-6493.
-  S. D. Dumitrescu, T. Boroș, R. Ion. ”Crowd-sourced, automatic speech-corpora collection–Building the Romanian Anonymous Speech Corpus”, CCURL 2014: Collaboration and Computing for Under-Resourced Languages in the Linked Open Data Era, 2014, pp. 90-94.
-  A. Kabir, M. Giurgiu, ”A romanian corpus for speech perception and automatic speech recognition”. Proceedings of The 10th International Conference on Signal Processing, Robotics and Automation, 2011, pp. 323-327.
-  C. Manolache, A.-L. Georgescu, V. Barbu Mititelu, H. Cucu and C. Burileanu, ”Improved Text Normalization and Language Models for SpeeD’s Automatic Speech Recognition System”. Proceedings of the 15th International Conference “Linguistic Resources and Tools for Natural Language Processing”, online, 14-15 December 2020, Editura Universității A. I. Cuza, Iași, 2020, pp. 115-128.