
CORAA: a large corpus of spontaneous and prepared speech manually validated for speech recognition in Brazilian Portuguese

10/14/2021
by   Arnaldo Candido Junior, et al.

Automatic Speech Recognition (ASR) is a complex and challenging task. In recent years, there have been significant advances in the area. In particular, for the Brazilian Portuguese (BP) language, there were about 376 hours publicly available for the ASR task until the second half of 2020. With the release of new datasets in early 2021, this number increased to 574 hours. The existing resources, however, are composed of audios containing only read and prepared speech. There is a lack of datasets including spontaneous speech, which is essential in different ASR applications. This paper presents CORAA (Corpus of Annotated Audios) v1, a publicly available dataset for ASR in BP with 290.77 hours of validated pairs (audio-transcription). CORAA also contains European Portuguese audios (4.69 hours). We also present a public ASR model based on Wav2Vec 2.0 XLSR-53 and fine-tuned over CORAA. Our model achieved a Word Error Rate of 24.18% on the CORAA test set and 20.08% on the Common Voice test set. When measuring the Character Error Rate, we obtained 11.02% and 6.34% for CORAA and Common Voice, respectively. The CORAA corpora were assembled both to improve ASR models in BP with phenomena from spontaneous speech and to motivate young researchers to start their studies on ASR for Portuguese. All the corpora are publicly available at https://github.com/nilc-nlp/CORAA under the CC BY-NC-ND 4.0 license.

1 Introduction

Automatic Speech Recognition (ASR) is a complex and challenging task. Significant progress in techniques and models for the task has occurred in recent years. The main reasons for this progress include (but are not limited to) the availability of large-scale datasets and advances in deep learning methods running on powerful computing platforms.

Despite significant advances in ASR benchmarking solutions, the main large datasets available for training and evaluating ASR systems are in English, due to the predominance of the language in science and business, although there are some current efforts to build multilingual speech corpora [ardila-etal-2020-common, wang-etal-2020-covost, wang2020covost, Pratap_2020]. Another limitation is the recording environment, mostly composed of clean speech. Regarding the speaking style, these datasets contain read speech, such as [Panayotov2015LibrispeechAA, Pratap_2020, ardila-etal-2020-common, wang-etal-2020-covost, zanonboito-EtAl:2020:LREC], or prepared speech, like [HernandezNGTE18, salesky2021mTeDx].

In this paper, we focus on a specific language, Brazilian Portuguese (BP), which was struggling with only a few dozen hours of public data available until the middle of 2020. The previous open datasets to train speech models in BP were much smaller than American English datasets, with only 10 hours for speech synthesis (TTS, https://github.com/Edresson/TTS-Portuguese-Corpus) and 60 hours for ASR. The resource commonly used to train ASR models for BP is an ensemble of four small, non-conversational speech datasets: the Common Voice Corpus version 5.1 (Mozilla, https://commonvoice.mozilla.org/pt/datasets), the Sid dataset (https://doi.org/10.17771/PUCRio.acad.8372), VoxForge (http://www.voxforge.org/pt/downloads), and LapsBM1.4 (https://laps.ufpa.br/falabrasil/).

In the second half of 2020, three new datasets were made available: (i) the BRSD v2, which includes the CETUC dataset [Alencar_2008] (with almost 145 hours), plus 12 hours and 30 minutes of non-conversational speech from 3 small open datasets (BRSD v2 also includes CSLU: Spoltech Brazilian Portuguese Version 1.0, https://catalog.ldc.upenn.edu/LDC2006S16) [Macedo_Quintanilha_2020]; (ii) the Multilingual LibriSpeech (MLS), derived from reading LibriVox audiobooks in 8 languages, including BP [Pratap_2020], with 169 hours; and (iii) the Common Voice dataset version 6.1 (https://commonvoice.mozilla.org/pt/datasets) [ardila-etal-2020-common], with 50 validated hours, composed of recordings of read sentences which were displayed on the screen. These three datasets total 376 hours. Given this recent public availability of large audio databases for the BP language, the lack of resources has been gradually reduced, although it is still far from ideal when compared to resources for the English language.

In early 2021, a new dataset with prepared speech, called the Multilingual TEDx Corpus [salesky2021mTeDx], was made publicly available, providing 765 hours to support speech recognition and speech translation research. The Multilingual TEDx Corpus is composed of a collection of audio recordings from TEDx talks in 8 source languages, including 164 hours of Portuguese. Moreover, a new version of the Common Voice dataset (Common Voice Corpus 7.0) was launched with 84 validated hours, an increment of 34 hours over the previous version. Therefore, the BP language is currently represented with 574 hours of speech data which can be used to train new ASR models.

However, there is still a lack of datasets with audio files that record spontaneous speech of various genres, from interviews to informal dialogues and conversations, i.e., conversational speech recorded in natural contexts and noisy environments, needed to train robust ASR systems. Spontaneous speech presents several phenomena, such as laughter, coughs, filled pauses, and word fragments resulting from repetitions, restarts and revisions of the discourse. This gap hinders the development of both high-quality dialog systems and automatic speech recognition systems capable of handling spontaneous speech recorded in noisy environments. The latter are called rich transcription-style ASR (RT-ASR) when they explicitly convert the phenomena cited above into special tokens [inaguma17_interspeech, Fujimura2018SimultaneousSR, tanaka21c_interspeech]. Dialog systems, for example, must deal with several types of speech disfluencies, preserving them instead of removing filled pauses and word fragments [BaumannKHS16]. In general, it is expected that ASR systems trained on read-style, clean speech will face a drop in performance when dealing with informal conversations in contexts of free interaction and noisy environments.

The TaRSila project is an effort of the Center for Artificial Intelligence (C4AI, http://c4ai.inova.usp.br/pt/nlp2-pt/) to make available language resources that bring natural language processing of BP to the state of the art. The project aims at growing speech datasets for the BP language in order to achieve state-of-the-art results for automatic speech recognition, multi-speaker synthesis, speaker identification, and voice cloning. In a joint effort of two research centers, the C4AI and the CEIA (http://centrodeia.org/; Center of Excellence in Artificial Intelligence, in English), four speech corpora composed of prepared speech, guided interviews and spontaneous speech from academic projects were manually validated to serve as an ASR benchmark for BP. The projects are: (i) ALIP [Goncalves_2019]; (ii) C-ORAL Brasil I [raso_mello_2012]; (iii) NURC-Recife [Oliveira_2016]; and (iv) SP2010 [MENDES_OUSHIRO_2012]. We also validated 76.36 hours of prepared speech from a collection of TEDx Talks (https://www.ted.com/) in Brazilian Portuguese, including 4.69 hours of European Portuguese, to allow experiments with Portuguese language variants.

1.1 Goals

In this paper we present a new publicly available dataset called CORAA (Corpus of Annotated Audios) v1. CORAA has 290.77 hours of validated pairs (audio-transcription) and is composed of five corpora: ALIP [Goncalves_2019], C-ORAL Brasil I [raso_mello_2012], NURC-Recife [Oliveira_2016], SP2010 [MENDES_OUSHIRO_2012], and TEDx Portuguese talks. Information about each corpus is presented in Table 1. The original size of each dataset in hours is given as reported in the respective original papers, when reported by the authors. Regarding SP2010, the total duration is estimated, since the authors report 60 recordings of 60 to 70 minutes each; the total hours of ALIP were computed after download. All the corpora are publicly available at https://github.com/nilc-nlp/CORAA under the CC BY-NC-ND 4.0 license (currently, only the test set is not available, because it will be released after an ASR Challenge involving CORAA). These corpora were assembled with the purpose of improving ASR models in BP with phenomena from spontaneous speech and noise, and of motivating young researchers in this exciting research area.

Corpus          | ALIP                   | C-ORAL Brasil I                      | NURC-Recife                                       | SP2010                             | TEDx Portuguese
Speech Genres   | Interviews, Dialogues  | Monologues, Dialogues, Conversations | Dialogues, Interviews, Conference and Class Talks | Conversations, Interviews, Reading | Stage Talks
Speaking Styles | Spontaneous Speech     | Spontaneous Speech                   | Spont. and Prepared Speech                        | Spont. and Read Speech             | Prepared Speech
Accent          | São Paulo State Cities | Minas Gerais                         | Recife                                            | São Paulo Capital                  | Misc.
Original (hrs)  | 78                     | 21.13                                | 279                                               | 65                                 | 249

Table 1: Speech Genres, Accents, Speaking Styles and Hours (in decimal) in each original CORAA Corpora

As an example of the feasibility of speech recognition with CORAA, we present a speech recognition experiment using Wav2Vec 2.0 XLSR-53 [conneau2020unsupervised, baevski2020wav2vec]. Furthermore, we compare our model with the state of the art in automatic speech recognition in Brazilian Portuguese [gris2021brazilian]. These two models are evaluated according to three main scenarios: (a) testing audios with different characteristics from training; (b) focusing on model performance for each of the five corpora, considering noise level and accent; and (c) analyzing the impact of spontaneous and prepared speech styles on the trained models.

1.2 Highlights

The main contributions made in this work are summarised as follows.

  1. A large BP corpus of validated pairs (audio-transcription) containing 290.77 hours, composed of five corpora (ALIP, C-ORAL Brasil I, NURC Recife, SP2010, and TEDx Portuguese talks), adapted for the task of ASR in BP. We also include 4.69 hours of European Portuguese (in TEDx Portuguese).

  2. To the best of our knowledge, the first corpus tackling spontaneous speech for ASR in BP.

  3. A publicly available ASR model trained on the presented corpus.

Section 2 details both related work on datasets available for ASR in BP and the five spoken corpora projects used in CORAA v1. Section 3 describes the steps followed in preparing the CORAA corpus. Section 4 presents the creation of train, development and test splits of CORAA, the experiment on ASR for BP and an error analysis of our model. Finally, Section 5 presents the final remarks of the work.

2 Related Works on Speech Datasets and Spoken Corpora for BP

2.1 Open Datasets for Speech Recognition in BP

Three new datasets were released for BP at the end of 2020. The CETUC dataset [Alencar_2008] contains 145 hours from 100 speakers, half male and half female. The sentence set is composed of 1,000 sentences (3,528 words). The sentences are phonetically balanced and extracted from the CETEN-Folha corpus (https://www.linguateca.pt/cetenfolha/index_info.html). Each speaker uttered all sentences from the sentence set exactly once. CETUC was recorded in a controlled environment, using a sample rate of 16kHz. The audios are publicly available (https://igormq.github.io/datasets/), without an explicit license. Regarding the recording environment and speaking style, CETUC delivers clean and read speech.

Common Voice Corpus 6.1, version pt_63h_2020-12-11, contains 63 hours of audio, 50 of which were considered validated. The dataset comprises 1,120 BP speakers, 81% male and 3% female (some audios are not sex-labeled). The audios were collected using the Common Voice website (https://commonvoice.mozilla.org/pt/speak) or a mobile app. The speakers read aloud sentences presented on the screen. A maximum of 3 contributors analyzed each audio-transcription pair, and simple voting is applied: two votes for acceptance validate the audio; two votes for rejection invalidate it. A given release may also contain samples that were analyzed but did not receive enough votes to be validated/invalidated; these samples have the status "OTHER" [ardila-etal-2020-common]. Releases are distributed under the CC-0 license (https://commonvoice.mozilla.org/pt/datasets) and contain MP3 files, originally collected at a 48kHz sampling rate but downsampled to 16kHz. The following metadata is also available: ID_speaker, path_audio_file, read_sentence, up_votes, down_votes, age, gender, and accent, where up_votes and down_votes refer to the voting result, and the last three fields are optional. Regarding the speaking style, the Common Voice Corpus has read speech. As for the recording environment, both noise level and sound clarity are very heterogeneous. The current version of the dataset (Common Voice Corpus 7.0) has 84 validated hours, 34 hours more than version 6.1.

The Multilingual LibriSpeech (MLS) dataset [Pratap_2020] is composed of audios extracted from LibriVox (https://librivox.org/) audiobooks. The LibriVox project releases audiobooks in the public domain. The MLS dataset encompasses eight languages, including BP, and is released under the CC BY 4.0 license (http://www.openslr.org/94/). MLS can be used for developing both ASR and TTS models. For Portuguese, there are 160.96 hours for training models, 3.64 hours for tuning and 3.74 hours for testing. It provides 26 male and 16 female speakers in the training dataset; 5 female and 5 male speakers for tuning; and the same for testing. The audios were downsampled from 48kHz to 16kHz for easier processing. Regarding the recording environment and speaking style, MLS is made of clean and read speech.

In early 2021, a new dataset was made publicly available: the Multilingual TEDx Corpus, licensed under CC BY-NC-ND 4.0 (www.openslr.org/100). This dataset has recordings of TEDx talks in 8 languages, BP being one of them, represented with 164 hours and 93K sentences. Each TEDx talk is stored as a 44 or 48kHz sampled wav file. Available metadata include source language, talk title, speaker name, audio length, keywords, and a short talk description. The Multilingual TEDx Corpus was built to advance ASR and speech translation research, with multilingual and baseline models being distributed for both tasks. Regarding the speaking style and recording environment, the Multilingual TEDx Corpus is composed of prepared and clean speech.

2.2 Spoken corpora projects used in CORAA

2.2.1 ALIP

The ALIP project (https://www.alip.ibilce.unesp.br/; Amostra Linguística do Interior Paulista, Language Sample of the Interior of São Paulo, in English) [Goncalves_2019] was proposed and coordinated by Prof. Sebastião Carlos Leite Gonçalves, from UNESP São José do Rio Preto. This project was responsible for building the database called Iboruna [IBORUNA], composed of two types of speech samples:

  • A sample of interviews, with male and female voices, from the northwest region of the São Paulo state;

  • Another sample consisting of dialogues involving from two to five informants, secretly recorded in contexts of free social interaction. This sample has 28 informants (10 men and 18 women).

This corpus totals 78 hours and is characterized by the spontaneous speech of the linguistic variety of Brazilian Portuguese spoken in the interior of São Paulo. The informants, residents of different cities, cover a wide age range and a considerable variety of income and education.

The speech samples were recorded with GamaPower and PowerPack digital recorders. For the interviews, the consent of the informants was obtained before recording, while, for the dialogues, consent was obtained after recording. The interviewer conducted the interviews; the dialogues were free, with topics defined by the participant interactions.

The corpus is available for academic use without a defined license, but with defined Terms of Use and Privacy Policy (https://www.alip.ibilce.unesp.br/termos-de-uso). It is available via download from the project website. Each of the two types of samples has a dedicated folder containing .mp3 files (audios sampled at 8kHz), as well as .doc and .pdf files (transcriptions, informants' socio-demographic information, among others). It is important to note that the audio files are not aligned with their transcriptions.

2.2.2 C-ORAL Brasil I

C-ORAL Brasil I is a corpus resulting from the project C-ORAL Brasil (http://www.c-oral-brasil.org/), under the coordination of Tommaso Raso and Heliana Mello, from the Faculty of Arts of the Federal University of Minas Gerais [Raso_Mello_Mittmann_2015, raso-etal-2012, raso_mello_2012]. This synchronic corpus was recorded between 2008 and 2011 and is composed of informal and spontaneous speech, representative of the linguistic variation in Minas Gerais, especially in the city of Belo Horizonte.

It is composed of 139 recording sessions (or texts), totaling 21.13 hours and averaging 1,500 words per text. C-ORAL Brasil I has 362 informants. There is a balance regarding the number of uttered words: 50.36% of the words are uttered by 159 males and 49.64% by 203 females.

Its content is divided into private-family and public contexts. In addition, there is a separation of interaction types by number of participants: monologues, dialogues, and conversations (i.e., more than two active participants).

The speech flow was segmented into tonal units and terminal units according to the prosodic criterion, based on the Language Into Act Theory (L-AcT) [CRESTI2018] which designates the utterance as the reference unit of speech. The boundary between tonal units results from a prosodic break with a non-conclusive value, while the boundary between terminal units corresponds to the perception of a prosodic break with a conclusive value.

In order to obtain great diaphasic diversity, i.e., variation according to the communicative context, the project compiled a remarkable variety of scenarios, such as communication between players in a football match, the preparation of a drag queen for a presentation, and a conversation between a realtor and a client, among others. In addition, considerable balance was reached regarding the demographic criteria of the informants' education and gender. There are 362 informants in the corpus, 138 from the city of Belo Horizonte, 89 from other cities in Minas Gerais, and the rest from other states, countries, or of unknown origin.

There was an effort to use high-quality acoustic equipment at the time. The project used PMD660 Marantz digital recorders and Sennheiser Evolution EW100 G2 wireless kits. It also used non-invasive “clip-on” microphones to create a more natural environment, essential for recording high diaphasic variation in spontaneous speech.

C-ORAL Brasil I is available via download from the project website in raw format, morphosyntactically annotated by the parser Palavras [bick2000palavras], in addition to metadata. The C-ORAL Brasil I corpus is licensed under CC BY-NC-SA 4.0. The following files are of special interest for this work: (i) audio in .wav format, with a sampling rate of 48kHz; (ii) transcriptions in .rtf and .txt formats; and (iii) audio-transcription alignments in XML format generated by the software WinPitch (https://www.winpitch.com/).

2.2.3 NURC-Recife

The NURC-Recife corpus has its origins in the NURC (Norma Urbana Oral Culta) project, which documents the spoken language in five Brazilian capitals: Recife, Salvador, Rio de Janeiro, São Paulo and Porto Alegre. NURC-Recife corresponds to the part referring to the linguistic variety spoken in the city of Recife. The corpus is available on the website of the NURC Digital project (https://fale.ufal.br/projeto/nurcdigital/). The project NURC Digital, coordinated by Prof. Miguel Oliveira Jr. of the Federal University of Alagoas (UFAL), was responsible for processing, organizing and releasing the data of the NURC-Recife project in digital form [Oliveira_2016].

The project comprises 346 recordings (called inquiries in the project), obtained between 1974 and 1988. An analysis of all audio-transcription pairs found one inquiry lacking both audio and transcription and 11 inquiries lacking transcriptions, resulting in 279 hours effectively available on the website.

The recordings follow NURC guidelines and are categorized as follows:

  • Formal utterances (EF), consisting of 37 recordings of lectures and talks given by one speaker;

  • Dialogues between two informants (D2) conducted by a mediator, with 71 recordings;

  • Dialogues between an informant and an interviewer (DID), with 238 recordings.

The informants, all with higher education, cover a wide range of ages and were initially selected with an equal division between male and female voices.

The environment of the recordings varied, depending on the type of inquiry: specific rooms, classrooms, auditoriums or even in the informants’ homes. It also has very heterogeneous noise levels and sound clarity, whether from the equipment used, the recording environment or deterioration of the physical material.

The original recordings were captured with omnidirectional dynamic microphones on table supports. The reel-to-reel tape recorders used were the AKAI 4000 DS Mk–II, SONY TC–366, and Philips N 4416, the first being the most frequent. The audios were recorded on professional magnetic reel tapes (BASF TP 18 LH). Within the scope of the NURC Digital project, however, they were digitized following the recommendations of the Open Archival Information System (OAIS). For this digitization, the software Audacity and Audiofile Specter, the AKAI 4000 DS Mk–II reel-to-reel recorder, a Sound Devices USBPre 2 USB audio interface, and an RCA Diamond Cable JX-2055 were used.

NURC Digital is available for academic use, without a defined license, via download from the project website, which allows searching by recording year (1974 to 1988), recording topic, and type of inquiry (D2, DID, and EF). There is also information about the age range of the informants, gender, and audio quality. Within each inquiry folder there are: (i) the digitized documentation of the specific recording (metadata), in .pdf format; (ii) a file in TextGrid format, containing the audio timestamps with the transcriptions; (iii) the audio file of the recording in .wav format (48kHz); (iv) a copy of the audio file, also in .wav format, compressed at 44kHz; and (v) the original transcription in .pdf format.

2.2.4 SP2010

The SP2010 project [SP2010_website, MENDES_OUSHIRO_2012] was coordinated by Prof. Ronald Beline Mendes, of the Research Group in Sociolinguistics at FFLCH/USP (GESOL-USP), to document and study the Portuguese spoken in the city of São Paulo. The project was supported by the FAPESP agency between 2011 and 2013, generating a corpus publicly available for academic research.

The corpus contains 60 recordings of 60 to 70 minutes each (http://projetosp2010.fflch.usp.br/corpus), with an equal division of female and male voices. Each recording corresponds to an interview with an informant, comprising two parts:

  • an informal and spontaneous conversation, with questions about the informant’s neighborhood, family, childhood, work and leisure, seeking personal involvement;

  • the continuation of the conversation, but exploring more argumentative speech, with questions on more objective themes about the city of São Paulo, involving problems, solutions, and characterizations of the city and its inhabitants. In addition, there are three reading recordings: a list of words, a news article and a statement. Finally, specific questions about the sociolinguistic varieties of the city are proposed.

The informants were selected to represent sociolinguistic profiles characterized by distinct combinations of the following variables: age group (three age groups), education (two school stages represented: up to elementary school, and with higher education), and gender (male and female). Each sociolinguistic profile has five informants as representatives, each with one recording. The informants' region of residence within the city was also considered, and a balance of informants was sought in this regard, considering the division of São Paulo into 3 regions: Centro Velho, Centro Expandido and Periferia.

For the recordings, the authors used TASCAM DR100 MK2 digital recorders and Sennheiser HMD25-1 microphones, under varied recording conditions, with some interviews being noisier than others, as they were not conducted in specialized, isolated environments.

The material collected in the SP2010 project is made available via download from the project website, free of charge to the academic research community. Eight files are available for each interview: two audio files (in .wav stereo format, 44kHz, and in .mp3); four transcription files (in .eaf, .doc, .txt and TextGrid formats); the informant and recording forms (in .xls format); and a .zip file that contains all of the interview materials except the .wav file.

2.2.5 TEDx Portuguese

TEDx Portuguese is a new corpus compiled specifically for CORAA v1. It should not be confused with the BP audios available in the Multilingual TEDx Corpus (described in Section 2.1). TEDx Portuguese is based on TEDx Talks (https://www.ted.com/watch/TeDx-talks), events in which presentations on a wide range of topics take place, in the same format as TED Talks (https://www.ted.com/), but in languages other than English.

Although they are independent events, they are licensed and guided by the TED organization; that is, they are short presentations, containing prepared speech, with a recommended duration of less than 18 minutes, typically given by a single presenter. The "x" at the end indicates that the event is carried out by autonomous entities worldwide. More than 3,000 new recordings are made annually (https://www.ted.com/about/programs-initiatives/TeDx-program).

To create this dataset, we selected presentations spoken in Portuguese, both from Brazil and Portugal, with preexisting subtitles available. The selected presentations were downloaded, and the audios were extracted and converted to .wav format, mono, with a sampling rate of 44kHz. BP presentations have accents from practically all regions of Brazil.

The subtitles were also downloaded, and only the text was extracted, that is, the timestamps were discarded. The dataset is composed of excerpts from 908 talks (671 of which are in BP), totaling at least 908 different speakers, since there are also talks with more than one speaker. The variant (PT-PT or PT-BR) is annotated in the dataset metadata. Considering both variants, there are 543 male and 375 female voices.

3 Data Processing Pipeline

In this section, we present the processing steps of the CORAA corpus:

  1. Normalization of transcriptions,

  2. Segmentation and removal of silence and untranscribed parts of speech,

  3. Forced alignment between audio and transcription for two corpora (ALIP was not available in an aligned form, and TEDx Portuguese was available with a segmentation designed to optimize on-screen presentation),

  4. Specific processing of the ALIP and NURC-Recife corpora, for example, (i) maintaining the capitalization of letters indicative of names, to aid in the expansion of names, (ii) preserving the slash annotation, indicative of truncation in the speaker's speech, to aid in the identification of truncated audios, and (iii) discarding audios with duration of less than 0.3 seconds in NURC-Recife (the original duration of the corpus, 279 hours, dropped to 216 hours),

  5. Validation of audio-transcription pairs, via the web interface created in the project, so that the CORAA v1 corpus can be used for training ASR methods, and

  6. Evaluation of agreement between annotators and between annotators and the gold-standard annotation, performed by a trained annotator.

All corpora described in Section 2.2 were obtained from their respective official websites. After downloading, all transcripts were converted to .csv format and the organization of audio files was standardized. Additionally, due to the differences between the transcription rules of each corpus, text normalization was performed, as described in Section 3.1. Furthermore, as the ALIP corpus does not originally have alignment between the transcriptions and the audio files, we performed forced alignment between them. TEDx Portuguese has the alignment provided by the subtitles; however, this alignment is limited to 42 characters per line to optimize screen display and may not correspond to sentence boundaries, so we also performed forced alignment on TEDx Portuguese. We describe the forced alignment process for these two corpora in Section 3.2. The validation of the audio-transcription pairs is presented in Section 3.3, and the evaluation of agreement between annotators and between annotators and the gold-standard annotation is presented in Section 3.4. Finally, Section 3.5 presents the statistics of the five corpora that make up CORAA, after pre-processing.

3.1 Text Normalization

The four academic project corpora used their own transcription criteria. The oldest and most widely cited transcription standards are those of the NURC Project, which were used by NURC-Recife. NURC-Recife follows orthographic transcription, and its rules can be found in [Dino_Preti_1999]. During the NURC Digital project, NURC-Recife went through new processing steps, including quality verification of the digitized audio, manual alignment between audio and transcription, and spelling revision using a spell checker, all described by [Oliveira_2016].

The corpus C-ORAL Brasil I follows orthographic-based transcription criteria, but with some non-orthographic criteria implemented to capture grammaticalization or lexicalization phenomena [Raso_Mello_2009]. For example, there are aphereses (disappearance of a phoneme at the beginning of a word), reduced prepositions, absence of plural marks in noun phrases, cliticizations of pronouns, pre-verbal negation, and articulations of preposition with article.

The SP2010 project uses semi-orthographic transcriptions, with the following criteria: (i) no change in the spelling of words, as phonetic transcription is not used; (ii) no grammatical corrections; (iii) use of parentheses to indicate the deletion of /r/ in syllabic coda, of the syllable /es/ of the verb "estar" (to be), in all tenses and verb modes, and of the syllable "vo" of "você(s)" (you). Other deletions were not indicated with marks. Filled pauses, interjections, and conversational markers such as "right?", "okay?" were pervasively used.

The ALIP project follows the orthographic conventions of the written language, but uses capital initials only for proper names. The transcription annotates the following variable phenomena [Goncalves_2019]: (i) vowel raising in medial postonic contexts of nouns, as in "c[o]zinha ~ c[u]zinha", and of verbs, as in "d[e]via ~ d[i]via"; (ii) postonic raising and medial syncope, as in "pes.s[e].go ~ pes.s[i].go ~ pes.go"; (iii) gerund reduction, as in "canta[ndo] ~ canta[no]", a striking feature of São Paulo speech.

Results for variable phenomena of a morphosyntactic order include, for example, the realization of prepositions with and without contraction, as in "com a ~ cu'a ~ c'a" and "para ~ pra ~ pa". The corpus proposed a transcription system based on the NURC project and reports the transcription conventions grouped under the following criteria: (i) word spelling, which includes, for example, question and exclamation marks next to discursive markers and interjections, and the use of "/" for word truncations; (ii) prosodic elements, using an ellipsis for pauses, doubled colons for vowel lengthening, and question marks for questions; (iii) interaction, identifying the participants of the interaction and using square brackets for voice overlaps; (iv) transcriber's comments, where parentheses are used for hypotheses of what was heard and double parentheses for descriptive comments (laughter, for example).

Considering these differences between the transcriptions and seeking to maintain standardization, we performed the following normalizations on the texts of all CORAA corpora. Some normalizations were performed before validation (items (1), (2), (3)), and practically the entire list below was applied at the end of the whole process, since the ALIP and TEDx Portuguese corpora had their transcriptions revised (a minimal code sketch of these steps follows the list):

  1. Removal of extra annotations that do not belong to the alignment of transcripts and audios, such as annotations that indicate the speech of the interviewer and interviewee, truncations, laughter and extra information provided by the annotators of the projects that make up CORAA corpus;

  2. Normalization of texts to lower case;

  3. Removal of duplicate spaces;

  4. Expansion of acronyms for their forms of pronunciation (standardization applied after validation, to guarantee the expansion of all acronyms);

  5. Standardization of some uses of filled pauses, using a reduced set of these: ah, eh and uh. Some variations of these representations were replaced by the closest of the three above (e.g., hum, hm and uhm were replaced by uh; éh, ehm and ehn were replaced by eh; huh, uh and ã were replaced by ah);

  6. Expansion of cardinal and ordinal numbers, using the num2words library (https://github.com/savoirfairelinux/num2words);

  7. Percentage sign expansion (%) for its transcribed form (percentage);

  8. Removal of characters such as punctuation and non-language symbols (such as parenthesis and hyphen).
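
To make the list concrete, the following is a minimal sketch in Python of items 2, 3, 6, 7 and 8, assuming the num2words library cited above; the regular expressions are illustrative, not the project's exact rules.

```python
import re
from num2words import num2words

def normalize_transcription(text: str) -> str:
    text = text.lower()                       # item 2: lower case
    text = re.sub(r"\s+", " ", text).strip()  # item 3: duplicate spaces
    text = text.replace("%", " por cento")    # item 7: percentage sign
    # item 6: expand cardinal numbers (ordinals would use to="ordinal")
    text = re.sub(r"\d+", lambda m: num2words(int(m.group()), lang="pt_BR"), text)
    # item 8: drop punctuation and non-language symbols
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

print(normalize_transcription("Há 25 vagas (50% do total)!"))
# -> "há vinte e cinco vagas cinquenta por cento do total"
```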

It is important to note that the corpus still retains a great variety of filled pause forms, so that models can learn this variation, although this richness penalizes the evaluation of models trained with the CORAA v1 corpus, as detailed in Section 4.3.

3.2 Automatic Forced Alignment

As mentioned before, in the ALIP and TEDx Portuguese corpora the alignment between transcripts and audio was obtained using an automatic forced alignment method. For this, we used the tool Aeneas (available at http://www.readbeyond.it/aeneas). This tool requires the text to be segmented into sentences or excerpts.
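
As an illustration, a minimal Aeneas run through its Python API could look like the sketch below, based on the tool's documented usage; the file paths are placeholders, not the project's actual files.

```python
from aeneas.executetask import ExecuteTask
from aeneas.task import Task

# task_language=por: Portuguese; plain-text input; JSON sync map output
config = u"task_language=por|is_text_type=plain|os_task_file_format=json"
task = Task(config_string=config)
task.audio_file_path_absolute = "/data/interview.wav"      # audio to align
task.text_file_path_absolute = "/data/sentences.txt"       # one sentence per line
task.sync_map_file_path_absolute = "/data/alignment.json"  # output timestamps

ExecuteTask(task).execute()  # run the forced aligner
task.output_sync_map_file()  # write the sync map to disk
```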

In the ALIP corpus, the text was segmented using the annotations of pauses or hesitations, indicated by ellipses ("…"), and turn shifts between speakers, indicated by a line break followed by the next speaker's identification abbreviation, present in the original annotated corpus.

In the TEDx Portuguese corpus, the segmentation of text into sentences was performed using the punctuation present in the subtitles, when available. A maximum limit of 30 words was defined for each sentence and, when this limit was reached, the sentence was divided at the punctuation point before the limit. In the case of no punctuation, the sentences were divided in an arbitrary way, for example, at silent passages, passages with music, or based on variations in speech rate.
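
A simplified sketch of the 30-word rule is shown below; it splits at sentence-final punctuation and caps segments at 30 words, while the project's additional heuristics (silence, music, speech-rate variations) are not reproduced.

```python
import re

MAX_WORDS = 30  # maximum sentence length used for TEDx Portuguese

def segment(text: str) -> list:
    """Split subtitle text at sentence-final punctuation, capping at MAX_WORDS."""
    segments, current = [], []
    for token in text.split():
        current.append(token)
        # close the segment at ., !, ? or when the word cap is reached
        if re.search(r"[.!?]$", token) or len(current) >= MAX_WORDS:
            segments.append(" ".join(current))
            current = []
    if current:
        segments.append(" ".join(current))
    return segments
```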

3.3 Human Validation via Web-based Platform

The validation of audio-transcription pairs was performed in a simple web interface through two tasks: binary annotation (VALID / INVALID) and transcription, either to correct automatic alignment effects, as was the case for the ALIP corpus, or to review previously made manual transcripts, as was the case for the TEDx Portuguese corpus.

The binary annotation was carried out by listening to the audio file, as many times as necessary, and reading the original transcription. The annotation was binary, that is, the pairs were classified as valid or invalid, and the annotator had to point out the reason for the choice, which itself served as a guide for the decision.

There are 3 main reasons an audio is considered invalid:

  1. Voice overlapping;

  2. Low volume of the main speaker’s voice, making the audio incomprehensible;

  3. Word truncation.

There are also 3 reasons for considering a transcript invalid, i.e., not aligned with the audio:

  1. Too many words in the transcript;

  2. Too few words;

  3. Words swapped.

The following options were given to validate an audio/transcript pair:

  1. Valid without problems.

  2. Valid with filled pause(s).

  3. Valid with hesitation.

  4. Valid with background noise/low voice but understandable.

  5. Valid with little voice overlapping.

In cases where there is an audio with hesitation but the transcription does not correspond to the pauses made, the pair must be invalidated. After one pair has been annotated, another is provided, and this process continues until the annotator chooses to stop or disconnect.

In the web interface for validation, the transcription task has a screen composed of the original transcription, a player for the audio file that can be repeated as many times as necessary, an editing window initially filled with the original transcription, which is used by the annotator to transcribe, and a button to send the transcription. To complete the task of transcribing an audio, the annotator must listen to the audio.

The annotator must also analyze whether the audio fits into any of the following categories: music, clapping, word truncation, loud noise, a language other than Portuguese, very low voice, incomprehensible voice, foul words, hate speech, or a loud second voice. If so, the annotator should insert the symbols "###" (denoting invalid audio) in the edit window and send the response. As our focus was on BP, annotators were instructed to discard European Portuguese audios during most of the project; nevertheless, we decided to keep 4.69 hours of European Portuguese.

The annotators were instructed to comply with the following eight guidelines:

  1. Do not normalize the following signs of orality heard in the audio to their standard written forms: "tá/tó, né, cê, cês, pro, pra, dum, duma, num, numa".

  2. Transcribe filled pauses, such as “hum, aham, uh” as heard.

  3. Transcribe repetitive hesitations such as “da da”, or “do do” as heard.

  4. Write numbers in full form.

  5. Letters that appear alone should be spelled out.

  6. Acronyms and abbreviations should be transcribed in full form, using the English alphabet for those in English and the Portuguese alphabet for those that appear in Portuguese.

  7. Foreign words should be transcribed normally, in the language in which they appeared.

  8. Punctuation and capitalization could be applied, as normalization is performed in the post-processing phase.

3.4 Kappa Evaluation: Subjectivity of the Human Annotation

The validation of audio-transcription pairs of the CORAA v1 corpus, using the binary annotation and transcription tasks (see Section 3.3), was performed from October 2020 to July 2021, when the database export was generated.

The number of annotators varied over the project duration. In total, 63 different annotators performed the validation; they can be divided into 4 main annotation groups according to each annotator's start and end dates on the project. Two groups validated the corpora for 3 months in 2020 (October to December), with some annotators in these groups continuing the validation in 2021. There was a 1-month annotation task force during December 2020. The final group started the CORAA v1 validation work in May 2021 and ended in July 2021.

Each group attended a lecture on the validation process, read the tutorials for the two tasks (annotation and transcription) and received instructions to resolve doubts via the project email throughout the process.

At the beginning of the validation process, from October to December 2020, each audio-transcription pair was annotated by two or three annotators, so that we could use majority voting to export the data, discarding divergent pairs, in this initial phase of learning how to validate. Agreement between annotators was calculated in two ways: between annotators who annotated the same pairs (Section 3.4.1) and against a gold-standard annotation of samples from all datasets, performed by a project member (Section 3.4.2).

3.4.1 Kappa among Annotators

Two Fleiss' kappa values were calculated for the annotation from October to December 2020, separating the groups of annotators. The project started with two groups in October, totaling 28 annotators; with the entry of a new group on November 23, 2020, the number of annotators rose to 63. Thus, it was decided to calculate one kappa value for each annotation period: from October 1st to November 23rd, and from November 24th to December 31st, 2020. The hypothesis was that the annotation would become easier, with higher agreement, as practice increased. However, another variable influenced the agreement: the different transcription rules of each corpus in CORAA (see Section 3.1). We calculated the Fleiss' kappa agreement twice, once considering only pairs with two annotators and once considering only pairs with three annotators, according to the total number of annotators of a given audio. The values are shown in Table 2.

                     | 1/10 - 23/11                | 24/11 - 31/12
                     | 2 annotators | 3 annotators | 2 annotators | 3 annotators
Number of pairs      | 6,785        | 29,835       | 26,974       | 4,224
Number of annotators | 25           | 25           | 51           | 51
Kappa values:
C-ORAL Brasil I      | 0.394        | 0.353        | -            | -
SP-2010              | 0.420        | 0.394        | -            | -
NURC-Recife          | -            | -            | 0.317        | 0.314
Total                | 0.391        | 0.392        | 0.317        | 0.314

Table 2: Kappa values for each dataset in two annotation periods, separated by number of annotators. In the last three months of 2020, the order of validation of the corpora was C-ORAL Brasil I, SP-2010 and NURC-Recife.

There are absent values in the table because the corresponding corpus was not being annotated in the referred period. The considerable disagreement between the annotators revealed the task to be more subjective than previously imagined. By manually comparing audios on which annotators agreed with audios on which they disagreed, some points became clear: (i) the human ear naturally tends to complete truncated words, so different annotators may disagree on whether an audio is in fact truncated; (ii) background noise level and voice pitch (low/high) are very subjective concepts, and different people are expected to consider different noise levels as tolerable; (iii) annotators from different regions of the country tend to understand more or less of an audio according to their own accent, which can also be a source of disagreement.
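
For reference, agreement values like those in Table 2 can be computed with the Fleiss' kappa implementation in statsmodels; this is our choice of library for illustration, not necessarily the project's, and the ratings below are toy data.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Toy binary annotation matrix: rows = audio-transcription pairs,
# columns = annotators, values = 1 (valid) / 0 (invalid).
ratings = np.array([
    [1, 1, 1],
    [1, 0, 1],
    [0, 0, 0],
    [1, 1, 0],
])

counts, _ = aggregate_raters(ratings)  # per-item category counts
print(fleiss_kappa(counts))            # Fleiss' kappa over all items
```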

3.4.2 Kappa for the gold-standard annotation

The gold standard was built to maintain the representativeness of all validated corpora, and all participating annotators, according to the following process:

  1. For each annotated corpus, we generated a list of all annotators in that corpus;

  2. For each name present in the list, five pairs annotated by that annotator were randomly selected (annotators with fewer than 5 pairs annotated per corpus had their pairs discarded);

  3. The selected pairs were duplicated and annotated by an experienced annotator of the project, creating a gold-standard annotation with the following distribution:

    • ALIP: 15 annotators and 75 pairs;

    • C-ORAL Brasil I: 24 annotators and 120 pairs;

    • NURC-Recife: 55 annotators and 275 pairs;

    • SP-2010: 25 annotators and 125 pairs;

    • TEDx Portuguese: 50 annotators and 250 pairs;

    • Total: 845 pairs (520 from the binary annotation task and 325 from the transcription task).

The consensus pairs between the annotators were included in the exported dataset, that is, pairs which the absolute majority chose to validate. Thus, we analyzed the degree of agreement of the annotators taken together (exported values) against the gold-standard corpus. The value obtained was 0.514, indicating "moderate agreement" according to [landis_kock_1977]. Even though the task is subjective, the final result obtained from the annotation of the exported pairs was satisfactory.
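
A minimal sketch of this export rule, i.e., keeping only pairs on which an absolute majority of annotators agree, is given below (toy data; not the project's export code).

```python
from collections import Counter

def export_decision(votes):
    """Return the majority label, or None when there is no absolute majority."""
    label, n = Counter(votes).most_common(1)[0]
    return label if n > len(votes) / 2 else None  # ties/divergent pairs discarded

print(export_decision(["VALID", "VALID", "INVALID"]))  # -> VALID
print(export_decision(["VALID", "INVALID"]))           # -> None (discarded)
```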

3.5 Datasets Statistics

Overall, CORAA has 290.77 hours of validated audios, with at least 65% of its content in the form of spontaneous speech. We will refer to the processed versions of the corpora in CORAA as sub-datasets. The NURC-Recife sub-dataset includes conference and class talks, considered prepared speech (see Table 1). Currently, no other dataset for BP includes audios in this speaking style. Therefore, the ASR task is more challenging here than on other datasets. Another CORAA characteristic is the presence of noise in some of its sub-datasets, which is also more challenging for models trained for this task. Table 3 presents statistics for each validated sub-dataset in CORAA v1. The resulting set encompasses almost 1,700 speakers.

Audio durations range, on average, from 2.4 to 7.6 seconds according to the sub-dataset. Audios with more than 200 words or 40 seconds were automatically filtered out of the dataset. Figure 1 presents the estimated speaker distribution in each sub-dataset according to sex. Overall, the distribution is similar for males and females (in the corpus C-ORAL Brasil I, there is a balance regarding the number of uttered words: 50.36% of the words are uttered by 203 females and 49.64% by 159 males). Figure 2 presents audio duration distributions by sub-dataset. The audios are ranked by duration and their relative position (percentile) is shown on the horizontal axis, with audio durations on the vertical axis; percentiles are used to simplify sub-dataset comparisons. Figure 3 is similar, but presents the word distribution per dataset.

Corpus                | ALIP    | C-ORAL Brasil I | NURC-Recife | SP2010  | TEDx Port. | Total
Original (hrs)        | 78      | 21.13           | 279         | 65      | 249        | 692.21
Validated (hrs)       | 35.96   | 9.64            | 141.31      | 31.14   | 72.74      | 290.79
BP Speakers           | 179     | 362             | 417         | 60      | 671        | 1,689
Audios (segmented)    | 45,006  | 13,668          | 261,906     | 46,482  | 35,404     | 402,466
Audio Duration (sec.) | 2.90    | 2.46            | 1.94        | 2.43    | 7.55       | 3.39
Avg Tokens            | 53.910  | 60.079          | 20.418      | 48.118  | 166.369    | 41.546
Avg Types             | 6.391   | 7.188           | 3.733       | 6.002   | 8.807      | 5.581
Total Tokens          | 335,664 | 99,954          | 1,378,558   | 339,890 | 610,639    | 2,764,705
Total Types           | 14,189  | 8,715           | 41,903      | 12,351  | 27,469     | 58,237
Type/Token Ratio      | 0.042   | 0.087           | 0.030       | 0.036   | 0.046      | 0.022

Table 3: Statistics for each processed version of the projects included in CORAA v1 (hours in decimal)
Figure 1: Estimated Speaker Distribution by Sex
Figure 2: Duration distribution per sub-dataset
Figure 3: Word distribution per dataset

Regarding duration, the segmentation process plays a role in the obtained values. Only ALIP and TEDx Portuguese were automatically segmented; the other sub-datasets were manually segmented. For the automatic segmentation, the parameters were adjusted aiming at a better segmentation of informational units. ALIP audios had durations similar to those of the manually segmented datasets, while TEDx Portuguese audios tended to be longer. Speech style and genre also play a role in the obtained results: when pronunciation is faster and has fewer pauses, there are fewer places in the audio where the segmentation software is confident enough to break the utterances. TEDx Portuguese is the main source of prepared speech in CORAA and had the longest audios; the same applies to the word distribution, which is natural since the audios are longer. The remaining corpora presented similar distributions among themselves.

4 Baseline Model Development

We performed an experiment over the CORAA dataset in order to measure its quality, potential and limitations. Before this experiment, the dataset was divided into three subsets: train, development and test. Table 4 presents the approximate number of hours in these sets for each sub-dataset, as well as the number of speakers of each sex. Sub-dataset development sets were adjusted to have approximately 1 hour. Test sets were built in a similar manner, but with approximately 2 hours. This decision is supported by the work of [sheshadri-etal-2021-wer], which recommends that test sets have at least 2 hours. The NURC-Recife test set contains more than 3 hours of audio, because this sub-dataset has more speech genres than the others. All the audios from European Portuguese were included in the training set.

                | Duration (hrs)         | Num. Speakers (M—F)
Subset          | Train  | Dev  | Test   | Train    | Dev   | Test
ALIP            | 33.40  | 0.99 | 1.57   | 80—87    | 2—2   | 4—5
C-ORAL Brasil I | 6.54   | 1.13 | 1.97   | 138—181  | 9—9   | 12—13
NURC-Recife     | 137.08 | 1.29 | 2.94   | 295—296  | 2—1   | 3—3
SP2010          | 27.83  | 1.13 | 2.18   | 27—27    | 1—1   | 2—2
TEDx Portuguese | 68.67  | 1.37 | 2.70   | 532—364  | 4—4   | 7—7
Total           | 273.51 | 5.91 | 11.35  | 1072—955 | 18—17 | 28—30

Table 4: Statistics of the Train/Dev/Test partitions of each CORAA corpus.

4.1 Experiments

Our proposed experiment is based on the work of [gris2021brazilian]. These authors fine-tuned the Wav2Vec 2.0 XLSR-53 model [conneau2020unsupervised, baevski2020wav2vec] for ASR, using publicly available resources for BP. One of their experiments consisted of training on 437.2 hours of Brazilian Portuguese. Wav2Vec 2.0 is a model that learns a quantized latent space representation from audio by solving a contrastive task. First, the model is pre-trained in an unsupervised fashion on large datasets; then, it is fine-tuned for the ASR task using supervised learning. Wav2Vec 2.0 XLSR-53 is pre-trained on 53 languages, including Portuguese.

In our approach, Wav2Vec 2.0 XLSR-53 is fine-tuned on CORAA v1. We also evaluated the public fine-tuned model of [gris2021brazilian] against CORAA v1, using the sets presented in Table 4.

Using the proposed training, development and test divisions for CORAA v1, we trained the Wav2Vec 2.0 XLSR-53 model on CORAA v1 for 40 epochs. Similarly to the works of [conneau2020unsupervised] and [gris2021brazilian], we opted to freeze the model's feature extractor.

To train the model, we used the framework HuggingFace Transformers [wolf-etal-2020-transformers]. The model was trained on an NVIDIA TESLA V100 32GB GPU using a batch size of 8 and gradient accumulation over 24 steps. We used the AdamW optimizer [loshchilov2017decoupled] with a linear learning rate warm-up from 0 to 3e-05 in the first two epochs, followed by linear decay to zero. During training, the best checkpoint was chosen using the loss on the development set. The code used to perform the experiment, as well as the checkpoint of the trained model, are publicly available at https://github.com/Edresson/Wav2Vec-Wrapper.
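
As an illustration of this setup, the sketch below reproduces the hyperparameters above with the HuggingFace Trainer API. It is an approximation, not the project's actual code (which is in the repository above); the dataset objects and the CTC data collator are placeholders, and vocabulary/tokenizer construction is omitted.

```python
from transformers import Wav2Vec2ForCTC, TrainingArguments, Trainer

# Pre-trained multilingual checkpoint; the CTC head is newly initialized
# for fine-tuning (vocabulary construction omitted here).
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-xlsr-53")
model.freeze_feature_extractor()  # feature extractor frozen, as in the paper

args = TrainingArguments(
    output_dir="wav2vec2-xlsr53-coraa",
    per_device_train_batch_size=8,      # batch size 8
    gradient_accumulation_steps=24,     # gradient accumulation over 24 steps
    num_train_epochs=40,                # 40 epochs
    learning_rate=3e-5,                 # peak LR after warm-up
    warmup_ratio=0.05,                  # roughly the first 2 of 40 epochs
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,        # best checkpoint by dev loss
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=coraa_train,   # placeholder: prepared CORAA train split
    eval_dataset=coraa_dev,      # placeholder: prepared CORAA dev split
    data_collator=collate_fn,    # placeholder: CTC padding collator
)
trainer.train()
```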

4.2 Results and Discussions

Section 4.2.1 presents a comparison of our results with the work of [gris2021brazilian]. The models are tested on the entire test subset of CORAA v1 and on Common Voice version 7.0 (Portuguese audios). Therefore, our model is evaluated in-domain on the CORAA v1 test set, whose recording characteristics it was fine-tuned on, and out-of-domain on Common Voice, a dataset completely new to it.

Additionally, Section 4.2.2 focuses on evaluating the models in test sets of CORAA sub-datasets. This enables a more detailed analysis on factors such as audio quality and accents. Finally, Section 4.2.3 investigates the two speech styles: prepared or spontaneous.

4.2.1 In/Out of Domain Evaluation

Table 5 presents the comparison of our experiment with the work of [gris2021brazilian]. First, we performed an in-domain analysis of our model using the CORAA v1 test set. Then, our model was evaluated out-of-domain using the Common Voice test set. It is important to observe that, for the compared work, the analysis is mirrored, that is, CORAA v1 is the out-of-domain evaluation and Common Voice is the in-domain one.

                    | Common Voice   | CORAA          | Mean
Model               | CER   | WER    | CER   | WER    | CER   | WER
[gris2021brazilian] | 4.15  | 13.85  | 22.32 | 43.70  | 13.23 | 28.77
Our                 | 6.34  | 20.08  | 11.02 | 24.18  | 8.68  | 22.13

Table 5: Results for the in/out-of-domain analysis.

On the Common Voice dataset, as expected, the [gris2021brazilian] model performed better. Regarding WER, our model is less than 7 percentage points above their work. We also focus our analysis on the CER metric, because for shorter audios, with just a few words, this metric tends to be more reliable; in this scenario, our model is approximately 2 percentage points worse than the model of [gris2021brazilian]. On the other hand, on the CORAA dataset, our model presented much superior performance (more than 19 points in WER and 11 in CER). Furthermore, our experiment managed to generalize better to audio characteristics not seen during training, achieving a better average performance than [gris2021brazilian]. This is notable especially because the [gris2021brazilian] model was trained with approximately 147 hours of speech more than our model.

We believe that models trained with the CORAA v1 dataset generalize better than a model trained with existing publicly available datasets for BP due to the spontaneous speech phenomenon and the wide range of noise and different acoustic characteristics present in CORAA. Furthermore, accent can be a factor since the datasets used in the training of the [gris2021brazilian] model may not cover in depth all accents present in the CORAA v1.

4.2.2 Sub-dataset Analysis

There are important differences in the recording environment of each sub-dataset. Additionally, they also vary in accent. Table 6 presents the test performance for each CORAA v1 sub-dataset.

                | [gris2021brazilian] | Our
Sub-dataset     | CER   | WER         | CER   | WER
ALIP            | 33.72 | 59.30       | 17.30 | 34.06
C-ORAL Brasil I | 23.53 | 45.90       | 13.62 | 28.88
NURC-Recife     | 19.46 | 42.17       | 9.09  | 22.03
TEDx Portuguese | 9.75  | 22.69       | 7.43  | 19.36
SP2010          | 23.11 | 42.44       | 9.57  | 20.00

Table 6: Results on the CORAA test set for all sub-datasets.

Regarding the datasets, ALIP presented the greatest challenge for the models, on both the CER and WER metrics. We believe this occurred because the audios from ALIP present more noise than those of the other sub-datasets.

Regarding accents, we have mixed results. On one hand, our model presented similar performance on NURC-Recife and SP2010, which have two distinct accents (Recife and São Paulo city). On the other hand, C-ORAL Brasil I presented higher WER and CER than the other two. Two factors may have influenced this result. First, audio quality and noise presence tend to play a major role in model performance. Second, the C-ORAL Brasil I accent (Minas Gerais) has two characteristics that are difficult for models: speech rate is faster and there are more word agglutinations. As a consequence, the analysis was inconclusive for this accent, since the results are influenced both by the accent and by the speech rate.

Regarding the experiments, our model presented results varying from 19 to 34% in WER and from 7 to 17% in CER. On the other hand, [gris2021brazilian] presented higher error rates, which is expected considering their model had no previous contact with CORAA v1 audios during training.

4.2.3 Spontaneous vs Prepared Speech Analysis

Table 7 presents an analysis in which sub-datasets are merged according to speech style. The Spontaneous Speech column is obtained by merging ALIP, C-ORAL Brasil I, SP2010 and parts of NURC-Recife. The Prepared Speech column contains TEDx Portuguese and parts of NURC-Recife. As expected, the models perform better on prepared speech. However, for several ASR applications, spontaneous speech is more relevant (for example, ASR of phone calls and meetings). This can also be observed in Section 4.2.2, as TEDx Portuguese presented the lowest error rates.

Speech Style          Spontaneous Speech      Prepared Speech
Metric                CER       WER           CER       WER
[gris2021brazilian]   25.75     49.18         5.30      15.89
Our                   12.44     26.50         6.07      18.70
Table 7: Results for spontaneous vs. prepared speech.
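A breakdown like Table 7 can be reproduced by tagging each test utterance with its sub-dataset and pooling references and hypotheses per style before scoring. Below is a minimal sketch, again assuming jiwer for scoring; the STYLE map is illustrative and ignores that NURC-Recife splits between the two styles at the utterance level:

```python
# Sketch: pooling per-utterance ASR results by speech style before scoring.
import jiwer

# Illustrative mapping; NURC-Recife is partly prepared speech, which this
# simplification ignores.
STYLE = {
    "ALIP": "spontaneous",
    "C-ORAL Brasil I": "spontaneous",
    "SP2010": "spontaneous",
    "NURC-Recife": "spontaneous",
    "TEDx Portuguese": "prepared",
}

def per_style_scores(rows):
    """rows: iterable of (sub_dataset, reference, hypothesis) triples."""
    pooled = {}
    for subset, ref, hyp in rows:
        refs, hyps = pooled.setdefault(STYLE[subset], ([], []))
        refs.append(ref)
        hyps.append(hyp)
    # jiwer accepts lists of sentences and scores them jointly.
    return {
        style: (jiwer.cer(refs, hyps), jiwer.wer(refs, hyps))
        for style, (refs, hyps) in pooled.items()
    }
```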

4.3 Error Analysis

The current test dataset is composed of 13,932 audio-transcription pairs, totaling 11.63 hours (see Section 4), with samples from all CORAA v1 sub-datasets.

As this is the first time a dataset containing spontaneous speech samples has been used to train an ASR model for BP, we performed a more detailed analysis of our model's errors on a sample of the test dataset.

The 13,932 test pairs were ordered by the CER values of our model, both to illustrate the different types of errors and to analyze whether error types are related to CER values. The automatic transcriptions were analyzed using the typology of [Mota_2000], adapted to the task of evaluating ASR models.
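Operationally, this ordering amounts to scoring each pair individually and sorting by the result. A sketch follows, assuming jiwer for per-utterance CER and hypothetical field names for the test pairs:

```python
# Sketch: ranking test pairs by per-utterance CER for manual error analysis.
# `pairs` is a hypothetical list of dicts with "ref" and "hyp" fields.
import jiwer

def rank_by_cer(pairs):
    scored = [(jiwer.cer(p["ref"], p["hyp"]), p) for p in pairs]
    scored.sort(key=lambda item: item[0])  # ascending: easiest pairs first
    return scored

# The hardest pairs (highest CER) sit at the tail of this ranking, which is
# where intervals such as 13,510-13,932 in Table 9 come from.
```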

The typology used here to illustrate the model errors comprises 11 error types, grouped into 6 more general classes: Alphabetical, Lexical, Morphological, Language and Spontaneous Speech, Semantic, and Diacritic Placement errors. Below we describe the 11 error types with examples.

  • Alphabetical errors are errors in applying the alphabetic writing system.
    1) Alphabetical errors occur in three situations: when transcribing speech directly into writing, in complex syllables, or with ambiguous letters (“ce” versus “sse” or “sa” versus “za”, in Portuguese).
    An example of this type is related to the sound /k/ in Portuguese, which is represented by the letter “c” before some vowels and by “qu” before others. Thus, the use of “c” in place of “qu” reflects the speech/writing relationship.

  • Lexical errors occur in an excerpt transcribed by the ASR where there is:
    2) omission or addition of words;
    3) exchange of words.
    An example from our dataset of word addition in the automatic transcription is “que legal” instead of “legal”.
    Also from our dataset, an example of word exchange is “e que mais que a gente vida” instead of “e que mais que a gente viu”.

  • Morphological errors occur due to the violation of writing rules linked to the morphological structure of words. They result from:
    4) omitting morphemes (e.g., “come” written instead of “comer”);
    5) concatenating morphemes (e.g., “agente” instead of “a gente”, or “acasa” instead of “a casa”);
    6) separating morphemes, as in “de ele” written instead of the contraction “dele”.

  • Language and spontaneous speech errors are:
    7) words in English (or in a language other than BP) wrongly transcribed;
    8) filled pause errors (e.g., “é” versus “eh”), where the transcription and model response diverge;
    9) spontaneous speech errors (e.g., “tá” versus “está”; “té” versus “até”; “cê” versus “você”), in which the transcription and model response diverge.

  • Semantic errors occur when two words are spelled similarly but have different meanings.
    10) semantic errors (e.g., “Ela comprimentou o diretor assim que chegou.”, where the correct form would be “cumprimentou”).

  • Diacritic placement errors occur due to missing accent marks or improperly added ones. They are problematic because the five training corpora were built at different times, under different spelling rules for the Portuguese language; for example, the latest orthographic agreement for Portuguese came into force in Brazil in 2016.
    11) accent mark errors.

Table 8 shows examples of the 11 error types presented above (column 1), in which the original transcription (column 2) and the model response (column 3) diverge. The divergent words can be identified by comparing the two columns.

Error Type | Original Transcription | ASR Transcription
1    | uma maneira de saber o que e como o indivíduo identifica algo | uma maneira de saber o que e como o indivíduo identifiqa algo
2    | ou pra dar um apoio moral | ou pra dar um apoio im moral
3    | o outro foi morar um pouco mais longe | o outro prai morar um pouco mais longe
5, 4 | que lhe dão ora dor | ciridão ora do
5    | criança é mais coca cola* biscoito | criança é mais cocacola biscoito
6    | que levaria a uma resposta aquele estímulo / ah legal faz tempão | que levaria a uma resposta a quele estímulo / ah legal faz tem pão
7    | na teoria de osgood é que / de jazz | na teoria de osguot é que / de dez
8    | e essa daí eh / eh / eh / ham | e essa daí é / é / ahn / uhn
9    | pra área específica que é o curso diz que é um curso excelente | para área específica que é o curso diz que é um curso excelente
10   | entendeu era eles suavam mais a camisa pelo clube entendeu e | entendeu era eles soavam mais a camisa pelo clube entendeu e
11   | então é conhecer a população usuária do equipamento urbano | então e conhecer a população usuária do equipamento urbano

Multiple examples in a single row are separated by “/”.

* The lack of a hyphen in the test set is only for the calculation of CER/WER.

Table 8: Examples of the 11 different error types.

A sample of 938 audio-transcription pairs was analyzed, of which 134 contained errors in the reference transcription itself and thus were not framed by the typology. Another 309 pairs were flagged for deletion because their audio was compromised (truncation, very loud noise, or overlapping voices). The remaining 500 pairs were categorized according to the typology presented above, distributed over the CER ranges shown in column 1 of Table 9. Some pairs contain more than one error, and for some excerpts with high CER values only the most frequent error was annotated, although the transcription contained many more.

This initial analysis has already led to the decision to revise all pairs of the test dataset; this revision is currently being conducted and should result in a new version of CORAA in the future.

Table 9 shows, in the last column, the error types occurring in each range presented in column 1, with their frequencies in parentheses.

Interval (pairs ordered by CER) | Analysed Samples | Error Types (occurrences)
1 — 4,613       | 0   | —
4,614 — 8,397   | 110 | 1 (1), 2 (3), 4 (5), 5 (66), 6 (28), 8 (1), 9 (1), 10 (1), 11 (2)
8,398 — 10,724  | 10  | 2 (3), 3 (8), 4 (1), 5 (1), 7 (1)
10,725 — 11,991 | 10  | 2 (2), 3 (7), 8 (1)
11,992 — 12,666 | 10  | 2 (2), 3 (14), 4 (1), 5 (3), 9 (1)
12,667 — 13,049 | 25  | 2 (5), 3 (29), 4 (2), 5 (9), 6 (1), 8 (1)
13,050 — 13,336 | 10  | 2 (2), 3 (6), 6 (1), 7 (1), 9 (1)
13,337 — 13,509 | 10  | 2 (2), 3 (5), 5 (2), 6 (2), 8 (1)
13,510 — 13,932 | 315 | 3 (42), 6 (2), 8 (35), 9 (2)
Table 9: Intervals of CER and frequencies of the different error types.

The lexical error of type 3 (exchange of words) is the most frequent one, which is expected given that the task is automatic transcription and the training process of these models favors the recognition of frequent, well-formed words. Moreover, omission and addition of words (error type 2) is pervasive, appearing in all the intervals (even in the last one, where CER varies from 0.7 to 12, although it was not explicitly annotated there). The second and third most frequent errors are morpheme concatenation (type 5) and filled pause swaps (type 8). The latter is related to the fact that the CORAA dataset has a large percentage of spontaneous speech samples, in which both the number and the variety of filled pauses are high.
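These frequency claims can be checked directly from Table 9 by summing the occurrences per error type, as in the short tally below (counts transcribed from the table):

```python
# Sketch: aggregating Table 9 occurrence counts per error type.
from collections import Counter

# One dict per non-empty interval of Table 9: {error_type: occurrences}.
rows = [
    {1: 1, 2: 3, 4: 5, 5: 66, 6: 28, 8: 1, 9: 1, 10: 1, 11: 2},
    {2: 3, 3: 8, 4: 1, 5: 1, 7: 1},
    {2: 2, 3: 7, 8: 1},
    {2: 2, 3: 14, 4: 1, 5: 3, 9: 1},
    {2: 5, 3: 29, 4: 2, 5: 9, 6: 1, 8: 1},
    {2: 2, 3: 6, 6: 1, 7: 1, 9: 1},
    {2: 2, 3: 5, 5: 2, 6: 2, 8: 1},
    {3: 42, 6: 2, 8: 35, 9: 2},
]

totals = Counter()
for row in rows:
    totals.update(row)

# Most common: type 3 (111), type 5 (81), type 8 (39), type 6 (34), ...
print(totals.most_common())
```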

After this error analysis, the need for more normalization rules for filled pause representations became clear, so that model accuracy can increase.
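As an illustration of what such rules could look like (the variant list below is our own assumption, not the project's official normalization), common filled-pause spellings can be collapsed to a single canonical token before scoring:

```python
# Sketch: collapsing filled-pause variants to one canonical token so that
# "eh" vs "é" vs "ahn" mismatches stop inflating CER/WER.
# The variant list is illustrative, not the official CORAA rule set.
import re

FILLED_PAUSES = {"eh", "ehn", "ahn", "uhn", "ham", "hum", "uhum"}
_PAUSE_RE = re.compile(
    r"\b(" + "|".join(sorted(FILLED_PAUSES, key=len, reverse=True)) + r")\b"
)

def normalize_filled_pauses(text: str, token: str = "<pause>") -> str:
    return _PAUSE_RE.sub(token, text.lower())

print(normalize_filled_pauses("e essa daí eh ahn"))
# -> "e essa daí <pause> <pause>"
```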

5 Conclusions and Future Work

In this paper, we presented and made publicly available a new dataset called CORAA v1, with 290.77 hours of validated audio-transcription pairs, composed of public corpora in BP and TEDx talks in European and Brazilian Portuguese.

Relying on the cooperation among research centers, universities, private companies and the São Paulo Research Foundation (FAPESP), we made publicly available this new, large dataset for training BP speech recognition models, closing a gap in the previous datasets, namely the lack of spontaneous and informal speech used in conversations, dialogues and interviews. Informed by the error analysis, we are normalizing filled pause representations and performing a new validation of the test and development sets, in order to increase future model accuracy.

As for future work, we plan to augment CORAA with new corpora from the Tarsila Project (https://sites.google.com/view/tarsila-c4ai), such as Museu da Pessoa (https://museudapessoa.org/) and NURC-SP (https://nurc.fflch.usp.br/). We also plan to create an ASR challenge including CORAA v1 to further develop ASR research for the Portuguese language and to motivate young researchers in this exciting area. Finally, we plan to refine the normalization rules for filled pauses and deliver a new version of the CORAA dataset.

6 Acknowledgements

This research was funded by CEIA (http://centrodeia.org/) with support from the Goiás State Foundation (FAPEG grant #201910267000527), the Department of Higher Education of the Ministry of Education (SESU/MEC), Copel Holding S.A. (https://www.copel.com), and the Cyberlabs Group (https://cyberlabs.ai/). In addition, this research was carried out at the Center for Artificial Intelligence (C4AI-USP), with support from the São Paulo Research Foundation (FAPESP grant #2019/07665-4) and the IBM Corporation. This study was also financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior – Brasil (CAPES) – Finance Code 001. We would like to thank the Nvidia Corporation for the donation of the Titan V GPU used in CORAA-related projects. The coauthor Anderson da Silva Soares thanks CNPq for the Productivity Scholarship in Technological Development and Innovative Extension (grant 308808/2020-7). Finally, the authors would like to thank all members of the TARSILA project who contributed discussions and insights regarding the compilation of the CORAA v1 corpus.

References