Earnings-22: A Practical Benchmark for Accents in the Wild

by Miguel Del Rio, et al.

Modern automatic speech recognition (ASR) systems have achieved superhuman Word Error Rate (WER) on many common corpora despite lacking adequate performance on speech in the wild. Beyond that, there is a lack of real-world, accented corpora to properly benchmark academic and commercial models. To ensure this type of speech is represented in ASR benchmarking, we present Earnings-22, a 125-file, 119-hour corpus of English-language earnings calls gathered from global companies. We run a comparison across 4 commercial models showing the variation in performance when taking country of origin into consideration. Looking at hypothesis transcriptions, we explore errors common to all ASR systems tested. By examining Individual Word Error Rate (IWER), we find that key speech features impact model performance more for certain accents than others. Earnings-22 provides a free-to-use benchmark of real-world, accented audio to bridge academic and industrial research.






1 Introduction

Automatic speech recognition systems are utilized in domains ranging from finance to media by providing a rich transcription of the original audio. However, these systems are often error-prone when tasked with transcribing audio with varied features, such as accents, noise, and unique vocal characteristics. Part of the reason for this inaccuracy is that the types of data that are used to train and prepare these models are not always indicative of their use case. For example, an ASR model trained exclusively on American-accented speech using a high-quality microphone is likely to perform poorly if used to transcribe speech from a person speaking Indian-accented English. One way to help identify these issues is to evaluate the models using metrics such as WER, IWER [goldwater2010words], and other word characteristics to examine the specific ways in which the models fail. Evaluation datasets that provide a variety of accented speech, along with high quality reference text, are often difficult to obtain. These difficulties arise from having multilingual speakers, non-English disfluencies, and accented speech, to name a few. Company earnings calls are a great source of this type of data, as they provide a large amount of speech variety from having many different speakers, differing accents, and complex domain terminology. Although they contain a concentration of financial jargon, earnings calls still provide a broad coverage of real-world topics. In this paper, we present a compiled dataset of 119 hours that covers 7 regional accents and is freely available to the public. We perform WER analysis using several industry ASR models and compare their performance across accent regions and word characteristics. In Section 2, we describe the data properties and collection methodology. In Section 3 and Section 4, we provide an initial analysis of accuracy disparities between regional accents. Finally, we conclude with a call to action to promote improved accent bias benchmarking in the ASR field.

2 The Earnings-22 Dataset

The Earnings-22 benchmark dataset (available on GitHub at https://github.com/revdotcom/speech-datasets/tree/master/earnings22) is developed with the intention of providing real-world audio focused on identifying bias in ASR systems. Our attention focused on aggregating accented, public English-language earnings calls from global companies; the fair use legal precedent for earnings calls is Swatch Group Management Services Ltd. v. Bloomberg L.P. We collected a total of 125 earnings calls, totalling 119 hours, downloaded from various sources. Most calls are from https://seekingalpha.com/; a few come directly from company websites: https://www.mtn.com.gh/ and https://transcorpnigeria.com/. The earnings calls in the Earnings-22 corpus are sourced from a total of 27 unique countries, which we categorize into 7 regions defined in Table 1.

Region | Countries
African | Nigeria, Ghana
Asian | Indonesia, Turkey, India, Japan, South Korea, China
English | United Kingdom, Canada, Australia, United States, South Africa
Germanic | Denmark, Sweden, Germany
Other Romance | France, Italy, Greece
Slavic | Russia, Poland
Spanish / Portuguese | Argentina, Brazil, Chile, Spain, Colombia, Portugal
Table 1: Countries included in each language region. See Section 2.2 for further explanation of how these regions were defined.

To produce a broad range of speakers and accents, we focused our efforts on finding unique earnings calls from global companies. Properly labeling speaker accents is a difficult task: it requires language experts to rate accents and techniques to deal with any disagreements, and the rating itself adds implicit bias. We opted to instead follow the method used in [oneill21_interspeech] and associate an earnings call with the country where the company was headquartered. To ensure diversity in the call selection, we opted to randomly select 5 calls from as many countries as were available to us. The only exceptions were calls from Ghana and Nigeria, which we actively sought out to improve the coverage of African accents in this dataset. Despite best efforts for these accents, we were only able to find 1 Nigerian and 4 Ghanaian earnings calls. Although these countries were the least represented in the dataset, their inclusion was crucial to improve the overall analytical value of the corpus.

2.1 Creating and Preparing the Transcripts

To ensure high quality transcripts, we submitted our files to our own human transcription platform. Once completed, the quality of each transcript was also verified by a separate group of graders. Following our work in [delrio21_interspeech], we chose to produce verbatim transcriptions (for more information on Rev.com’s verbatim transcription, see https://www.rev.com/blog/resources/verbatim-transcription) to best model real human speech. We processed the transcripts produced by Rev.com and removed atmospheric information (for example, notes about background music, coughs, and other non-speech noises) using our internal processing tools. These files were then converted into our file format that tokenizes the transcript and contains metadata tagged by our Named Entity Recognition (NER) system.

2.2 Defining Regions

Due to the large number of countries but varying range of representation, the regions we defined aggregate several countries to make bias analysis more practical and conclusive (we show some statistics of those regions in Table 2). The primary region grouping we use is defined with a mixture of language family features and geographical location. Due to the overwhelmingly large number of Spanish-speaking countries in the dataset compared to other Romance languages, we felt the separation of Spanish / Portuguese and Other Romance was necessary to get a better view on the accents. Portuguese was grouped in with Spanish because of its linguistic similarity, as both are part of the Ibero-Romance group of Romance languages [spanish_portuguese]. Previous work has found that, despite the more distant genetic relationship between Greek and Romance languages, their proximity and long term contact has resulted in significant association both lexically and phonologically [greekincontact]. As a result, we felt Greek calls best fit in the Other Romance region. We chose to split South African earnings calls from the African region to follow the distinctions made in previous works [accents_of_english, van_rooy_2020], which denote South African English as more closely resembling other countries in the English region than those in the African region. A resource comparing different accents, maintained by the author of [accents_of_english], can be found at https://www.phon.ucl.ac.uk/home/wells/accentsanddialects/.

Language Region | Time (in Hours) | Number of Files
English | 22.85 | 26
Asian | 25.27 | 28
Slavic | 7.72 | 10
Germanic | 13.53 | 12
Spanish / Portuguese | 28.87 | 31
Other Romance | 15.61 | 13
African | 5.06 | 5
Table 2: The Earnings-22 corpus summarized by our defined linguistic regions, composed by considering region and language family.
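As a quick sanity check, the per-region figures in Table 2 can be summed to confirm that they agree with the corpus totals quoted earlier (125 files, roughly 119 hours). The snippet below is only an illustrative check; the dictionary layout is our own and is not part of the released dataset.

```python
# Hours and file counts copied from Table 2 (region -> (hours, files)).
regions = {
    "English": (22.85, 26),
    "Asian": (25.27, 28),
    "Slavic": (7.72, 10),
    "Germanic": (13.53, 12),
    "Spanish / Portuguese": (28.87, 31),
    "Other Romance": (15.61, 13),
    "African": (5.06, 5),
}

total_hours = sum(hours for hours, _ in regions.values())
total_files = sum(files for _, files in regions.values())

# The regional rows add up to the 125 files and about 119 hours of the corpus.
print(f"{total_hours:.2f} hours across {total_files} files")
```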

3 WER Analysis on Earnings-22

To fully showcase the dataset and its characteristics, we provide an initial benchmark along the accent dimension. We used four cloud ASR providers to submit our evaluation audio and obtain hypothesis transcripts. Two Rev.ai models were tested: a Kaldi-based model and an end-to-end model.

3.1 Provider and Regional File Breakdown

Figure 1 displays the average WER for each provider across accent regions. Here we aggregate by micro-averaging, so that longer files are weighted more heavily than shorter files.
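To make the aggregation choice concrete, the following minimal sketch contrasts micro-averaging (pooling errors and reference words before dividing) with macro-averaging (averaging per-file WERs). The `FileResult` structure and the numbers are hypothetical and chosen only for illustration; they are not taken from the benchmark.

```python
from dataclasses import dataclass


@dataclass
class FileResult:
    """Per-file scoring summary (hypothetical structure for illustration)."""
    errors: int      # substitutions + deletions + insertions
    ref_words: int   # number of words in the reference transcript


def micro_wer(files):
    """Pool errors and reference words over all files, then divide once.

    Longer files contribute more words, so they carry more weight.
    """
    total_errors = sum(f.errors for f in files)
    total_words = sum(f.ref_words for f in files)
    return total_errors / total_words


def macro_wer(files):
    """Average the per-file WERs, giving every file equal weight."""
    return sum(f.errors / f.ref_words for f in files) / len(files)


# Made-up example: one long, fairly clean file and one short, noisy file.
example = [FileResult(errors=100, ref_words=1000), FileResult(errors=30, ref_words=100)]
print(f"micro WER = {micro_wer(example):.3f}, macro WER = {macro_wer(example):.3f}")
```

With micro-averaging, the short noisy file moves the regional WER far less than it would under macro-averaging.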

Figure 1: WER by Region and Provider

These results show that (1) ASR seemingly performs best on the English and Germanic regions relative to the other language regions and (2) ASR performs worst on the Asian and Other Romance regions relative to the other language regions. For English and Germanic, this distinction makes sense, as English is a Germanic language, and most modern ASR models are trained on English-region data. We speculate that the excellent performance on German may be due to highly skilled second-language speakers who may articulate more clearly and speak more slowly than first language speakers, a result also noted in [miller2021corpus]. The results on other regions echo the linguistic distance results of [chiswick2005linguistic]. For example, the Asian region has the biggest WER gap from the English region. This is reinforced by the word-level analysis in Section 3.2. Moreover, we observe statistically significant differences in Section 4.

3.2 Regional Word-Level Breakdown

Table 3: Word-level errors across all files. Each cell has the format (‘word’, number of files in which the word occurred). Word categories in the legend: Contraction, Financial Jargon.

We performed two methods of exploratory word-level error analysis. The first method involved combining word-level results for each file and provider separately and analyzing the words at the most granular level. The criteria for a word in this analysis were an F1 score of less than 0.3 and five or more occurrences within the transcript. With these criteria, we sought to identify the most difficult words for the ASR systems. From this, we found that a couple of domain-specific finance words were commonly incorrect, as well as a few contractions, whose processing often causes trouble for ASR systems [goldwater2010words]. The results of this analysis are shown in Table 3. The second method was a word-level analysis broken down by region. We used the same F1 and frequency criteria as above, but also removed any words that met the criteria in every region. By doing so, we hoped to identify the words that ASR systems struggled with in each particular region. The results of this analysis are shown in Table 4.
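The filtering criteria above can be sketched as follows. The text does not spell out how the per-word F1 score is computed, so this sketch assumes a conventional definition: precision is the fraction of a word's hypothesis occurrences that were aligned correctly, and recall is the fraction of its reference occurrences that were recognized. The function names, the input layout, and the counts are all hypothetical.

```python
def word_f1(correct, ref_count, hyp_count):
    """F1 for a single word, under the assumed precision/recall definition."""
    if correct == 0:
        return 0.0  # also covers words that never appear in the hypothesis
    precision = correct / hyp_count
    recall = correct / ref_count
    return 2 * precision * recall / (precision + recall)


def difficult_words(stats, max_f1=0.3, min_occurrences=5):
    """Return words with F1 below `max_f1` that occur at least `min_occurrences` times.

    `stats` maps word -> (correct, ref_count, hyp_count) for one transcript.
    """
    return sorted(
        word
        for word, (correct, ref_count, hyp_count) in stats.items()
        if ref_count >= min_occurrences
        and word_f1(correct, ref_count, hyp_count) < max_f1
    )


# Made-up counts for illustration only.
stats = {
    "ebitda": (1, 8, 2),      # frequent but rarely right: F1 = 0.2
    "revenue": (40, 42, 41),  # frequent and usually right
    "capex": (0, 3, 0),       # always wrong, but too rare to qualify
}
print(difficult_words(stats))
```

The regional variant described above would run the same filter per region and then drop any word that is flagged in every region.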

Table 4: Word-level errors by region and file. Each cell has the format (‘word’, number of files in which the word occurred). Word categories in the legend: Name, Contraction, Disfluency, Abbreviation/Acronym.

In this table, we see that the most frequent errors in the Asian and Other Romance regions are common words as opposed to names, abbreviations, or acronyms. In the other regions, we see ASR systems struggle most on these kinds of specific terminology. This finding hints that the degradation of WER seen across region groups is not simply due to more industry-specific jargon or terminology, but poor recognition around common words spoken in regional accents.

4 Exploring Model Bias

In Section 3.1, we found that all providers are impacted when we consider the language region. Although these results indicate clear average WER differences, we don’t know for certain that these differences are significant as opposed to natural variations between files and models. In this section we run statistical tests and demonstrate model bias. In particular, we want to understand: (1) is the difference in WER statistically significant and (2) are there features of the transcripts that impact the model more than others. Though we previously considered multiple models, our statistical efforts focus on our newest Rev.ai V2 model. We do this because it is the best performing of those we tested and because we know the most details about the pipeline used to train it.

4.1 Measuring Significance

In the following experiments, we use a Monte Carlo Permutation Test [10.5555/1196379] to evaluate our hypotheses. In our implementation, we compare two groups $A$ and $B$. After selecting some metric, $M$, to evaluate those groups, we define

\[ \Delta = |M(A) - M(B)| \]

as the absolute difference between the metric evaluated on the groups. We generate $n$ samples, $\Delta_1, \dots, \Delta_n$, such that for the $i$-th sample, we define the subset $A_i$ as a random permutation of size $|A|$ from the set $A \cup B$ and define $B_i$ as $(A \cup B) \setminus A_i$. The sample $\Delta_i$ is then

\[ \Delta_i = |M(A_i) - M(B_i)| \]

After generating $n$ samples, the significance level is defined as the proportion of the samples that have a difference equal to or larger than $\Delta$.
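The procedure described in this section can be sketched in a few lines of Python. This is a minimal illustration under our own naming (`permutation_test`, `metric`, `n_samples`), not the implementation used for the reported results; the metric is passed in as a function so the same routine could be applied to WER or IWER.

```python
import random


def permutation_test(group_a, group_b, metric, n_samples=10_000, seed=0):
    """Monte Carlo permutation test on the absolute difference of a metric.

    Returns the proportion of random re-splits of the pooled items whose
    metric difference is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(metric(group_a) - metric(group_b))
    pooled = list(group_a) + list(group_b)
    size_a = len(group_a)

    at_least_as_large = 0
    for _ in range(n_samples):
        rng.shuffle(pooled)
        # The first |A| shuffled items play the role of A; the rest are B.
        sample_a, sample_b = pooled[:size_a], pooled[size_a:]
        if abs(metric(sample_a) - metric(sample_b)) >= observed:
            at_least_as_large += 1
    return at_least_as_large / n_samples


def mean(values):
    return sum(values) / len(values)


# Two clearly separated toy groups give a small p-value.
low = [0.08, 0.10, 0.09, 0.11, 0.10, 0.09, 0.08, 0.10]
high = [0.20, 0.22, 0.19, 0.21, 0.23, 0.20, 0.22, 0.21]
print(permutation_test(low, high, mean, n_samples=2_000))
```

A small return value means that random re-groupings rarely reproduce the observed gap, which is the evidence used to call a difference significant.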


4.2 Model Regional Bias

As we see in Figure 1, the Rev.ai V2 model performs differently in every language region. Since we know this model is trained on data from countries that predominantly belong to the English region, we expect that the differences we’ve observed should result in low p-values if the model is truly biased. In this section, we compare each region to the English region. If any region is found to be statistically different, this would imply that the model is significantly impacted by the defined region.

4.2.1 Experimental Setup

We follow the experimental setup defined in Section 4.1, and define group $A$ to be the files in our baseline English region while group $B$ is the files in the region we want to evaluate. Our metric, $M$, for these experiments is the WER over the whole region. Unique to this setup is that instead of generating different permutations of the files, we took a more restrictive approach and generated combinations of files to ensure that each sample was unique (we also ran tests using permutations but found that the results are essentially identical as the number of samples increases). Finally, we generate 100,000 samples for each experiment and report the p-values in Table 5.
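One way to realize the "unique combinations" variant is to draw subsets of file indices and discard any subset that has already been used. The sketch below assumes each file is summarized as an `(errors, reference_words)` pair and uses rejection sampling to enforce uniqueness; both choices are ours for illustration and may differ from the actual implementation.

```python
import math
import random


def region_wer(files):
    """Micro-averaged WER over a list of (errors, reference_words) pairs."""
    return sum(errors for errors, _ in files) / sum(words for _, words in files)


def unique_combination_test(baseline, region, n_samples=1_000, seed=0):
    """Permutation-style test in which every sampled split is a distinct combination."""
    rng = random.Random(seed)
    pooled = list(baseline) + list(region)
    k = len(baseline)
    observed = abs(region_wer(baseline) - region_wer(region))

    # Never ask for more unique splits than actually exist.
    n_samples = min(n_samples, math.comb(len(pooled), k))

    seen = set()
    hits = 0
    while len(seen) < n_samples:
        chosen = frozenset(rng.sample(range(len(pooled)), k))
        if chosen in seen:
            continue  # already used this split, draw again
        seen.add(chosen)
        sample_a = [pooled[i] for i in chosen]
        sample_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        if abs(region_wer(sample_a) - region_wer(sample_b)) >= observed:
            hits += 1
    return hits / n_samples


# Made-up per-file counts for two small regions.
baseline_files = [(90, 1000), (110, 1200), (80, 900)]
other_files = [(200, 1000), (260, 1100), (180, 950)]
print(unique_combination_test(baseline_files, other_files))
```

Rejection sampling is simple but slows down as the requested number of samples approaches the total number of possible combinations; enumerating combinations directly would be preferable in that regime.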

Region | p-value
African | 0.264
Asian | 0.004**
Germanic | 0.928
Other Romance | 0.012*
Slavic | 0.148
Spanish / Portuguese | 0.035*
Table 5: Monte Carlo Permutation Test results comparing the English region to every other region we’ve defined. In the table above, p-values marked with * are statistically significant at the 0.05 level while those marked with ** are significant at the 0.005 level.

4.2.2 Results and Discussion

We find that the Other Romance and Spanish / Portuguese regions are statistically different from the English region at the 0.05 level while the Asian region is statistically different from the English region at the 0.005 level. These results reaffirm our findings in Section 3.1, showing that the three worst regions are statistically different from the English region. As previously mentioned, the work in [oneill21_interspeech] found that assuming accents from corporate headquarters seemed to agree with a manual verification of a random sample of earnings calls. Given our results and previous work, we believe that these regional groupings do reflect the presence of accented speech that causes our model to struggle.

4.3 Transcript Features

At Rev, we’re particularly confident of our ASR model’s ability to provide verbatim transcripts – in particular, we know that our model is capable of capturing disfluencies like filled pauses and word fragments like false-starts. These words are notably harder than the average word to capture due to their relative infrequency but diverse methods of production. That being said, the work in [goldwater2010words] showed that filled pauses did not have any impact on the recognition of surrounding words while fragmented words do impact a model when comparing the IWER between groups. We explore these features, conditioning on the linguistic differences defined by the various Language Regions.

4.3.1 Experimental Setup

We once again apply the same experimental setup described in Section 4.1. Group $A$ is defined as the key words to test while group $B$ is all other words in the region. Our metric, $M$, for these experiments is IWER over the different groups. To define the groups in these experiments, we take advantage of Rev.com’s style guide rules that provide fixed structure to key transcript features (https://cf-public.rev.com/styleguide/transcription/Rev+Transcription+Style+Guide+v4.0.1.pdf). Filled Pauses (FP): in all our experiments, this group is defined as all uh and um tokens in a given region. Word Fragments (Frag): due to the variability in the format that word fragments can take, we were unable to specify a list of tokens as fragments. Using the “verbatim” aspect of our transcripts, we instead define word fragments as all tokens that end in a “-”. For each experiment we generate 100,000 samples and report the p-values in Table 6.
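The two feature definitions translate directly into simple token rules. The sketch below tags tokens and collects the positions immediately before or after each feature, which is one reading of the Before and After columns in Table 6. It assumes the transcript is already tokenized with punctuation separated from words; the helper names are hypothetical.

```python
FILLED_PAUSES = {"uh", "um"}


def token_kind(token):
    """Classify a token as a filled pause ('FP'), word fragment ('Frag'), or 'other'."""
    lowered = token.lower()
    if lowered in FILLED_PAUSES:
        return "FP"
    # Fragments are tokens ending in '-'; a bare '-' is not treated as a word.
    if len(lowered) > 1 and lowered.endswith("-"):
        return "Frag"
    return "other"


def neighbor_positions(tokens, kind, offset):
    """Indices of tokens located `offset` positions away from a token of `kind`.

    offset=-1 selects the word before each feature, offset=+1 the word after.
    """
    positions = set()
    for i, token in enumerate(tokens):
        j = i + offset
        if token_kind(token) == kind and 0 <= j < len(tokens):
            positions.add(j)
    return sorted(positions)


tokens = "we uh expect rev- revenue to um grow".split()
print(neighbor_positions(tokens, "FP", -1))   # words before a filled pause
print(neighbor_positions(tokens, "Frag", 1))  # words after a word fragment
```

The selected positions would form the tested group and all remaining words in the region the comparison group, with IWER as the metric in the permutation test.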

Region | Before FP | Before Frag | After FP | After Frag
African | 0.044* | 0.026* | 0.014* | 0.000**
Asian | 0.672 | 0.000** | 0.012* | 0.000**
English | 0.003** | 0.000** | 0.125 | 0.000**
Germanic | 0.000** | 0.000** | 0.770 | 0.000**
Other Romance | 0.000** | 0.000** | 0.001** | 0.000**
Slavic | 0.000** | 0.000** | 0.003** | 0.000**
Spanish / Portuguese | 0.000** | 0.000** | 0.000** | 0.000**
Table 6: Monte Carlo Permutation Test results comparing the key transcript features (FP: filled pauses; Frag: word fragments) in the language regions defined. In the table above, p-values marked with * are statistically significant at the 0.05 level while those marked with ** are significant at the 0.005 level.

4.3.2 Results and Discussion

Our findings in Table 6 both confirm and contrast with the work in [goldwater2010words]. Our model’s ability to recognize tokens before and after a word fragment is impacted by its occurrence. But in contrast to previous work, we find that filled pauses do impact our model’s recognition in general. Interestingly enough, for the English and Germanic regions, where the model performs best, we see that our model is capable of recognizing subsequent words well but not preceding words. We also see that the inverse is true of the Asian region, indicating higher performance on words that come before the filled pause. Though we leave further investigation of these differences to future work, we theorize that the different behaviors across regions occur because the distinct filled pauses [kosmala:halshs-03225622] used in each region cause model confusion.

5 Conclusion

Using Earnings-22, we’ve shown that, despite major WER improvements on test suites as a whole, the gaps in performance on accented speech in the wild still leave more to be desired. Using statistical testing, we further highlight that this discrepancy isn’t due to random noise but rather to a real underlying problem of ASR models. With the release of this new corpus, we hope to motivate researchers to work on the problem of real-world accented audio. We challenge all industry and academic leaders to find new techniques to improve model recognition on all voices to create more equitable and fair speech technologies.

6 Acknowledgments

We would like to thank the transcriptionists without whose help and hundreds of hours of work, none of this dataset release would be possible.