A Speech Test Set of Practice Business Presentations with Additional Relevant Texts

Dominik Macháček et al., 08/02/2019

We present a test corpus of audio recordings and transcriptions of presentations of students' enterprises, together with their slides and web pages. The corpus is intended for the evaluation of automatic speech recognition (ASR) systems, especially in conditions where prior availability of in-domain vocabulary and named entities is beneficial. The corpus consists of 39 presentations in English, each up to 90 seconds long. The speakers are high school students from European countries with English as their second language. We benchmark three baseline ASR systems on the corpus and illustrate their limitations.


1 Introduction

Nowadays, English is widely used as a lingua franca for communication between people without a common first language (denoted L1). Europe is populated by dozens of nations with distinct languages. When cooperation or interaction is needed, English often serves as a universal first foreign language (in other words, the second language a person learns, L2), even between neighboring nations with closely related national languages, e.g. Czech and Polish. At the same time, many people are still not able to use English and depend on translation services, which in turn often rely on human experts. We see an opportunity to boost the availability, speed and language coverage of skilled professional translators and interpreters with the help of machines.

In spoken communication, such as during business conferences, translation relies on speech comprehension. In Europe, there are as many varieties of L2 English as there are European languages, because many speakers have an accent derived from their L1. Commonly used corpora for training ASR systems are often based on audio recordings of English L1 speakers [6, 11], which may not be optimal for ASR of European L2 English. Furthermore, the outputs of current ASR systems depend heavily on the domain coverage of the training data and could be considerably improved by domain adaptation techniques. Also, the pronunciation of named entities from primarily non-English-speaking areas usually differs significantly between English L1 and L2 speakers. Large corpora of L1 speakers often do not cover these differences, and named entities are a major source of ASR errors and misunderstandings.

In certain situations, it is possible to prepare the ASR or spoken language translation (SLT) system for the specifics of a given talk and its speakers. This is because sessions such as conferences and meetings are often planned ahead of time, and additional relevant materials, such as presentation slides accompanying the talks or relevant websites, are available in advance.

With this in mind, we have created a corpus consisting of practice presentations of fictional student firms. The corpus contains audio recordings, transcriptions and additional relevant texts (presentation slides and web pages) of the participants. The audio recordings cover English L2 speakers with nine European L1s (cs, sk, it, de, es, ro, hu, nl, fi). Some of the practice firms' web pages are in English, some of them in local languages. Our corpus is suitable for the evaluation of ASR systems, both in settings with and without additional materials provided ahead of time.

In Section 2, we describe the methodology used to collect the corpus data. In Section 3, we describe the corpus and its possible applications for ASR systems. In Section 4, we present an evaluation of three distinct English ASR systems. We summarize related work in Section 5 and conclude in Section 6.

2 Methodology

In this section, we explain the methodology we followed when creating the corpus. We collected the data at an international trade fair of student firms (see Section 2.1), during a competition of business presentations (Section 2.2). We motivated the speakers to transcribe their own speech presentations by introducing the Clearest voice competition for valuable prizes (Section 2.3). Additionally, we collected documents related to the student firms (Section 2.4). Throughout the corpus creation, we adhered to ethical standards (Section 2.5).

2.1 Background of Data: Student Firms and Trade Fair

“Student firms” are mock companies established to practice running a real company. The participants who run them are high-school students, mainly from economically oriented schools or departments. The firms meet at trade fairs, where they practise promoting their fictional goods or services, issuing invoices for mock trades, and bookkeeping. They also compete in the aforementioned tasks and are evaluated by field professionals according to various criteria. The best firms advance to higher rounds of trade fairs, from regional through national to international.

We collected the data at an international trade fair held recently in the Czech Republic. The firms involved in our data collection were from 7 European countries. See Table 1 for a summary.

Country Firms
Czech Republic 18
Italy 8
Romania 4
Slovakia 3
Austria 2
Spain 2
Belgium 2
Total 39
Table 1: Number of student firms included in the corpus and their countries of origin.

The trade fair organizers provided us with the firms' presentation slides, which the students used during the fair. In many cases, we were also able to find the firms' web pages and included them in the corpus. See Section 3.3 for more details.

2.2 Presentation Competition

One of the activities during the fair in which students could participate was a competition of mock presentations of their businesses. The task was to promote the firm to a random stranger in an elevator. The maximum allowed duration of a presentation was 90 seconds, and no additional materials could be shown. The participants had to use English, and either one or two students were allowed to give the presentation. A professional three-member committee evaluated the presentations, considering various aspects of their content. The selected winners were awarded prizes for their performances.

We equipped the speakers with headset microphones to ensure the best possible recording quality. Despite that, loud background noise leaked into the recordings. On the one hand, this adds an extra obstacle for ASR; on the other hand, the recordings thus represent a real environment in which humans interact.

2.3 Manual Transcriptions

In order to obtain manual transcriptions of all the recordings, we asked the participants to transcribe their speech, given only their own recording. To motivate the students, we presented the task as an additional competition for valuable prizes. The objective of this competition was to find out who has the “clearest voice” for ASR. We processed the recordings with English ASR systems, evaluated them and awarded the students based on their respective ASR recognition scores. See Section 4 for more details.

The quality of the transcription was one of the major factors in the competition (together with clarity of speech), because the students had no access to any ASR outputs and had to assume that anything could be recognized correctly. We therefore believe that the students had a strong incentive to provide transcripts that were as accurate as possible. Furthermore, we reviewed all the transcriptions and edited them to add missing parts, normalize punctuation and correct misspellings; for authenticity, however, we preserved the original grammar and vocabulary, even where it is not standard English (e.g. massageses as a plural of massage, or botel, pronounced as bottle, as a term for a hotel on a boat).

2.4 Additional Resources

As mentioned above, the participants of the trade fair competed in various disciplines, which also included the preparation of slides and web pages for the fictional companies and their products.

Thanks to our close collaboration with the main organizer of the event, we were able to obtain these additional materials where available. While none of them were directly used in the presentation competition, they are closely linked to the mock companies and their field of activity. More details on the obtained and processed collection are given in Section 3.3.

We are confident that the students did their best when preparing these materials, motivated by the various competitions. For the purposes of ASR adaptation, however, the practical usability and overall quality of the materials differ considerably from company to company. The relevant topics and named entities for each company are nevertheless mentioned in the corresponding materials.

2.5 Ethical Standards

During the competition and the subsequent data evaluation, we complied with the ethical standards set in Europe by the General Data Protection Regulation (GDPR). Before the competition started, all the participants gave us their consent to use and release the collected data for research purposes, with the exception of their names and any other personal data. We therefore removed the real names of the students from the recordings, transcriptions and additional materials, and their photographs from the slides and downloaded web pages.

3 Corpus

The main motivation for collecting the corpus was to test our current ASR models and to gather data for further improving their robustness. We believe that the audio recordings contained in the corpus can be beneficial for anyone who wants to deploy ASR models in real-world applications. We also believe that a model's performance on these data is a good approximation of its general accuracy in noisy environments.

3.1 Audio Recordings and Transcriptions

The corpus consists of 39 recordings of presentations of fictional student firms. The content of the audio recording corpus is summarized in Table 2 and the native languages of the speakers in Table 3.

Single speaker Two speakers Total
Number of recordings 17 22 39
Total audio duration 24m 20s 24m 8s 58m 28s
Transcription words 2891 3722 6613
Distinct speakers 17 44 61
Table 2: Audio and content of the corpus.
Language cs de it es ro sk hu nl fi Total
Single speakers 9 - - 1 3 1 1 - - 17
Two speakers 18 4 16 2 - - - 3 1 44
Table 3: Native languages of the speakers in the corpus.

The recordings contain different types of background noise, including live music, announcements by the fair organisers at the main stage, conversations in various languages, and noise produced by attendees of the presentations.

3.2 Topics

The mock firms in the corpus represent a wide variety of small and medium-sized companies. We summarize their business fields in Table 4. The most common are travel agencies, followed by various food or beverage producers. Each firm is unique, focusing on a very specific market segment. Most of the firms fictionally operate only in their local areas.

Business Category  Firms
Travel agencies 7
Food and beverage producers 4
Beauty and health 3
Clothes and shoes 3
Household equipment 3
Online promotion 2
Accessories 2
Logistics 2
Others 13
Total 39
Table 4: Business categories of student firms included in the corpus.

3.3 Additional Resources

We collected additional resources for 36 of the student firms participating in the corpus. For each of them, we include either their presentation slides, their web page, or both. The numbers and types of resources are described in Table 5. In total, the additional resources contain about 97,000 words, with a total vocabulary size of about 15,000.

Slides Web Firms
✓      ✓   20
✓      –   12
–      ✓   4
–      –   3
Table 5: Types of additional materials and number of firms providing them.

In order to protect the privacy of the participants, we remove their real names and photographs; however, we preserve all facts related to the companies themselves. These include real or fictitious email addresses, phone numbers, websites and locations.

The resources included in the corpus come in three distinct formats. The first is the original format (either a Microsoft Office presentation format, or the original web content format such as HTML or images). The second is the XLIFF format generated by the MateCat Filters tool (http://filters.matecat.com/), an XML-based format that preserves the original structure of the document and may be useful for translation of the content, advanced information extraction, proper sentence segmentation or word-sense disambiguation. The third is a plaintext format, which we created from the XLIFF files simply by extracting the textual data. We include the plaintexts because they are convenient for corpus users, and the originals and XLIFF files because they contain complete information about the original structure of the documents.
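To illustrate how such a plaintext layer can be derived from the XLIFF files, the following is a minimal sketch (not part of the released corpus tooling) that extracts the text of the source segments from an XLIFF 1.2 document; the namespace and the example file name are assumptions.

```python
# Minimal sketch (not part of the released corpus tooling): extract text
# segments from an XLIFF 1.2 file such as those produced by MateCat Filters.
# The namespace and the example file name below are assumptions.
import xml.etree.ElementTree as ET

XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"

def xliff_to_plaintext(path):
    """Return the text of all <source> segments, one per line, in document order."""
    root = ET.parse(path).getroot()
    segments = []
    for source in root.iter(f"{{{XLIFF_NS}}}source"):
        # itertext() also collects text nested inside inline markup such as <g> or <mrk>.
        text = "".join(source.itertext()).strip()
        if text:
            segments.append(text)
    return "\n".join(segments)

if __name__ == "__main__":
    print(xliff_to_plaintext("firm01_slides.xlf"))  # hypothetical file name
```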

3.4 Additional Resources by Languages

The slides are either in Czech, Slovak or English. The web pages are mostly in the national languages; two of them are available in multiple parallel language variants and two are in English only. Despite this, we believe they can still be a valuable resource for improving ASR or SLT with English as the source language. We believe that the named entities and specific in-domain vocabulary of a spoken presentation, which could otherwise be left unrecognized, may be inferred from these documents, even automatically.
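As an illustration of such automatic inference, one conceivable approach (purely hypothetical, not prescribed by the corpus) is sketched below: tokens occurring in a firm's plaintext materials but absent from a general-domain English word list are collected as candidate in-domain terms for lexicon extension or ASR biasing. The file names and the word list are assumptions.

```python
# Illustrative sketch, not prescribed by the corpus: collect tokens from a
# firm's plaintext slides and web pages that do not occur in a general-domain
# English word list, as candidate in-domain terms and named entities.
import re
from collections import Counter

def load_common_words(path):
    """General-domain word list used to filter out ordinary vocabulary."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def candidate_terms(doc_paths, common_words, min_count=1):
    counts = Counter()
    for path in doc_paths:
        with open(path, encoding="utf-8") as f:
            # Keep alphabetic tokens only; this also works for accented characters.
            for token in re.findall(r"[^\W\d_]+", f.read()):
                if len(token) > 2 and token.lower() not in common_words:
                    counts[token] += 1
    return [term for term, n in counts.most_common() if n >= min_count]

# Example usage with hypothetical paths:
# common = load_common_words("english_top50k.txt")
# print(candidate_terms(["firm01_slides.txt", "firm01_web.txt"], common)[:20])
```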

We provide the language counts of the slides and web pages in Table 6. Note that there is one company in the corpus whose presenter's L1 is Hungarian, whose slides are in English and whose web page is in Romanian.

All the documents in the corpus are marked with language tags.

Lang. cs en de it es ro sk cs/en ro/en sk/en it/en/es/de total
Slides 14 15 - - - - 1 1 - 1 - 32
Web 14 2 2 2 1 1 - - 1 - 1 23
Table 6: Languages of presentation materials

4 Evaluation of ASR

In order to document the state of the art of ASR, we evaluated three ASR systems on the corpus.

4.1 The ASR Systems

We consider three different ASR systems:

The Janus Recognition Toolkit (JRTk) [9], featuring the IBIS single-pass decoder [15]. Its acoustic model was trained on TED talks [6] and Broadcast News [4]. This system was designed to recognize lecture talks for the IWSLT 2017 workshop [17].

Google Cloud Speech-to-Text (https://cloud.google.com/speech-to-text/) with the English (United States) language option.

A Kaldi-based model [12] trained on data from the Multi-Genre Broadcast Challenge [2]: 1600 hours of broadcast audio from BBC TV, with several hundred million words of subtitle text used for language modeling. This model is thus suited mainly to native British English speakers.

We also tried Microsoft Cloud ASR but it failed for all our recordings.
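For reference, a call to Google Cloud Speech-to-Text for a single recording could look roughly like the sketch below; the audio encoding, sample rate, the use of phrase hints and the helper itself are assumptions and do not necessarily reflect the setup used in our evaluation. Recordings longer than about one minute would require long_running_recognize instead of recognize.

```python
# Rough sketch of sending one recording to Google Cloud Speech-to-Text with
# the English (United States) option. Encoding, sample rate and phrase hints
# are assumptions, not the exact configuration used in this paper.
from google.cloud import speech

def transcribe(wav_path, phrases=None):
    client = speech.SpeechClient()
    with open(wav_path, "rb") as f:
        audio = speech.RecognitionAudio(content=f.read())
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,  # assumed format
        sample_rate_hertz=16000,                                   # assumed rate
        language_code="en-US",
        # Optional biasing with in-domain terms, e.g. harvested from the
        # additional materials described in Section 3.3.
        speech_contexts=[speech.SpeechContext(phrases=phrases or [])],
    )
    response = client.recognize(config=config, audio=audio)
    return " ".join(r.alternatives[0].transcript for r in response.results)
```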

Recognized by all All recordings
Google Kaldi B. JRTk Google Kaldi B. JRTk
Mean 73.59 87.55 45.21 89.32 87.47 45.63
Min 20.90 83.96 25.00 20.90 83.96 25.00
Max 98.31 91.03 74.08 100.00 91.03 99.58
Median 87.50 87.59 43.41 100.00 87.04 46.31
Stddev 27.87 2.29 15.28 21.82 1.92 15.23
Table 7: WER scores of the JRTk, Kaldi BBC and Google models on all recordings in the corpus (right) and on the recordings on which all the systems produced some output (left). A WER of 100% indicates that no output was provided.

4.2 Evaluation Metric

We use the standard word error rate (WER) metric: the minimum number of word insertions, deletions and substitutions needed to transform one document into the other, normalized by the total number of words in the reference transcription. As is customary in ASR development, we disregard letter case and punctuation in this evaluation. We took the transcriptions obtained from the participants as the ground truth against which the ASR outputs were evaluated.
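For clarity, the following is a small sketch of this metric as just described; it is an illustration rather than the exact scoring script used in our evaluation.

```python
# Sketch of the WER computation described above: lowercase, strip punctuation,
# then the word-level edit distance (insertions, deletions, substitutions)
# divided by the number of reference words, expressed in percent.
import re

def normalize(text):
    return re.sub(r"[^\w\s']", " ", text.lower()).split()

def wer(reference, hypothesis):
    ref, hyp = normalize(reference), normalize(hypothesis)
    # Dynamic-programming table for the Levenshtein distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub_cost = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub_cost, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return 100.0 * d[len(ref)][len(hyp)] / max(len(ref), 1)

# wer("hello world", "hello word") -> 50.0
```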

4.3 Results

The descriptive statistics of the word error rate scores are listed in Table 7 and visualized in Figure 1. Note that the lower the WER, the better the recognition.

As already discussed in Section 3, the audio files contain a significant amount of background noise. Due to this, Google returned an empty output in some cases, resulting in a WER of 100%. To account for this, we selected only the recordings on which all the systems had a WER below 100% and computed a second set of descriptive statistics on this subset.
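This second pass can be illustrated with the following sketch, which assumes hypothetical per-recording WER lists for each system; it is not the actual evaluation code.

```python
# Hypothetical sketch of the second evaluation pass: keep only the recordings
# on which every system produced some output (WER < 100) and recompute the
# descriptive statistics of Table 7. `scores` maps a system name to a list of
# per-recording WER values aligned by recording index; the data are illustrative.
import statistics

def filtered_stats(scores):
    n = len(next(iter(scores.values())))
    keep = [i for i in range(n) if all(s[i] < 100.0 for s in scores.values())]
    summary = {}
    for system, wers in scores.items():
        subset = [wers[i] for i in keep]
        summary[system] = {
            "mean": statistics.mean(subset),
            "min": min(subset),
            "max": max(subset),
            "median": statistics.median(subset),
            "stddev": statistics.stdev(subset),
        }
    return summary

# Example: filtered_stats({"google": [20.9, 100.0, 87.5], "jrtk": [43.4, 46.3, 25.0]})
# keeps recordings 0 and 2 and summarizes each system on that subset.
```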

By manually inspecting the recordings on which the systems had the highest error rates, we observed that the difficulties could have been caused by a very strong accent, by the microphone not being at an appropriate distance from the mouth, or by the speaker not articulating clearly. Background conditions, such as a band playing or students entering the presentation room, may also have affected the recognition quality.

Figure 1: Boxplot showing the word error rate scores of the Google, Kaldi BBC and JRTk models on all recordings (right) and on a subset where all the systems produced some non-empty output (left). The WER axis ranges from 20 to 100.

5 Related Works

Test sets for ASR are usually released together with speech corpora [6, 11, 5]. Our corpus is unique in that it contains L2 English, similarly to [20], but with a large variety of speakers, European L1s and realistic background noise conditions. Also, to the best of our knowledge, there is no other speech corpus with additional in-domain resources.

Robustness to noise: Several corpora are intended for noisy speech recognition [7, 1, 18, 14]. In [10], the authors show that a model trained on a large dataset of distorted audio with background noise generalizes much better than domain-specific models. Similar conclusions were reached in [8], where the authors experimented with random sampling of noise and intentionally corrupting the training data.

Non-native speech: Adaptation for non-native speech in low-resource scenarios was studied by [19], who proposed interpolation of acoustic models or polyphone decision tree specialization; these techniques can be incorporated into statistical ASR systems. For hybrid HMM-DNN (hidden Markov model and deep neural network) models, data selection methods can be used. In [16], a combination of L2 out-of-domain read speech and L2 in-domain spontaneous speech led to the highest improvements, as opposed to using L1 speech.

Domain adaptation: For purely neural LF-MMI (lattice-free maximum mutual information) models [13], multi-task learning with large out-of-domain data as the first task and in-domain data as the second task, or various approaches to transfer learning, can be beneficial [3].

6 Conclusion

We presented a small English speech corpus (only about 1 hour in total) intended as a test set for challenging speech recognition conditions: 61 distinct speakers, none of whom are native speakers of English, a diverse set of vocabulary domains, and noisy backgrounds.

We have demonstrated that current ASR systems have severe difficulties processing the test set, with WER on individual audio recordings ranging from 40 to 100%. The test set is equipped with additional text materials which can serve for the evaluation of domain adaptation techniques.

The corpus is publicly released and available at http://hdl.handle.net/11234/1-3023.

Acknowledgements

This research was supported in part by the grant H2020-ICT-2018-2-825460 (ELITR) of the European Union and the grant 19-26934X (NEUREM3) of the Czech Science Foundation.

We are grateful to the organization team of the fictional student firms fair, who allowed us to conduct the competition during the event. We are also grateful to the students, who presented their firms and transcribed their audio recordings. Last but not least, we are thankful to the team at Karlsruhe Institute of Technology and to the PerVoice team, who helped us overcome the technical difficulties we encountered.

References

  • [1] Abdulaziz, A., Kepuska, V.: Noisy TIMIT Speech LDC2017S04. In: Linguistic Data Consortium (LDC). Linguistic Data Consortium (LDC), University of Pennsylvania (2017)
  • [2] Bell, P., Gales, M., Hain, T., Kilgour, J., Lanchantin, P., Liu, X., McParland, A., Renals, S., Saz, O., Wester, M., Woodland, P.: The MGB Challenge: Evaluating Multi-Genre Broadcast Media Recognition. In: Proc. ASRU (2015)
  • [3] Ghahremani, P., Manohar, V., Hadian, H., Povey, D., Khudanpur, S.: Investigation of Transfer Learning for ASR Using LF-MMI Trained Neural Networks. In: 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). pp. 279–286 (Dec 2017)
  • [4] Graff, D.: The 1996 Broadcast News Speech And Language-Model Corpus. In: Proceedings of the 1997 DARPA Speech Recognition Workshop. pp. 11–14 (1996)
  • [5] Gretter, R.: Euronews: a Multilingual Speech Corpus for ASR. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14). pp. 2635–2638. European Language Resources Association (ELRA), Reykjavik, Iceland (May 2014)
  • [6] Hernandez, F., Nguyen, V., Ghannay, S., Tomashenko, N.A., Estève, Y.: TED-LIUM 3: Twice as Much Data and Corpus Repartition for Experiments on Speaker Adaptation. CoRR abs/1805.04699 (2018), http://arxiv.org/abs/1805.04699
  • [7] Hu, Y., Loizou, P.: Subjective Comparison of Speech Enhancement Algorithms. In: Proc. of ICASSP. vol. 1 (Jun 2006)
  • [8] Kim, C., Misra, A., Chin, K., Hughes, T., Narayanan, A., Sainath, T., Bacchiani, M.: Generation of large-scale simulated utterances in virtual rooms to train deep-neural networks for far-field speech recognition in Google Home. In: Proc. of INTERSPEECH (Aug 2017)
  • [9] Lavie, A., Waibel, A., Levin, L., Gates, D., Zeppenfeld, T., Zhan, P.: JANUS III: Speech-to-speech Translation in Multiple Languages. In: Proceedings of ICASSP 97 (Jan 1997)
  • [10] Narayanan, A., Misra, A., Sim, K.C., Pundak, G., Tripathi, A., Elfeky, M., Haghani, P., Strohman, T., Bacchiani, M.: Toward Domain-Invariant Speech Recognition via Large Scale Training. In: 2018 IEEE Spoken Language Technology Workshop, SLT 2018, Athens, Greece, December 18-21, 2018. pp. 441–447 (2018), https://doi.org/10.1109/SLT.2018.8639610
  • [11] Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: An ASR Corpus Based on Public Domain Audio Books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 5206–5210 (Apr 2015)
  • [12] Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlíček, P., Qian, Y., Schwarz, P., Silovský, J., Stemmer, G., Veselý, K.: The Kaldi Speech Recognition Toolkit. In: IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society (Dec 2011), IEEE Catalog No.: CFP11SRW-USB
  • [13] Povey, D., Peddinti, V., Galvez, D., Ghahremani, P., Manohar, V., Na, X., Wang, Y., Khudanpur, S.: Purely Sequence-Trained Neural Networks for ASR Based on Lattice-Free MMI. In: Interspeech 2016. pp. 2751–2755 (2016), http://dx.doi.org/10.21437/Interspeech.2016-595
  • [14] Schmidt-Nielsen, A., Marsh, E., Tardelli, J., Gatewood, P., Kreamer, E., Tremain, T., Cieri, C., Wright, J.: Speech in Noisy Environments (SPINE) Training Audio LDC2000S87. In: Linguistic Data Consortium (LDC). Linguistic Data Consortium (LDC), University of Pennsylvania (2000)
  • [15] Soltau, H., Metze, F., Fugen, C., Waibel, A.: A One-Pass Decoder Based on Polymorphic Linguistic Context Assignment. In: IEEE Workshop on Automatic Speech Recognition and Understanding, 2001. ASRU ’01. pp. 214–217 (Dec 2001)
  • [16] Tchistiakova, S.: Acoustic Models for Second Language Learners. master thesis, Universität des Saarlandes, Università degli studi di Trento (2018)
  • [17] Nguyen, T.S., Müller, M., Sperber, S., Zenkel, T., Stüker, S., Waibel, A.: The 2017 KIT IWSLT Speech-to-Text Systems for English and German. In: The International Workshop on Spoken Language Translation (IWSLT). Tokyo, Japan (December 14-15, 2017)
  • [18] Vincent, E., Barker, J., Watanabe, S., Le Roux, J., Nesta, F., Matassoni, M.: The second ‘CHiME’ speech separation and recognition challenge: An overview of challenge systems and outcomes. In: 2013 IEEE Workshop on Automatic Speech Recognition and Understanding. pp. 162–167 (Dec 2013)
  • [19] Wang, Z., Schultz, T., Waibel, A.: Comparison of Acoustic Model Adaptation Techniques on Non-Native Speech. In: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings. vol. 1 (May 2003)
  • [20] Zhao, G., Sonsaat, S., Silpachai, A., Lucic, I., Chukharev-Hudilainen, E., Levis, J., Gutierrez-Osuna, R.: L2-ARCTIC: A Non-Native English Speech Corpus. In: Proc. Interspeech 2018. pp. 2783–2787 (2018), http://dx.doi.org/10.21437/Interspeech.2018-1110