Political corpus creation through automatic speech recognition on EU debates

04/17/2023
by Hugo de Vos, et al.

In this paper, we present a transcribed corpus of the LIBE committee of the EU parliament, totalling 3.6 million running words. The meetings of parliamentary committees of the EU are a potentially valuable source of information for political scientists, but the data is not readily available because it is only disclosed as speech recordings together with limited metadata. The meetings are in English, partly spoken by non-native speakers and partly spoken by interpreters. We investigated which Automatic Speech Recognition (ASR) model is most appropriate for creating an accurate text transcription of the audio recordings of the meetings, in order to make their content available for research and analysis. We focused on the unsupervised domain adaptation of the ASR pipeline. Building on the transformer-based Wav2vec2.0 model, we experimented with multiple acoustic models, language models and the addition of domain-specific terms. We found that a domain-specific acoustic model and a domain-specific language model give substantial improvements to the ASR output, reducing the word error rate (WER) from 28.22 to 17.95. The use of domain-specific terms in the decoding stage did not have a positive effect on the quality of the ASR in terms of WER. Initial topic modelling results indicated that the corpus is useful for downstream analysis tasks. We release the resulting corpus and our analysis pipeline for future research.
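For readers unfamiliar with the metric, the sketch below illustrates how a word error rate is conventionally computed: the word-level edit distance (substitutions, deletions and insertions) between a reference transcript and the ASR hypothesis, divided by the number of reference words. It is a minimal illustration, not the evaluation code of the paper. The function names `word_error_rate` and `relative_reduction` and the example sentences are hypothetical, and the simple whitespace tokenisation is an assumption; the paper's own text normalisation is not described in the abstract.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are obtained by plain whitespace splitting.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, adapted: float) -> float:
    """Fraction by which an error rate dropped relative to the baseline."""
    return (baseline - adapted) / baseline


if __name__ == "__main__":
    # Hypothetical example: one substituted word out of four.
    print(word_error_rate("the committee meets today",
                          "the committee meet today"))
    # WER figures reported in the abstract, before and after adaptation.
    print(round(relative_reduction(28.22, 17.95), 3))
```

Applied to the two WER figures reported above, `relative_reduction` gives a relative improvement of roughly 36%, assuming both figures are expressed in the same unit.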


