QASR: QCRI Aljazeera Speech Resource – A Large Scale Annotated Arabic Speech Corpus

06/24/2021
by Hamdy Mubarak, et al.

We introduce QASR, the largest transcribed Arabic speech corpus, collected from the broadcast domain. This multi-dialect dataset contains 2,000 hours of speech sampled at 16 kHz, crawled from the Aljazeera news channel. It is released with lightly supervised transcriptions aligned with the audio segments. Unlike previous datasets, QASR includes linguistically motivated segmentation, punctuation, and speaker information, among other annotations. QASR is suitable for training and evaluating speech recognition systems, acoustics- and/or linguistics-based Arabic dialect identification, punctuation restoration, speaker identification, speaker linking, and potentially other NLP modules for spoken data. In addition to the QASR transcriptions, we release a dataset of 130M words to aid in designing and training better language models. We show that an end-to-end automatic speech recognition system trained on QASR achieves a word error rate competitive with one trained on the previous MGB-2 corpus. We report baseline results for downstream natural language processing tasks, such as named entity recognition, using speech transcripts. We also report the first baseline for Arabic punctuation restoration. We make the corpus available to the research community.
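For readers unfamiliar with the metric mentioned above, word error rate is conventionally the word-level edit distance between a reference transcript and a system hypothesis, divided by the number of reference words. The following is a minimal sketch of that standard definition only; the function name `word_error_rate` is illustrative, and the abstract does not describe the text normalization or scoring setup actually used for QASR, so none is assumed here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper, not the paper's scoring script: tokens are split
    on whitespace and no text normalization is applied.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution and one deletion against a four-word reference.
    print(word_error_rate("a b c d", "a x c"))
```

Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.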


