RadioTalk: a large-scale corpus of talk radio transcripts

07/16/2019
by Doug Beeferman, et al.

We introduce RadioTalk, a corpus of speech recognition transcripts sampled from talk radio broadcasts in the United States between October of 2018 and March of 2019. The corpus is intended for use by researchers in the fields of natural language processing, conversational analysis, and the social sciences. The corpus encompasses approximately 2.8 billion words of automatically transcribed speech from 284,000 hours of radio, together with metadata about the speech, such as geographical location, speaker turn boundaries, gender, and radio program information. In this paper we summarize why and how we prepared the corpus, give some descriptive statistics on stations, shows and speakers, and carry out a few high-level analyses.


