Identifying Introductions in Podcast Episodes from Automatically Generated Transcripts

10/14/2021
by Elise Jing, et al.

As the volume of long-form spoken-word content such as podcasts explodes, many platforms desire to present short, meaningful, and logically coherent segments extracted from the full content. Users can consume such segments to sample content before diving in, and platforms can use them to promote and recommend content. However, little published work has focused on the segmentation of spoken-word content, where the errors (noise) in transcripts generated by automatic speech recognition (ASR) services pose many challenges. Here we build a novel dataset of complete transcriptions of over 400 podcast episodes, in which we label the position of the introduction in each episode. These introductions contain information about the episodes' topics, hosts, and guests, providing a valuable author-created summary of the episode content. We further augment our dataset with word substitutions to increase the amount of available training data. We train three Transformer models based on pre-trained BERT and different augmentation strategies, which achieve significantly better performance than a static embedding model, showing that it is possible to capture generalized, larger-scale structural information from noisy, loosely organized speech data. This is further demonstrated through an analysis of the models' inner architecture. Our methods and dataset can be used to facilitate future work on the structure-based segmentation of spoken-word content.
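The abstract mentions augmenting the training data with word substitutions to mimic ASR noise. The paper does not specify its exact procedure, but the general idea can be sketched as follows; the function name, vocabulary, and substitution rate below are illustrative assumptions, not details from the paper.

```python
import random

def augment_with_substitutions(tokens, vocab, sub_rate=0.15, seed=0):
    """Randomly replace a fraction of tokens with words drawn from a
    vocabulary, simulating ASR substitution errors. The 15% rate and
    the replacement vocabulary are illustrative choices."""
    rng = random.Random(seed)
    out = []
    for tok in tokens:
        if rng.random() < sub_rate:
            out.append(rng.choice(vocab))  # inject a plausible ASR error
        else:
            out.append(tok)                # keep the original token
    return out

transcript = "welcome to the show today we talk about machine learning".split()
noisy = augment_with_substitutions(transcript, vocab=["uh", "show", "the", "and"])
print(" ".join(noisy))
```

Each augmented copy preserves the transcript length and the labeled introduction boundary, so the same position labels can be reused for the new training examples.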


Related research

research · 05/19/2022
Content-Context Factorized Representations for Automated Speech Recognition
Deep neural networks have largely demonstrated their ability to perform ...

research · 02/27/2023
Multimodal Speech Recognition for Language-Guided Embodied Agents
Benchmarks for language-guided embodied agents typically assume text-bas...

research · 09/24/2019
Learning ASR-Robust Contextualized Embeddings for Spoken Language Understanding
Employing pre-trained language models (LM) to extract contextualized wor...

research · 12/26/2016
Abstractive Headline Generation for Spoken Content by Attentive Recurrent Neural Networks with ASR Error Modeling
Headline generation for spoken content is important since spoken content...

research · 07/20/2021
Sequence Model with Self-Adaptive Sliding Window for Efficient Spoken Document Segmentation
Transcripts generated by automatic speech recognition (ASR) systems for ...

research · 03/15/2023
Cascading and Direct Approaches to Unsupervised Constituency Parsing on Spoken Sentences
Past work on unsupervised parsing is constrained to written form. In thi...

research · 08/23/2016
Towards Machine Comprehension of Spoken Content: Initial TOEFL Listening Comprehension Test by Machine
Multimedia or spoken content presents more attractive information than p...
