Pansori: ASR Corpus Generation from Open Online Video Contents

12/23/2018
by Yoona Choi, et al.

This paper introduces Pansori, a program for creating automatic speech recognition (ASR) corpora from online video content. It uses a cloud-based speech API, which makes it easy to build corpora in different languages. Using this program, we semi-automatically generated the Pansori-TEDxKR dataset from Korean TED conference talks with community-transcribed subtitles. This dataset is the first high-quality corpus for the Korean language that is freely available for independent research. Pansori is released as open-source software, and the generated corpus is released under a permissive public license for community use and participation.
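
The abstract does not spell out the pipeline, so the following is only a minimal sketch of one plausible reading: subtitle cues are parsed into timed segments, and a segment is kept only when a speech recognizer's output roughly agrees with the subtitle text. The names `Cue`, `parse_srt`, and `keep_matching`, the SRT input format, the `recognize` callback (a stand-in for whatever cloud speech API is used), and the 0.8 threshold are all assumptions made for illustration, not details taken from Pansori.

```python
import re
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, List


@dataclass
class Cue:
    start: float  # segment start, in seconds
    end: float    # segment end, in seconds
    text: str     # subtitle text for the segment


_TIME = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


def _seconds(stamp: str) -> float:
    """Convert an SRT timestamp such as 00:01:02,500 to seconds."""
    m = _TIME.fullmatch(stamp.strip())
    if m is None:
        raise ValueError(f"bad timestamp: {stamp!r}")
    h, mi, s, ms = (int(g) for g in m.groups())
    return h * 3600 + mi * 60 + s + ms / 1000


def parse_srt(srt: str) -> List[Cue]:
    """Parse SRT subtitle text into timed cues (assumed input format)."""
    cues = []
    for block in re.split(r"\n\s*\n", srt.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        idx = next((i for i, ln in enumerate(lines) if "-->" in ln), None)
        if idx is None:
            continue  # not a cue block
        start, end = lines[idx].split("-->", 1)
        text = " ".join(lines[idx + 1:])
        if text:
            # Ignore any positioning hints after the end timestamp.
            cues.append(Cue(_seconds(start), _seconds(end.split()[0]), text))
    return cues


def _norm(text: str) -> str:
    """Lowercase and drop punctuation and spaces before comparing."""
    return re.sub(r"[\W_]+", "", text.lower())


def keep_matching(
    cues: List[Cue],
    recognize: Callable[[Cue], str],  # hypothetical speech-API wrapper
    threshold: float = 0.8,           # assumed cutoff, not from the paper
) -> List[Cue]:
    """Keep cues whose subtitle text roughly matches the recognizer output."""
    kept = []
    for cue in cues:
        ref = _norm(cue.text)
        if not ref:
            continue  # nothing comparable, e.g. a music marker
        ratio = SequenceMatcher(None, ref, _norm(recognize(cue))).ratio()
        if ratio >= threshold:
            kept.append(cue)
    return kept
```

In practice `recognize` would cut the audio between `cue.start` and `cue.end` and send it to a speech service; it is left as a callback here so the sketch stays independent of any particular API.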
