OLKAVS: An Open Large-Scale Korean Audio-Visual Speech Dataset

01/16/2023
by   Jeongkyun Park, et al.

Inspired by the way humans comprehend speech in a multi-modal manner, various audio-visual datasets have been constructed. However, most existing datasets focus on English, depend on external prediction models during dataset preparation, and contain only a small number of multi-view videos. To mitigate these limitations, we recently developed the Open Large-scale Korean Audio-Visual Speech (OLKAVS) dataset, the largest among publicly available audio-visual speech datasets. The dataset contains 1,150 hours of transcribed audio from 1,107 Korean speakers recorded in a studio setup with nine different viewpoints and various noise conditions. We also provide pre-trained baseline models for two tasks, audio-visual speech recognition and lip reading. We conducted experiments with these models to verify the effectiveness of multi-modal and multi-view training over uni-modal and frontal-view-only training. We expect the OLKAVS dataset to facilitate multi-modal research in broader areas such as Korean speech recognition, speaker recognition, pronunciation level classification, and mouth motion analysis.
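To make the dataset description above concrete, the following is a minimal Python sketch of how one OLKAVS utterance and its nine camera views might be represented for multi-view training. The field names (speaker_id, viewpoint, noise_condition, audio_path, video_path, transcript) and the helper function are assumptions for illustration only, not the dataset's official schema or loading API.

```python
# Hypothetical sketch of an OLKAVS sample record; field names are assumed,
# not taken from the official release.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class OLKAVSSample:
    speaker_id: str       # one of the 1,107 Korean speakers
    viewpoint: int        # camera index, 1..9 (nine studio viewpoints)
    noise_condition: str  # e.g. "clean" or a noise label
    audio_path: str       # path to the transcribed audio clip
    video_path: str       # path to the corresponding video clip
    transcript: str       # Korean transcription of the utterance


def group_views(samples: List[OLKAVSSample]) -> Dict[int, OLKAVSSample]:
    """Map viewpoint index -> sample, assuming all samples share one utterance."""
    return {s.viewpoint: s for s in samples}
```

In a multi-view training setup, grouping the nine synchronized views of a single utterance like this would let a model be trained on all viewpoints jointly rather than on the frontal view alone, which is the comparison the baseline experiments describe.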

