Investigating model performance in language identification: beyond simple error statistics

05/30/2023
by Suzy J. Styles, et al.

Language development experts need tools that can automatically identify languages from fluent, conversational speech and provide reliable estimates of usage rates at the level of an individual recording. However, language identification systems are typically evaluated with metrics such as equal error rate and balanced accuracy, applied at the level of an entire speech corpus. These overview metrics provide no information about model performance at the level of individual speakers, recordings, or units of speech with different linguistic characteristics. Overview statistics may therefore mask systematic errors in model performance on some subsets of the data. As a consequence, a model may perform worse on data derived from some subsets of human speakers, creating a kind of algorithmic bias. In the current paper, we investigate how well a number of language identification systems perform on individual recordings and on speech units with different linguistic properties in the MERLIon CCS Challenge. The Challenge dataset features accented English-Mandarin code-switched child-directed speech.
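The contrast between corpus-level and recording-level evaluation can be illustrated with a minimal sketch. It is not the authors' evaluation code: the recording IDs, the language labels, and the toy predictions are all hypothetical, and balanced accuracy is taken here in its usual sense as the mean of per-class recall.

```python
from collections import defaultdict


def balanced_accuracy(pairs):
    """Mean of per-class recall over (true_label, predicted_label) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for true, pred in pairs:
        totals[true] += 1
        if true == pred:
            correct[true] += 1
    recalls = [correct[label] / totals[label] for label in totals]
    return sum(recalls) / len(recalls)


def per_recording_balanced_accuracy(rows):
    """Group (recording_id, true, pred) rows by recording and score each one."""
    by_recording = defaultdict(list)
    for recording_id, true, pred in rows:
        by_recording[recording_id].append((true, pred))
    return {rec: balanced_accuracy(pairs) for rec, pairs in by_recording.items()}


# Hypothetical toy data: one long recording the model handles well,
# and one short recording where every Mandarin unit is labelled English.
rows = (
    [("rec_a", "en", "en")] * 8
    + [("rec_a", "zh", "zh")] * 8
    + [("rec_b", "en", "en")] * 2
    + [("rec_b", "zh", "en")] * 2
)

# Corpus-level score pools every speech unit, ignoring which recording it came from.
pooled = balanced_accuracy([(true, pred) for _, true, pred in rows])
per_recording = per_recording_balanced_accuracy(rows)

print(f"pooled: {pooled:.2f}")
for rec, score in sorted(per_recording.items()):
    print(f"{rec}: {score:.2f}")
```

In this toy case the pooled score is about 0.90, while the per-recording scores are 1.00 for `rec_a` and 0.50 for `rec_b`. The pooled figure looks healthy even though every Mandarin unit in the second recording is misclassified.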


