Bi-Phone: Modeling Inter Language Phonetic Influences in Text

07/06/2023
by   Abhirut Gupta, et al.

Because of technology asymmetries, a large number of people are forced to use the Web in a language in which they have low literacy. Text written in this second language (L2) by such users often contains many errors that are influenced by their native language (L1). We propose a method to mine phoneme confusions (sounds in L2 that an L1 speaker is likely to conflate) for pairs of L1 and L2. These confusions are then plugged into a generative model (Bi-Phone) that synthetically produces corrupted L2 text. Through human evaluations, we show that Bi-Phone generates plausible corruptions that differ across L1s and have widespread coverage on the Web. We also corrupt the popular language understanding benchmark SuperGLUE with our technique (FunGLUE, for Phonetically Noised GLUE) and show that SoTA language understanding models perform poorly on it. In addition, we introduce a new phoneme prediction pre-training task that helps byte models recover performance close to that on SuperGLUE. Finally, we release the FunGLUE benchmark to promote further research in phonetically robust language models. To the best of our knowledge, FunGLUE is the first benchmark to introduce L1-L2 interactions in text.
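To make the idea of plugging phoneme confusions into a generative corruption model concrete, here is a minimal sketch in Python. It is not the Bi-Phone model itself: the confusion table, its probabilities, the phoneme-to-grapheme map, and the function name `corrupt` are all hypothetical and chosen only for illustration. The sketch assumes a word is already aligned into (phoneme, grapheme) segments and replaces each phoneme independently according to the table.

```python
import random

# Hypothetical confusion table for one L1-L2 pair: for each L2 phoneme,
# the phonemes an L1 speaker might substitute, with made-up probabilities.
CONFUSIONS = {
    "v": [("w", 0.3)],
    "z": [("s", 0.4)],
}

# Hypothetical spelling used to render each substituted phoneme.
PHONEME_TO_GRAPHEME = {"w": "w", "s": "s"}


def corrupt(segments, rng):
    """Return a possibly misspelled word.

    segments: list of (phoneme, grapheme) pairs aligned over the word.
    Each phoneme is swapped for a confusable one with the probability
    given in CONFUSIONS; otherwise the original grapheme is kept.
    """
    out = []
    for phoneme, grapheme in segments:
        replacement = grapheme
        draw = rng.random()
        cumulative = 0.0
        for confused, prob in CONFUSIONS.get(phoneme, []):
            cumulative += prob
            if draw < cumulative:
                replacement = PHONEME_TO_GRAPHEME[confused]
                break
        out.append(replacement)
    return "".join(out)


# "very" as aligned (phoneme, grapheme) segments; the alignment is illustrative.
VERY = [("v", "v"), ("ɛ", "e"), ("r", "r"), ("i", "y")]

# Sample corruptions under different seeds and collect the distinct outputs.
samples = {corrupt(VERY, random.Random(seed)) for seed in range(50)}
```

In this toy setup the only confusable phoneme in "very" is /v/, so the sampled outputs are either the original spelling or "wery". A different L1 would be represented by a different `CONFUSIONS` table, which is how corruptions could differ across L1s.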


