Around the world in 60 words: A generative vocabulary test for online research

02/03/2023
by Pol van Rijn, et al.

Conducting experiments with diverse participants in their native languages can uncover insights into culture, cognition, and language that may not be revealed otherwise. However, conducting these experiments online makes it difficult to validate self-reported language proficiency. Furthermore, existing proficiency tests are small and cover only a few languages. We present an automated pipeline to generate vocabulary tests using text from Wikipedia. Our pipeline samples rare nouns and creates pseudowords with the same low-level statistics. Six behavioral experiments (N=236) in six countries and eight languages show that (a) our test can distinguish between native speakers of closely related languages, (b) the test is reliable (r=0.82), and (c) performance strongly correlates with existing tests (LexTale) and self-reports. We further show that test accuracy is negatively correlated with the linguistic distance between the tested and the native language. Our test, available in eight languages, can easily be extended to other languages.
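The abstract does not say how the pseudowords are built, so the following is only a minimal sketch of one common way to get strings that share low-level statistics with real words: a character n-gram (Markov) model trained on the sampled words. All names here (`rare_words`, `build_model`, `generate_pseudoword`) and all parameter values are hypothetical, and the frequency filter is a crude stand-in for the noun sampling the paper describes, since no part-of-speech tagging is done.

```python
import random
from collections import Counter, defaultdict

START, END = "^", "$"  # padding symbols, assumed absent from the vocabulary


def rare_words(tokens, max_count=1, min_len=4):
    """Return words seen at most max_count times (stand-in for 'rare nouns')."""
    counts = Counter(tokens)
    return sorted(w for w, c in counts.items() if c <= max_count and len(w) >= min_len)


def build_model(words, order=2):
    """Map each character context of length `order` to the characters that follow it."""
    model = defaultdict(list)
    for word in words:
        padded = START * order + word + END
        for i in range(len(padded) - order):
            model[padded[i:i + order]].append(padded[i + order])
    return model


def generate_pseudoword(model, real_words, rng, order=2,
                        min_len=4, max_len=12, max_tries=1000):
    """Sample a string from the model that is not in real_words, or None."""
    for _ in range(max_tries):
        context, chars = START * order, []
        while len(chars) <= max_len:
            nxt = rng.choice(model[context])  # follows observed transitions only
            if nxt == END:
                break
            chars.append(nxt)
            context = (context + nxt)[-order:]
        candidate = "".join(chars)
        if min_len <= len(candidate) <= max_len and candidate not in real_words:
            return candidate
    return None


if __name__ == "__main__":
    words = ["tavolo", "cavallo", "finestra", "montagna", "castello", "giardino"]
    model = build_model(words, order=2)
    print(generate_pseudoword(model, set(words), random.Random(0)))
```

Because every transition is drawn from the training words, each generated string contains only character trigrams that occur in real words. Rejecting candidates against a small word list is not enough in practice: a string such as "cavolo" can be produced from this toy list even though it is a real Italian word, so a full lexicon check would be needed. The model must be built from a non-empty word list.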

research
05/25/2023

Bhasha-Abhijnaanam: Native-script and romanized Language Identification for 22 Indic languages

We create publicly available language identification (LID) datasets and ...
research
07/02/2020

Processing South Asian Languages Written in the Latin Script: the Dakshina Dataset

This paper describes the Dakshina dataset, a new resource consisting of ...
research
12/19/2022

An Investigation of Indian Native Language Phonemic Influences on L2 English Pronunciations

Speech systems are sensitive to accent variations. This is especially ch...
research
04/14/2022

Applying Feature Underspecified Lexicon Phonological Features in Multilingual Text-to-Speech

This study investigates whether the phonological features derived from t...
research
03/20/2015

On measuring linguistic intelligence

This work addresses the problem of measuring how many languages a person...
research
02/23/2019

Categorization in the Wild: Generalizing Cognitive Models to Naturalistic Data across Languages

Categories such as animal or furniture are acquired at an early age and ...
research
08/05/2020

Multiple Texts as a Limiting Factor in Online Learning: Quantifying (Dis-)similarities of Knowledge Networks across Languages

We test the hypothesis that the extent to which one obtains information ...
