Crowdsourcing Universal Part-Of-Speech Tags for Code-Switching

03/24/2017
by   Victor Soto, et al.

Code-switching is the phenomenon by which bilingual speakers switch between multiple languages during communication. The importance of developing language technologies for code-switching data is immense, given the large populations that routinely code-switch. High-quality linguistic annotations are extremely valuable for any NLP task, and performance is often limited by the amount of high-quality labeled data. However, little such data exists for code-switching. In this paper, we describe crowdsourcing universal part-of-speech tags for the Miami Bangor Corpus of Spanish-English code-switched speech. We split the annotation task into three subtasks: one in which a subset of tokens is labeled automatically, one in which questions are specifically designed to disambiguate a subset of high-frequency words, and a more general cascaded approach for the remaining data, in which questions are displayed to the worker following a decision-tree structure. Each subtask is extended and adapted for a multilingual setting and the universal tagset. The quality of the annotation process is measured using hidden check questions annotated with gold labels. The overall agreement between gold-standard labels and the majority vote is between 0.95 and 0.96 for just three labels, and the average recall across part-of-speech tags is between 0.87 and 0.99, depending on the task.
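To make the two quality measures in the abstract concrete, the following is a minimal sketch of how majority-vote agreement with gold labels and average per-tag recall could be computed. It is not the authors' code: the function names (`majority_vote`, `evaluate`), the tie-breaking rule, and the toy tokens and tags are all assumptions chosen for illustration, and the numbers it produces have nothing to do with the results reported above.

```python
from collections import Counter, defaultdict


def majority_vote(labels):
    """Return the most frequent label; ties go to the label seen first (an assumption)."""
    return Counter(labels).most_common(1)[0][0]


def evaluate(worker_labels, gold):
    """Compare majority-vote tags against gold tags on check questions.

    worker_labels: dict mapping a token id to the list of tags given by workers.
    gold: dict mapping the same token ids to their gold tag.
    Returns (agreement, per-tag recall, recall averaged over tags).
    """
    correct = 0
    hits = defaultdict(int)    # gold tag -> tokens where the vote matched gold
    totals = defaultdict(int)  # gold tag -> tokens carrying that gold tag
    for token_id, gold_tag in gold.items():
        voted = majority_vote(worker_labels[token_id])
        totals[gold_tag] += 1
        if voted == gold_tag:
            correct += 1
            hits[gold_tag] += 1
    agreement = correct / len(gold)
    recall = {tag: hits[tag] / totals[tag] for tag in totals}
    average_recall = sum(recall.values()) / len(recall)
    return agreement, recall, average_recall


# Hypothetical check questions, three worker labels per token.
worker_labels = {
    "t1": ["NOUN", "NOUN", "VERB"],
    "t2": ["VERB", "VERB", "VERB"],
    "t3": ["ADJ", "NOUN", "NOUN"],
    "t4": ["ADP", "ADP", "SCONJ"],
}
gold = {"t1": "NOUN", "t2": "VERB", "t3": "ADJ", "t4": "ADP"}

agreement, recall, average_recall = evaluate(worker_labels, gold)
print(agreement, recall, average_recall)
```

In this toy data the vote for `t3` is NOUN while the gold tag is ADJ, so one of four tokens disagrees and ADJ gets a recall of zero. Averaging recall over tags rather than over tokens keeps rare tags from being hidden by frequent ones, which is one plausible reason to report both figures.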


