Learning to Recognize Dialect Features

10/23/2020
by Dorottya Demszky, et al.

Linguists characterize dialects by the presence, absence, and frequency of dozens of interpretable features. Detecting these features in text has applications to social science and dialectology, and can be used to assess the robustness of natural language processing systems to dialect differences. For most dialects, large-scale annotated corpora for these features are unavailable, making it difficult to train recognizers. Linguists typically define dialect features by providing a small number of minimal pairs: paired examples that differ only in whether the feature is present, with everything else held constant. In this paper, we present two multitask learning architectures for recognizing dialect features, both based on pretrained transformers. We evaluate these models on two test sets of Indian English, annotated for a total of 22 dialect features. We find that these models learn to recognize many features with high accuracy; crucially, a few minimal pairs can be nearly as effective for training as thousands of labeled examples. We also demonstrate the downstream applicability of our dialect feature detection model as a dialect density measure and as a dialect classifier.
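To make the idea of a minimal pair concrete, here is a minimal sketch of how such pairs could be turned into labeled training examples for a feature recognizer. The `MinimalPair` class, the `to_labeled_examples` helper, the feature names, and the example sentences are all hypothetical illustrations, not taken from the paper, and the sketch does not reproduce the authors' architectures.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class MinimalPair:
    """Hypothetical container: two sentences differing only in one feature."""
    feature: str   # illustrative feature name (an assumption, not the paper's label set)
    positive: str  # sentence in which the feature is present
    negative: str  # same sentence with the feature absent


def to_labeled_examples(pairs: List[MinimalPair]) -> List[Tuple[str, str, int]]:
    """Expand each pair into two (text, feature, label) training examples."""
    examples = []
    for pair in pairs:
        examples.append((pair.positive, pair.feature, 1))  # feature present
        examples.append((pair.negative, pair.feature, 0))  # feature absent
    return examples


# Illustrative pairs only; the sentences are invented for this sketch.
pairs = [
    MinimalPair("article_omission",
                "She bought new car yesterday.",
                "She bought a new car yesterday."),
    MinimalPair("focus_only",
                "I was there yesterday only.",
                "I was there yesterday."),
]

examples = to_labeled_examples(pairs)
for text, feature, label in examples:
    print(f"{feature}\t{label}\t{text}")
```

Each pair yields one positive and one negative example for its feature, so a handful of pairs per feature already gives a small, balanced training set of the kind the abstract describes.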


