EnCBP: A New Benchmark Dataset for Finer-Grained Cultural Background Prediction in English

03/28/2022
by Weicheng Ma, et al.

While cultural backgrounds have been shown to affect linguistic expressions, existing natural language processing (NLP) research on culture modeling is overly coarse-grained and does not examine cultural differences among speakers of the same language. To address this problem and augment NLP models with cultural background features, we collect, annotate, manually validate, and benchmark EnCBP, a finer-grained news-based cultural background prediction dataset in English. Through language modeling (LM) evaluations and manual analyses, we confirm that there are noticeable differences in linguistic expressions among five English-speaking countries and across four states in the US. Additionally, we evaluate on nine tasks: one syntactic (CoNLL-2003), four semantic (PAWS-Wiki, QNLI, STS-B, and RTE), and four psycholinguistic (SST-5, SST-2, Emotion, and Go-Emotions). The results show that, while introducing cultural background information does not benefit the Go-Emotions task due to text domain conflicts, it noticeably improves deep learning (DL) model performance on the other tasks. Our findings strongly support the importance of cultural background modeling to a wide variety of NLP tasks and demonstrate the applicability of EnCBP in culture-related research.


research
03/18/2022

Challenges and Strategies in Cross-Cultural NLP

Various efforts in the Natural Language Processing (NLP) community have ...
research
01/12/2019

The Importance of Socio-Cultural Differences for Annotating and Detecting the Affective States of Students

The development of real-time affect detection models often depends upon ...
research
09/12/2023

BHASA: A Holistic Southeast Asian Linguistic and Cultural Evaluation Suite for Large Language Models

The rapid development of Large Language Models (LLMs) and the emergence ...
research
08/31/2023

CReHate: Cross-cultural Re-annotation of English Hate Speech Dataset

English datasets predominantly reflect the perspectives of certain natio...
research
04/04/2019

Studying Cultural Differences in Emoji Usage across the East and the West

Global acceptance of Emojis suggests a cross-cultural, normative use of ...
research
05/26/2021

Deception detection in text and its relation to the cultural dimension of individualism/collectivism

Deception detection is a task with many applications both in direct phys...
research
07/07/2023

Crossing the Linguistic Causeway: Ethnonational Differences on Soundscape Attributes in Bahasa Melayu

Despite being neighbouring countries and sharing the language of Bahasa ...
