SentiGOLD: A Large Bangla Gold Standard Multi-Domain Sentiment Analysis Dataset and its Evaluation

06/09/2023
by Md. Ekramul Islam, et al.

This study introduces SentiGOLD, a Bangla multi-domain sentiment analysis dataset. Comprising 70,000 samples, it was created from diverse sources and annotated by a gender-balanced team of linguists. SentiGOLD adheres to established linguistic conventions agreed upon by the Government of Bangladesh and a Bangla linguistics committee. Unlike English and other languages, Bangla lacks standard sentiment analysis datasets due to the absence of a national linguistics framework. The dataset incorporates data from online video comments, social media posts, blogs, news, and other sources, while rigorously maintaining the domain and class distributions. It spans 30 domains (e.g., politics, entertainment, sports) and includes 5 sentiment classes (strongly negative, weakly negative, neutral, weakly positive, and strongly positive). The annotation scheme, approved by the national linguistics committee, yields a robust inter-annotator agreement (IAA), with a Fleiss' kappa score of 0.88. Intra- and cross-dataset evaluation protocols are applied to establish a standard classification system. Cross-dataset evaluation on the noisy SentNoB dataset presents a challenging test scenario. Additionally, zero-shot experiments demonstrate the generalizability of SentiGOLD. The top model achieves a macro F1 score of 0.62 (intra-dataset) across 5 classes, setting a benchmark, and 0.61 (cross-dataset from SentNoB) across 3 classes, comparable to the state of the art. The fine-tuned sentiment analysis model can be accessed at https://sentiment.bangla.gov.bd.
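
For readers unfamiliar with the two metrics quoted above, the following is a minimal Python sketch of how Fleiss' kappa and macro F1 are conventionally computed. It is not the authors' evaluation code: the function names `fleiss_kappa` and `macro_f1`, the input layouts, and the toy numbers are assumptions made purely for illustration, and none of the values come from SentiGOLD.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of
    annotators who assigned item i to class j (hypothetical layout).
    Every item must be rated by the same number of annotators."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    # Mean observed agreement across items.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items

    # Agreement expected by chance, from the overall class proportions.
    n_classes = len(counts[0])
    p_e = sum(
        (sum(row[j] for row in counts) / (n_items * n_raters)) ** 2
        for j in range(n_classes)
    )
    if p_e == 1.0:
        raise ValueError("kappa is undefined when only one class is ever used")
    return (p_bar - p_e) / (1.0 - p_e)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores over all observed labels."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy data only: 2 items, 3 annotators, 2 classes.
    print(fleiss_kappa([[3, 0], [0, 3]]))
    print(macro_f1([0, 0, 1, 1], [0, 1, 1, 1]))
```

Because macro F1 averages the per-class scores without weighting by class frequency, rare sentiment classes count as much as common ones, which is one reason it is a common choice for multi-class sentiment benchmarks.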

