SMTCE: A Social Media Text Classification Evaluation Benchmark and BERTology Models for Vietnamese

09/21/2022
by Luan Thanh Nguyen, et al.

Text classification is a core task in natural language processing and computational linguistics, with a wide range of applications. As the number of users on social media platforms increases, the rapid growth of user-generated data has prompted new studies on Social Media Text Classification (SMTC), or social media text mining, that draw on these valuable resources. In contrast to English, Vietnamese, a low-resource language, has not yet been studied or exploited thoroughly. Inspired by the success of GLUE, we introduce the Social Media Text Classification Evaluation (SMTCE) benchmark, a collection of datasets and models across a diverse set of SMTC tasks. Using the proposed benchmark, we implement and analyze the effectiveness of a variety of multilingual BERT-based models (mBERT, XLM-R, and DistilmBERT) and monolingual BERT-based models (PhoBERT, viBERT, vELECTRA, and viBERT4news) on the tasks in SMTCE. Monolingual models outperform multilingual models and achieve state-of-the-art results on all text classification tasks. The benchmark provides an objective assessment of multilingual and monolingual BERT-based models, which will benefit future studies of BERTology for the Vietnamese language.
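
The comparison described in the abstract, several models scored on several classification tasks, can be illustrated with a minimal sketch. Everything below is an assumption for illustration: the abstract does not state which metric is used (macro-averaged F1 is chosen here only because it is common for classification), and `macro_f1`, `evaluate_benchmark`, `best_per_task`, the toy task, and the toy "models" are hypothetical names and data, not the SMTCE datasets or the authors' code. Each model is represented as a plain function from text to label so the sketch stays self-contained; a real run would wrap a fine-tuned BERT-based model instead.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Model families named in the abstract (names only; nothing is loaded here).
MULTILINGUAL = ["mBERT", "XLM-R", "DistilmBERT"]
MONOLINGUAL = ["PhoBERT", "viBERT", "vELECTRA", "viBERT4news"]


def macro_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Unweighted mean of per-label F1 (assumed metric, see lead-in)."""
    labels = sorted(set(gold) | set(pred))
    if not labels:
        return 0.0
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def evaluate_benchmark(
    models: Dict[str, Callable[[str], str]],
    tasks: Dict[str, List[Tuple[str, str]]],
) -> Dict[str, Dict[str, float]]:
    """Score every model on every task; returns {task: {model: macro-F1}}."""
    results: Dict[str, Dict[str, float]] = {}
    for task_name, examples in tasks.items():
        texts = [text for text, _ in examples]
        gold = [label for _, label in examples]
        results[task_name] = {
            model_name: macro_f1(gold, [predict(t) for t in texts])
            for model_name, predict in models.items()
        }
    return results


def best_per_task(results: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """Name of the highest-scoring model for each task."""
    return {task: max(scores, key=scores.get) for task, scores in results.items()}


if __name__ == "__main__":
    # Toy stand-ins only: hypothetical data and rule-based "models".
    toy_tasks = {"toy_sentiment": [("good", "pos"), ("bad", "neg")]}
    toy_models = {
        "always_pos": lambda text: "pos",
        "keyword": lambda text: "pos" if "good" in text else "neg",
    }
    table = evaluate_benchmark(toy_models, toy_tasks)
    print(table)
    print(best_per_task(table))
```

Keeping the metric, the per-task loop, and the model interface separate means the same table can be filled in for each model family, which is the kind of side-by-side comparison of multilingual and monolingual models the abstract reports.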


