Embedding generation for text classification of Brazilian Portuguese user reviews: from bag-of-words to transformers

12/01/2022
by Frederico Dias Souza, et al.

Text classification is a natural language processing (NLP) task relevant to many commercial applications, such as e-commerce and customer service. Classifying such texts accurately is often challenging because of intrinsic aspects of language, such as irony and nuance. To accomplish this task, one must provide a robust numerical representation for documents, a process known as embedding. Embedding is a key area of NLP today and has advanced significantly over the last decade, especially after the introduction of the word-to-vector concept and the popularization of Deep Learning models for solving NLP tasks, including Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Transformer-based Language Models (TLMs). Despite the impressive achievements in this field, the literature on generating embeddings for Brazilian Portuguese texts is scarce, especially for commercial user reviews. Therefore, this work provides a comprehensive experimental study of embedding approaches for binary sentiment classification of user reviews in Brazilian Portuguese. The study covers NLP models ranging from classical (Bag-of-Words) to state-of-the-art (Transformer-based). The methods are evaluated on five open-source datasets, with pre-defined data partitions made available in an open digital repository to encourage reproducibility. Fine-tuned TLMs achieved the best results in all cases, followed by the feature-based TLM, LSTM, and CNN, whose relative ranking varies with the dataset under analysis.
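
To make the classical end of that range concrete, the following is a minimal sketch of a Bag-of-Words embedding, assuming whitespace tokenization and raw term counts. The function names and the toy reviews are hypothetical and are not taken from the paper or its datasets; the authors' actual preprocessing and weighting may differ.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map each distinct lowercase token to a column index, in sorted order."""
    tokens = sorted({tok for doc in documents for tok in doc.lower().split()})
    return {tok: idx for idx, tok in enumerate(tokens)}


def bag_of_words(document, vocabulary):
    """Return a term-count vector for one document.

    Tokens that are not in the vocabulary are ignored.
    """
    counts = Counter(document.lower().split())
    vector = [0] * len(vocabulary)
    for tok, n in counts.items():
        if tok in vocabulary:
            vector[vocabulary[tok]] = n
    return vector


# Hypothetical toy reviews (illustration only, not from the paper's data).
reviews = ["produto muito bom", "produto muito ruim", "bom bom atendimento"]
vocab = build_vocabulary(reviews)
vectors = [bag_of_words(review, vocab) for review in reviews]
```

Each review becomes a fixed-length vector with one position per vocabulary term, which a downstream binary classifier could consume. Word order is discarded, which is one reason sequence-aware models such as RNNs and TLMs can capture nuance that this representation misses.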

research
01/19/2020

Deep Learning for Hindi Text Classification: A Comparison

Natural Language Processing (NLP) and especially natural language text a...
research
07/10/2022

Myers-Briggs personality classification from social media text using pre-trained language models

In Natural Language Processing, the use of pre-trained language models h...
research
10/16/2017

Convolutional Neural Networks for Sentiment Classification on Business Reviews

Recently Convolutional Neural Networks (CNNs) models have proven remarka...
research
05/29/2021

Sentiment analysis in tweets: an assessment study from classical to modern text representation models

With the growth of social medias, such as Twitter, plenty of user-genera...
research
01/10/2022

BERT for Sentiment Analysis: Pre-trained and Fine-Tuned Alternatives

BERT has revolutionized the NLP field by enabling transfer learning with...
research
05/25/2020

Deep Learning Models for Automatic Summarization

Text summarization is an NLP task which aims to convert a textual docume...
research
05/17/2023

Are You Copying My Model? Protecting the Copyright of Large Language Models for EaaS via Backdoor Watermark

Large language models (LLMs) have demonstrated powerful capabilities in ...
