From Universal Language Model to Downstream Task: Improving RoBERTa-Based Vietnamese Hate Speech Detection

02/24/2021
by   Quang Huu Pham, et al.

Natural language processing is a fast-growing field of artificial intelligence. Since Google introduced the Transformer in 2017, a large number of language models, such as BERT and GPT, have been built on this architecture. These models were trained on huge datasets and achieved state-of-the-art results on natural language understanding tasks. However, fine-tuning a pre-trained language model on much smaller downstream datasets requires a carefully designed pipeline to mitigate problems such as scarce and imbalanced training data. In this paper, we propose a pipeline to adapt a general-purpose RoBERTa language model to a specific text classification task: Vietnamese Hate Speech Detection. We first tune PhoBERT, a RoBERTa model pre-trained on Vietnamese, by continuing its training on the masked language modeling task over our dataset; then, we employ its encoder for text classification. In order to preserve pre-trained weights while learning new feature representations, we further apply several training techniques: layer freezing, block-wise learning rates, and label smoothing. Our experiments show that the proposed pipeline boosts performance significantly, achieving a new state-of-the-art on the Vietnamese Hate Speech Detection campaign with an F1 score of 0.7221.
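Two of the training tricks mentioned above can be sketched in plain Python. This is only an illustrative sketch: the function names and hyperparameter values are assumptions, not the paper's; the paper applies these ideas when fine-tuning PhoBERT's encoder.

```python
def blockwise_learning_rates(base_lr, decay, num_layers):
    """Block-wise (discriminative) learning rates: the top encoder layer
    trains at base_lr, and each lower layer is scaled down by `decay`,
    so pre-trained weights in early layers change more slowly.
    Layer i (0 = bottom) gets base_lr * decay ** (num_layers - 1 - i)."""
    return [base_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]


def smooth_label(one_hot, epsilon):
    """Label smoothing: move a fraction epsilon of the probability mass
    away from the gold class and spread it uniformly over all classes,
    which discourages over-confident predictions on noisy labels."""
    k = len(one_hot)
    return [(1.0 - epsilon) * p + epsilon / k for p in one_hot]


# Example (illustrative values): a 12-layer encoder with base LR 2e-5
lrs = blockwise_learning_rates(2e-5, 0.9, 12)   # bottom layer gets the smallest LR
smoothed = smooth_label([0.0, 1.0], 0.1)        # binary hate/clean target
```

In a framework like PyTorch, the per-layer rates would typically be passed to the optimizer as parameter groups, and the smoothed targets used in place of hard labels in the cross-entropy loss; layer freezing simply means excluding the lowest layers' parameters from the optimizer for the first epochs.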

