NLP-LTU at SemEval-2023 Task 10: The Impact of Data Augmentation and Semi-Supervised Learning Techniques on Text Classification Performance on an Imbalanced Dataset

by Sana Sabah Al-Azzawi, et al.

In this paper, we propose a methodology for Task 10 of SemEval-2023, which focuses on detecting and classifying online sexism in social media posts. The task tackles a serious issue, as detecting harmful content on social media platforms is crucial for mitigating the harm these posts cause to users. Our solution is based on an ensemble of fine-tuned transformer-based models (BERTweet, RoBERTa, and DeBERTa). To alleviate problems related to class imbalance and to improve the generalization capability of our model, we also experiment with data augmentation and semi-supervised learning. For data augmentation, we use back-translation, either on all classes or on the underrepresented classes only, and we analyze the impact of these strategies on the overall performance of the pipeline through extensive experiments. For semi-supervised learning, we found that when a substantial amount of unlabelled, in-domain data is available, it can enhance the performance of certain models. Our proposed method (for which the source code is available on GitHub) attains an F1-score of 0.8613 for sub-task A, which ranked us 10th in the competition.
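As a rough illustration of two ideas named in the abstract, the sketch below augments only the underrepresented classes and then combines several models' outputs. It is a minimal sketch under stated assumptions, not the authors' implementation: `paraphrase` is a hypothetical callable standing in for a back-translation system, and averaging class probabilities (soft voting) is an assumed way of forming the ensemble, since the abstract does not say how predictions are combined.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple


def augment_minority_classes(
    texts: Sequence[str],
    labels: Sequence[int],
    paraphrase: Callable[[str], str],
) -> Tuple[List[str], List[int]]:
    """Append at most one paraphrased copy per minority-class example,
    never growing a class past the size of the largest class."""
    out_texts, out_labels = list(texts), list(labels)
    if not out_labels:
        return out_texts, out_labels
    counts = Counter(labels)
    majority = max(counts.values())
    # Number of synthetic examples each class may still receive.
    budget = {cls: majority - n for cls, n in counts.items()}
    for text, label in zip(texts, labels):
        if budget[label] > 0:
            # `paraphrase` is a placeholder for back-translation (assumption).
            out_texts.append(paraphrase(text))
            out_labels.append(label)
            budget[label] -= 1
    return out_texts, out_labels


def soft_vote(prob_sets: Sequence[Sequence[Sequence[float]]]) -> List[int]:
    """Average per-model class probabilities and pick the top class.

    prob_sets[m][i][c] is model m's probability of class c for example i.
    Soft voting is an assumed ensembling rule, not taken from the paper.
    """
    n_models = len(prob_sets)
    predictions = []
    for per_model in zip(*prob_sets):  # one tuple of model outputs per example
        n_classes = len(per_model[0])
        avg = [sum(p[c] for p in per_model) / n_models for c in range(n_classes)]
        predictions.append(max(range(n_classes), key=avg.__getitem__))
    return predictions
```

In practice `paraphrase` would wrap a translation round trip through a pivot language, and each entry of `prob_sets` would hold the softmax outputs of one fine-tuned model on the same evaluation examples.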


Skin lesion classification with ensemble of squeeze-and-excitation networks and semi-supervised learning

In this report, we introduce the outline of our system in Task 3: Diseas...

Augmentation Learning for Semi-Supervised Classification

Recently, a number of new Semi-Supervised Learning methods have emerged....

Augmenting Reddit Posts to Determine Wellness Dimensions impacting Mental Health

Amid ongoing health crisis, there is a growing necessity to discern poss...

Enhanced Offensive Language Detection Through Data Augmentation

Detecting offensive language on social media is an important task. The I...

Challenges in leveraging GANs for few-shot data augmentation

In this paper, we explore the use of GAN-based few-shot data augmentatio...

IITK@Detox at SemEval-2021 Task 5: Semi-Supervised Learning and Dice Loss for Toxic Spans Detection

In this work, we present our approach and findings for SemEval-2021 Task...
