Retrieval-based Text Selection for Addressing Class-Imbalanced Data in Classification

07/27/2023
by Sareh Ahmadi, et al.

This paper addresses the problem of selecting a set of texts for annotation in text classification using retrieval methods when the number of annotations is limited by constraints on human resources. An additional challenge is dealing with binary categories that have a small number of positive instances, reflecting severe class imbalance. In our setting, where annotation occurs over a long time period, texts can be selected for annotation in batches, with previous annotations guiding the choice of the next set. To address these challenges, the paper proposes leveraging SHAP to construct a quality set of queries for Elasticsearch and semantic search, in order to identify sets of texts for annotation that will help with class imbalance. The approach is tested on sets of cue texts describing possible future events, constructed by participants in studies aimed at helping with the management of obesity and diabetes. We introduce an effective method for selecting a small set of texts for annotation and building high-quality classifiers, integrating vector search, semantic search, and machine learning classifiers. Our experiments demonstrate improved F1 scores for the minority classes in binary classification.
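The batch-wise idea in the abstract (use existing annotations to build queries, then retrieve the next texts to annotate) can be illustrated with a small sketch. This is not the authors' implementation: a smoothed log-odds score stands in for SHAP-based term importance, and simple term overlap stands in for Elasticsearch and semantic search. The function names `query_terms` and `retrieve`, and the example texts, are hypothetical and invented for illustration only.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; a real pipeline would use a proper analyzer.
    return text.lower().split()


def query_terms(labeled, k=3):
    """Pick k terms most associated with the positive (minority) class.

    ASSUMPTION: smoothed log-odds of document frequency is used here as a
    simple stand-in for SHAP-based term importance.
    """
    pos, neg = Counter(), Counter()
    n_pos = n_neg = 0
    for text, label in labeled:
        terms = set(tokenize(text))
        if label == 1:
            pos.update(terms)
            n_pos += 1
        else:
            neg.update(terms)
            n_neg += 1
    scores = {}
    for term in set(pos) | set(neg):
        p = (pos[term] + 1) / (n_pos + 2)  # add-one smoothing
        q = (neg[term] + 1) / (n_neg + 2)
        scores[term] = math.log(p / q)
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


def retrieve(pool, terms, batch_size):
    """Return up to batch_size unlabeled texts matching the query terms.

    ASSUMPTION: ranking by number of matching terms stands in for a
    search-engine relevance score.
    """
    wanted = set(terms)
    scored = []
    for index, text in enumerate(pool):
        hits = len(set(tokenize(text)) & wanted)
        if hits > 0:
            scored.append((-hits, index, text))
    scored.sort()
    return [text for _, _, text in scored[:batch_size]]


# Invented toy data: label 1 marks the rare positive class.
labeled = [
    ("skip dessert party", 1),
    ("walk after dinner tonight", 0),
    ("party snacks office", 1),
    ("call sister tonight", 0),
]
pool = [
    "birthday party next week",
    "read a book tonight",
    "office party on friday",
    "watch a movie",
]

terms = query_terms(labeled, k=3)
batch = retrieve(pool, terms, batch_size=2)
print(terms)
print(batch)
```

In a batch setting, the texts in `batch` would be sent to annotators, appended to `labeled`, and the two steps repeated, so that each round of annotations guides the next selection toward likely positive instances.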


