A Comparison of Approaches for Imbalanced Classification Problems in the Context of Retrieving Relevant Documents for an Analysis

05/03/2022
by Sandra Wankmüller, et al.

One of the first steps in many text-based social science studies is to retrieve the documents that are relevant for the analysis from large corpora of otherwise irrelevant documents. The conventional approach in social science is to apply a set of keywords and to treat every document that contains at least one of the keywords as relevant. But applying incomplete keyword lists risks drawing biased inferences. More complex and costly methods, such as query expansion techniques, topic model-based classification rules, and active as well as passive supervised learning, have the potential to separate relevant from irrelevant documents more accurately and thereby reduce the potential bias. Yet whether these more expensive approaches improve retrieval performance over keyword lists at all, and if so by how much, is unclear because a comparison of these approaches is lacking. This study closes this gap by comparing these methods across three retrieval tasks associated with a data set of German tweets (Linder, 2017), the Social Bias Inference Corpus (SBIC) (Sap et al., 2020), and the Reuters-21578 corpus (Lewis, 1997). Results show that, in most of the studied settings, query expansion techniques and topic model-based classification rules tend to decrease rather than increase retrieval performance. Active supervised learning, however, reaches a substantially higher retrieval performance than keyword lists when it is applied to a sufficiently large set of labeled training instances (e.g. 1,000 documents).
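The keyword-list baseline described above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy documents, the keyword list, the relevance labels, the helper names `keyword_retrieve` and `precision_recall_f1`, and the choice of whole-word, case-insensitive matching are all assumptions made for illustration. The abstract also does not say which metric measures retrieval performance, so precision, recall and F1 are assumed here.

```python
import re


def keyword_retrieve(documents, keywords):
    """Return the indices of documents that contain at least one keyword."""
    # Assumption: whole-word, case-insensitive matching.
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b",
        re.IGNORECASE,
    )
    return {i for i, doc in enumerate(documents) if pattern.search(doc)}


def precision_recall_f1(retrieved, relevant):
    """Compare a retrieved index set against the truly relevant index set."""
    true_pos = len(retrieved & relevant)
    precision = true_pos / len(retrieved) if retrieved else 0.0
    recall = true_pos / len(relevant) if relevant else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical toy corpus; labels are invented for illustration only.
docs = [
    "New climate policy announced today",               # relevant, keyword hit
    "Global warming is accelerating, scientists warn",  # relevant, missed
    "The political climate in the office is tense",     # irrelevant, keyword hit
    "Great football match last night",                  # irrelevant, no hit
    "Carbon emissions fell last year",                  # relevant, keyword hit
]
relevant = {0, 1, 4}
keywords = ["climate", "emissions"]  # deliberately incomplete list

retrieved = keyword_retrieve(docs, keywords)
precision, recall, f1 = precision_recall_f1(retrieved, relevant)
print(sorted(retrieved), round(precision, 3), round(recall, 3), round(f1, 3))
```

In this toy case the incomplete list misses one relevant document (no keyword covers "global warming") and retrieves one irrelevant document (a different sense of "climate"), which are the two kinds of error that an incomplete keyword list can introduce.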


