The impact of imbalanced training data on machine learning for author name disambiguation

07/30/2018
by   Jinseok Kim, et al.

In supervised machine learning for author name disambiguation, negative training data often far outnumber positive training data. This paper examines how the ratio of negative to positive training data affects the performance of machine learning algorithms in disambiguating author names in bibliographic records. On multiple labeled datasets, three classifiers (Logistic Regression, Naïve Bayes, and Random Forest) are trained on representative features, such as coauthor names and title words, extracted from the same training data but with various positive-negative training data ratios. Results show that increasing negative training data can improve disambiguation performance, but only by a few percent, and sometimes degrades it. Logistic Regression and Naïve Bayes learn optimal disambiguation models even with a base ratio (1:1) of positive to negative training data. The performance improvement of Random Forest tends to saturate quickly, at a ratio of roughly 1:10 to 1:15. These findings imply that, contrary to the common practice of using all training data, name disambiguation algorithms can be trained on only part of the negative training data without much loss of disambiguation performance and with a gain in computational efficiency. This study calls for more attention from author name disambiguation scholars to methods for machine learning from imbalanced data.
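The idea of training on the same positive data with varying amounts of negative data can be illustrated with a minimal sketch. It is not the authors' code: the function name `subsample_negatives`, the toy pair identifiers, and the use of random undersampling are all assumptions made for illustration. Labels are assumed to be 1 for a pair of records by the same author (positive) and 0 otherwise (negative).

```python
import random

def subsample_negatives(pairs, labels, neg_per_pos, seed=0):
    """Keep every positive pair and at most neg_per_pos negatives per positive.

    Hypothetical helper: random undersampling is one simple way to build
    a 1:neg_per_pos training set; the paper may sample differently.
    """
    rng = random.Random(seed)
    pos = [p for p, y in zip(pairs, labels) if y == 1]
    neg = [p for p, y in zip(pairs, labels) if y == 0]
    # Never request more negatives than are available.
    n_keep = min(len(neg), neg_per_pos * len(pos))
    kept_neg = rng.sample(neg, n_keep)
    data = [(p, 1) for p in pos] + [(p, 0) for p in kept_neg]
    rng.shuffle(data)
    return data

# Toy data (invented): 10 positive and 500 negative record pairs.
pairs = [f"pair_{i}" for i in range(510)]
labels = [1] * 10 + [0] * 500

for ratio in (1, 5, 10, 15):
    train = subsample_negatives(pairs, labels, ratio)
    n_pos = sum(y for _, y in train)
    n_neg = len(train) - n_pos
    print(f"1:{ratio} -> {n_pos} positives, {n_neg} negatives")
```

Each resulting training set could then be passed to a classifier of choice and evaluated on a fixed test set, so that only the negative-to-positive ratio changes between runs.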


