The Random Forest Classifier in WEKA: Discussion and New Developments for Imbalanced Data

12/19/2018
by Mario Amrehn, et al.

Data analysis and machine learning have become an integral part of modern scientific methodology, providing automated techniques for predicting further information from observations. One such classification and regression technique is the random forest approach. These decision-tree-based predictors are best known for their good computational performance and scalability. However, when training data is severely imbalanced, as is often the case in medical study data with large control groups, the training algorithm or the sampling process has to be altered to improve prediction quality for minority classes. In this work, a balanced random forest approach for WEKA is proposed. Furthermore, the prediction quality of the unmodified random forest implementation and of the new balanced random forest version for WEKA is evaluated against reference implementations in R. Two-class problems on balanced data sets and on imbalanced medical study data are investigated. For imbalanced data, the proposed method is shown to achieve superior prediction quality compared to the other three techniques.
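To illustrate what altering the sampling process can look like, the following is a minimal sketch of one common balancing strategy: each tree's bootstrap sample draws the same number of instances from every class, limited by the size of the smallest class. This is an assumption made for illustration and is not necessarily the scheme implemented by the authors in WEKA; the function name `balanced_bootstrap` and the toy labels are hypothetical.

```python
import random
from collections import defaultdict


def balanced_bootstrap(labels, rng):
    """Return instance indices for one balanced bootstrap sample.

    Draws, with replacement, as many indices from every class as the
    smallest class contains, so a tree trained on the sample sees
    equal class counts. (Illustrative assumption, not the paper's code.)
    """
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    # The minority class size determines the per-class sample size.
    n_per_class = min(len(indices) for indices in by_class.values())

    sample = []
    for indices in by_class.values():
        sample.extend(rng.choices(indices, k=n_per_class))
    rng.shuffle(sample)
    return sample


# Hypothetical toy data: 95 controls (label 0) and 5 cases (label 1).
labels = [0] * 95 + [1] * 5
sample = balanced_bootstrap(labels, random.Random(0))
counts = {c: sum(1 for i in sample if labels[i] == c) for c in (0, 1)}
print(counts)  # {0: 5, 1: 5}
```

In a forest, one such sample would be drawn per tree, so that across many trees most majority-class instances are still used while no single tree is dominated by the control group.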
