Scaling associative classification for very large datasets

05/10/2018
by Luca Venturini, et al.

Supervised learning algorithms now scale to very large datasets by leveraging in-memory cluster-computing Big Data frameworks. Still, massive datasets with several large-domain categorical features remain a difficult challenge for any classifier, and most off-the-shelf solutions cannot cope with them. In this work we introduce DAC, a Distributed Associative Classifier. DAC exploits ensemble learning to distribute the training of an associative classifier among parallel workers and to improve the final quality of the model. It also adopts several novel techniques to reach high scalability without sacrificing quality, including a preventive pruning of classification rules, based on Gini impurity, during the extraction phase. We ran experiments on Apache Spark, on a real large-scale dataset with more than 4 billion records and 800 million distinct categories. The results show that DAC improves on a state-of-the-art solution in both prediction quality and execution time. Because the generated model is human-readable, it can not only classify new records but also expose both the logic behind each prediction and the properties of the model, making it a useful aid for decision makers.
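To make the Gini-based pruning idea concrete, here is a minimal sketch in Python. It computes the Gini impurity of the class labels covered by a candidate rule and applies a simple test that keeps a longer rule only if it is purer than the shorter rule it extends. The function names, the `min_gain` parameter, and the pruning criterion itself are assumptions chosen for illustration; the abstract does not specify the exact test DAC applies.

```python
from collections import Counter
from typing import Hashable, Iterable


def gini_impurity(labels: Iterable[Hashable]) -> float:
    """Gini impurity 1 - sum_c p_c^2 of the class labels covered by a rule."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # convention for an empty cover
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


def keep_extension(parent_labels, child_labels, min_gain: float = 0.0) -> bool:
    """Hypothetical pruning test (an assumption, not DAC's documented rule):
    keep the extended rule only if it lowers impurity by more than min_gain."""
    gain = gini_impurity(parent_labels) - gini_impurity(child_labels)
    return gain > min_gain


# Labels of the records matched by a short rule and by two candidate extensions.
parent = ["click", "click", "no_click", "no_click"]
purer_child = ["click", "click"]
same_child = ["click", "no_click"]

print(gini_impurity(parent))                # impurity of a balanced two-class cover
print(keep_extension(parent, purer_child))  # extension that isolates one class
print(keep_extension(parent, same_child))   # extension that is no purer
```

Pruning at extraction time, rather than after all rules are mined, is what lets a test like this reduce the number of candidate rules each worker has to materialize.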


