Scalable machine learning in the R language using a summarization matrix

Big data analytics generally relies on parallel processing in large computer clusters. However, this approach is not always the best. CPU speed and RAM capacity keep growing, making small computers faster and more attractive to the analyst. Machine Learning (ML) models are generally computed on a data set obtained by aggregating, transforming and filtering big data, and this data set is orders of magnitude smaller than the raw data. Users prefer “easy” high-level languages like R and Python, which accomplish complex analytic tasks with a few lines of code, but these languages have memory and speed limitations. Finally, data summarization has long been a fundamental technique in data mining and holds great promise for big data. With that motivation in mind, we adapt the Γ (Gamma) summarization matrix, previously used in parallel DBMSs, to work in the R language. Γ is significantly smaller than the data set, but captures its fundamental statistical properties. Γ works well for a remarkably wide spectrum of ML models, including supervised and unsupervised models, under the assumption that dimensions (variables) are either dependent or independent. An extensive experimental evaluation shows that models computed on summarized data sets are accurate and that their computation is significantly faster than with R built-in functions. Moreover, experiments illustrate that our R solution is faster and less resource-hungry than competing parallel systems, including a parallel DBMS and Spark.
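
To make the idea of a summary that is "significantly smaller than the data set, but captures fundamental statistical properties" concrete, the following is a minimal sketch in Python. It rests on an assumption that the abstract does not spell out: that the summary holds the row count n, the per-dimension sums L, and the sums of cross-products Q. The function names `summarize` and `mean_and_covariance` are hypothetical and are not the paper's API.

```python
# Illustrative sketch only. Assumes the summary consists of n (row count),
# L (per-dimension sums) and Q (sums of cross-products); the abstract does
# not state the exact contents of the Gamma matrix.

def summarize(rows):
    """Scan the rows once and return the summary (n, L, Q)."""
    d = len(rows[0])
    n = 0
    L = [0.0] * d
    Q = [[0.0] * d for _ in range(d)]
    for x in rows:
        n += 1
        for a in range(d):
            L[a] += x[a]
            for b in range(d):
                Q[a][b] += x[a] * x[b]
    return n, L, Q


def mean_and_covariance(n, L, Q):
    """Derive the mean vector and population covariance from the summary alone."""
    d = len(L)
    mu = [L[a] / n for a in range(d)]
    cov = [[Q[a][b] / n - mu[a] * mu[b] for b in range(d)] for a in range(d)]
    return mu, cov


# Tiny hypothetical data set: 4 rows, 2 dimensions.
rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
n, L, Q = summarize(rows)
mu, cov = mean_and_covariance(n, L, Q)
```

Under this assumption the summary needs 1 + d + d² numbers regardless of how many rows are scanned, and the second function never touches the original rows. This is the sense in which a model could be computed from the summary rather than from the data set.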


