A General Summarization Matrix for Scalable Machine Learning Model Computation in the R Language

Data analysis is an essential task in research. Modern datasets contain a high volume of data and may require a parallel DBMS, a Hadoop stack, or a parallel cluster to analyze them. We propose an alternative to these methods: using a lightweight language/system like R to compute Machine Learning models on such datasets. This approach eliminates the need for cluster/parallel systems in most cases and thus makes the functionality accessible to an average user on a single machine. Specifically, we aim to remove the physical-memory and speed limitations currently present in R packages when working on a single machine. R is a powerful language and is very popular for data analysis. However, R is slow, offers little flexibility for low-level modification, and the process of making it faster and more efficient is cumbersome. To address these drawbacks, we implement our approach in two phases. The first phase constructs a summarization matrix, Γ, from a one-time scan of the source dataset, and is implemented in C++ using the Rcpp package. The Γ matrix comes in two forms, Diagonal and Non-Diagonal Gamma, each of which is efficient for computing specific models. The second phase uses the constructed Γ matrix to compute Machine Learning models such as PCA, Linear Regression, Naïve Bayes, K-means, and similar models, and is implemented in R. We bundle the whole approach into an R package, titled Gamma.

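The abstract does not spell out the package's API, but the central idea, summarizing the dataset once into a small matrix Γ and deriving models from it without further data scans, can be sketched in a few lines of base R. The snippet below is a minimal illustration only: it assumes the Non-Diagonal Gamma is the standard Z^T Z summarization of the augmented matrix Z = [1, X, y], and the function names build_gamma, gamma_lr, and gamma_pca are hypothetical, not the actual exports of the Gamma package, which performs the scan in C++ via Rcpp rather than with base R's crossprod.

## Minimal sketch in base R; illustrative only, not the Gamma package API.
build_gamma <- function(X, y) {
  # Augment the data: Z = [1, X, y]. Then Gamma = t(Z) %*% Z holds the
  # count n, the linear sums L, and the quadratic sums Q in one small matrix.
  Z <- cbind(1, as.matrix(X), y)
  crossprod(Z)                      # t(Z) %*% Z, computed in a single pass
}

gamma_lr <- function(Gamma) {
  # Linear regression: beta = (X'^T X')^{-1} X'^T y with X' = [1, X].
  # Both blocks are sub-matrices of Gamma, so no further data scan is needed.
  p <- nrow(Gamma) - 1              # number of columns of [1, X]
  solve(Gamma[1:p, 1:p], Gamma[1:p, p + 1])
}

gamma_pca <- function(Gamma) {
  # PCA from the (population) covariance matrix Q/n - (L/n)(L/n)^T.
  p <- nrow(Gamma) - 1
  n <- Gamma[1, 1]
  L <- Gamma[1, 2:p]                # column sums of X
  Q <- Gamma[2:p, 2:p]              # X^T X
  Sigma <- Q / n - tcrossprod(L / n)
  eigen(Sigma, symmetric = TRUE)
}

## Example on a small synthetic dataset
set.seed(1)
X <- matrix(rnorm(1000 * 3), ncol = 3)
y <- X %*% c(2, -1, 0.5) + rnorm(1000, sd = 0.1)
G <- build_gamma(X, y)
gamma_lr(G)                         # approximately c(0, 2, -1, 0.5)
gamma_pca(G)$values                 # eigenvalues of the covariance matrix

The point of the design is that after the single scan, every model in the sketch touches only the (d + 2) × (d + 2) matrix Γ, where d is the number of features, which is why the approach can sidestep the memory and speed limits of holding and repeatedly scanning the full dataset inside R.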