Extending the R Language with a Scalable Matrix Summarization Operator

10/12/2021
by   Siva Uday Sampreeth Chebolu, et al.

Analysts prefer simple interpreted languages for programming their computations; prominent examples include R, Python, and Matlab. At the same time, analysts want to compute mathematical models as fast as possible, especially on large data sets. Data summarization remains a fundamental technique for accelerating machine learning computations. Motivated by this, we propose a novel summarization mechanism computed via a single matrix multiplication in the R statistical language. We show that our summarization benefits a large family of linear models, including Linear Regression, PCA, and Naive Bayes. We present a subsystem that exploits summarization by detecting Gramian matrix products in R. We optimize existing R source code by overriding R's internal matrix multiplication algorithm with ours. Our solution can be plugged into R to solve problems in which a similar matrix multiplication appears, much faster and without RAM limitations. Moreover, our solution can benefit from parallel processing of the summarization matrix. We present an experimental validation showing that our subsystem incurs little overhead, since it works on source code, while running much faster than the built-in R functions. To round out our comparisons, we also compare our subsystem with Spark on parallel machines. We assume that the data can reside in HDFS, on disk, or be already partitioned. Our solution outperforms Spark in most cases, showing that it can also compete in the big data space.
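To make the idea of a summary computed by one matrix product concrete, here is a minimal sketch, not the authors' implementation. It assumes the summary is the Gramian Z^T Z of the data augmented with a constant column, Z = [1, X, y]; the abstract does not spell out this construction, so treat it as an assumption. The function names (gramian_summary, merge, regression_from_summary) are hypothetical, and plain Python is used instead of R only to keep the example self-contained.

```python
# Illustrative sketch only: all function names are hypothetical, and the
# augmented layout Z = [1, X, y] is an assumption, not taken from the abstract.

def gramian_summary(rows, width):
    """Accumulate G = Z^T Z, where each row of Z is [1, x_1, ..., x_d, y]."""
    g = [[0.0] * width for _ in range(width)]
    for row in rows:
        z = [1.0] + list(row)           # prepend the constant column
        for i in range(width):
            for j in range(width):
                g[i][j] += z[i] * z[j]
    return g


def merge(g1, g2):
    """Partial summaries from separate chunks combine by plain addition."""
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(g1, g2)]


def solve(a, b):
    """Solve a * w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (m[i][n] - s) / m[i][i]
    return w


def regression_from_summary(g):
    """Least-squares coefficients (intercept first) from the summary alone."""
    k = len(g) - 1                      # index of the y column
    xtx = [row[:k] for row in g[:k]]    # [1, X]^T [1, X]
    xty = [row[k] for row in g[:k]]     # [1, X]^T y
    return solve(xtx, xty)


# Toy data: rows are (x1, x2, y) with y = 1 + 2*x1 + 3*x2 exactly.
data = [(0.0, 0.0, 1.0), (1.0, 0.0, 3.0), (0.0, 1.0, 4.0),
        (1.0, 1.0, 6.0), (2.0, 1.0, 8.0), (1.0, 3.0, 12.0)]

whole = gramian_summary(data, 4)
# Two chunks summarized independently, then merged by addition.
chunked = merge(gramian_summary(data[:3], 4), gramian_summary(data[3:], 4))
coef = regression_from_summary(chunked)
```

Under this assumption the raw rows are read once, and the model is then fitted from a small (d+2) x (d+2) matrix. Because chunk summaries simply add, chunks can be processed one at a time or on separate workers, which is one plausible reading of how such a summary sidesteps RAM limits and parallelizes.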

