Extending the R Language with a Scalable Matrix Summarization Operator

by Siva Uday Sampreeth Chebolu, et al.

Analysts prefer simple interpreted languages for programming their computations; prominent examples include R, Python, and Matlab. At the same time, analysts want to compute mathematical models as fast as possible, especially on large data sets. Data summarization remains a fundamental technique for accelerating machine learning computations. With this motivation, we propose a novel summarization mechanism, computed via a single matrix multiplication, for the statistical language R. We show that our summarization benefits a large family of linear models, including Linear Regression, PCA, and Naive Bayes. We present a subsystem that exploits summarization by detecting Gramian matrix products in R. We optimize existing R source code by overriding R's internal matrix multiplication algorithm with ours. Our solution can be plugged into R to speed up any computation in which a similar matrix multiplication appears, and it does so without RAM limitations. Moreover, our solution can benefit from computing the summarization matrix in parallel. We present an experimental validation showing that our subsystem incurs little overhead, since it works on source code, while running much faster than the built-in R functions. To round out our comparisons, we also compare our subsystem with Spark on parallel machines. We assume that the data can reside in HDFS, on disk, or be already partitioned. Our solution outperforms Spark in most cases, showing that it can also compete in the big data space.
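To make the idea of a summarization "computed via a single matrix multiplication" concrete, the following is a sketch under our own assumptions, not the paper's exact definition. The notation (an n-by-d data matrix X with one observation per row, a response vector y, the augmented matrix Z, and the name Γ for its Gramian) is introduced here for illustration only and does not appear in the abstract.

```latex
% Assumed notation: X is n x d (rows are observations), y is n x 1,
% and \mathbf{1} is the n x 1 vector of ones.
\[
Z = \begin{bmatrix} \mathbf{1} & X & y \end{bmatrix} \in \mathbb{R}^{n \times (d+2)},
\qquad
\Gamma = Z^{\top} Z =
\begin{bmatrix}
n                     & \mathbf{1}^{\top} X & \mathbf{1}^{\top} y \\
X^{\top} \mathbf{1}   & X^{\top} X          & X^{\top} y          \\
y^{\top} \mathbf{1}   & y^{\top} X          & y^{\top} y
\end{bmatrix}
= \sum_{i=1}^{n} z_i z_i^{\top},
\]
% where z_i^T is the i-th row of Z.

% Linear regression with an intercept uses only blocks of \Gamma.
% With \tilde{X} = [\mathbf{1} \; X] and \tilde{X}^T \tilde{X} invertible:
\[
\hat{\beta} = \bigl(\tilde{X}^{\top} \tilde{X}\bigr)^{-1} \tilde{X}^{\top} y,
\qquad
\tilde{X}^{\top} \tilde{X} = \Gamma_{1:d+1,\; 1:d+1},
\qquad
\tilde{X}^{\top} y = \Gamma_{1:d+1,\; d+2}.
\]
```

Under these assumptions Γ is only (d+2)-by-(d+2), however large n is. Because it is a sum over rows, it can be accumulated one block of rows at a time, and partial sums computed on separate partitions can simply be added, which is consistent with the abstract's claims about avoiding RAM limits and exploiting parallelism.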




