In-Database Regression in Input Sparsity Time

07/12/2021
by Rajesh Jayaram, et al.

Sketching is a powerful dimensionality reduction technique for accelerating algorithms for data analysis. A crucial step in sketching methods is to compute a subspace embedding (SE) for a large matrix 𝐀 ∈ ℝ^{N × d}. SEs are the primary tool for obtaining extremely efficient solutions to many linear-algebraic tasks, such as least squares regression and low rank approximation. Computing an SE often requires an explicit representation of 𝐀 and running time proportional to the size of 𝐀. However, if 𝐀 = 𝐓_1 ⋈ 𝐓_2 ⋈ ⋯ ⋈ 𝐓_m is the result of a database join query on several smaller tables 𝐓_i ∈ ℝ^{n_i × d_i}, then this running time can be prohibitive, as 𝐀 itself can have as many as O(n_1 n_2 ⋯ n_m) rows. In this work, we design subspace embeddings for database joins which can be computed significantly faster than computing the join. For the case of a two-table join 𝐀 = 𝐓_1 ⋈ 𝐓_2, we give input-sparsity algorithms for computing subspace embeddings, with running time bounded by the number of non-zero entries in 𝐓_1 and 𝐓_2. This results in input-sparsity time algorithms for high-accuracy regression, significantly improving upon the running time of prior FAQ-based methods for regression. We extend our results to arbitrary joins for the ridge regression problem, also considerably improving the running time of prior methods. Empirically, we apply our method to real datasets and show that it is significantly faster than existing algorithms.
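
As a rough illustration of the sketch-and-solve idea the abstract builds on, the following minimal sketch applies a CountSketch-style embedding to an explicitly materialized matrix and solves the smaller regression problem. It is not the paper's join algorithm, which avoids forming 𝐀 at all. The function name `countsketch`, the sizes `N`, `d`, `k`, and the synthetic data are assumptions chosen only for this example.

```python
import numpy as np

def countsketch(A, k, rng):
    """Apply a k x N CountSketch matrix S to A without forming S.

    Each row of A is hashed to one of k buckets and multiplied by a
    random sign, so S has exactly one nonzero entry per column.
    (Hypothetical helper written for this illustration.)
    """
    N = A.shape[0]
    rows = rng.integers(0, k, size=N)          # bucket for each row of A
    signs = rng.choice([-1.0, 1.0], size=N)    # random sign for each row of A
    SA = np.zeros((k, A.shape[1]))
    np.add.at(SA, rows, signs[:, None] * A)    # accumulate signed rows per bucket
    return SA

rng = np.random.default_rng(0)
N, d, k = 20000, 5, 2000                       # assumed sizes, with k << N
A = rng.standard_normal((N, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(N)

# Sketch [A | b] together so the same S is applied to both.
S_Ab = countsketch(np.column_stack([A, b]), k, rng)

# Solve the small k x d problem, and the full problem for comparison.
x_sketch, *_ = np.linalg.lstsq(S_Ab[:, :d], S_Ab[:, d], rcond=None)
x_exact, *_ = np.linalg.lstsq(A, b, rcond=None)

# Compare residuals on the original (unsketched) problem.
res_sketch = np.linalg.norm(A @ x_sketch - b)
res_exact = np.linalg.norm(A @ x_exact - b)
ratio = res_sketch / res_exact
print(f"residual ratio (sketched / exact): {ratio:.4f}")
```

Each row of the input is touched once, which is why this style of embedding is associated with input-sparsity running time when the input is stored sparsely. The dense NumPy version here only demonstrates the accuracy side: the sketched solution's residual stays close to the optimal one.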


research
09/29/2019

Optimal Sketching for Kronecker Product Regression and Low Rank Approximation

We study the Kronecker product regression problem, in which the design m...
research
09/27/2019

Total Least Squares Regression in Input Sparsity Time

In the total least squares problem, one is given an m × n matrix A, and ...
research
12/12/2019

Sublinear Time Numerical Linear Algebra for Structured Matrices

We show how to solve a number of problems in numerical linear algebra, s...
research
12/08/2015

Fast spectral algorithms from sum-of-squares proofs: tensor decomposition and planted sparse vectors

We consider two problems that arise in machine learning applications: th...
research
11/02/2020

Coresets for Regressions with Panel Data

This paper introduces the problem of coresets for regression problems to...
research
11/18/2021

Efficiently Transforming Tables for Joinability

Data from different sources rarely conform to a single formatting even i...
research
04/01/2020

Computational Performance of a Germline Variant Calling Pipeline for Next Generation Sequencing

With the booming of next generation sequencing technology and its implem...
