Statistical scalability and approximate inference in distributed computing environments

12/31/2021
by Aritra Chakravorty, et al.

Harnessing distributed computing environments to build scalable inference algorithms for very large data sets is a core challenge across the mathematical sciences. Here we provide a theoretical framework for doing so, along with fully implemented examples of scalable algorithms with performance guarantees. We begin by formalizing the class of statistics that admit straightforward calculation in such environments through independent parallelization. We then show how to use such statistics to approximate arbitrary functional operators, thereby providing practitioners with a generic approximate inference procedure that does not require the data to reside entirely in memory. We characterize the L^2 approximation properties of our approach, and then use it to treat two canonical examples that arise in large-scale statistical analyses: sample quantile calculation and local polynomial regression. A variety of avenues and extensions remain open for future work.
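To make the idea of independent parallelization concrete, the following is a minimal sketch and not the authors' algorithm. It assumes a simple additive summary (count, sum, sum of squares, and fixed-bin histogram counts) that each worker can compute on its own chunk and that can be merged by elementwise addition. The function names `chunk_summary`, `merge`, and `approx_quantile`, and the choice of fixed bin edges, are illustrative assumptions rather than anything taken from the paper.

```python
from bisect import bisect_right
from functools import reduce


def chunk_summary(chunk, edges):
    """Additive summary of one chunk: (n, sum, sum of squares, bin counts).

    Assumption: bins are fixed in advance by `edges`; values outside the
    range are clamped into the first or last bin.
    """
    k = len(edges) - 1
    counts = [0] * k
    n, s, ss = 0, 0.0, 0.0
    for x in chunk:
        n += 1
        s += x
        ss += x * x
        i = min(max(bisect_right(edges, x) - 1, 0), k - 1)
        counts[i] += 1
    return n, s, ss, counts


def merge(a, b):
    """Combine two summaries by elementwise addition (order does not matter)."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2],
            [ca + cb for ca, cb in zip(a[3], b[3])])


def approx_quantile(counts, edges, q):
    """Approximate the q-th quantile by linear interpolation within a bin."""
    target = q * sum(counts)
    cum = 0
    for i, c in enumerate(counts):
        if c > 0 and cum + c >= target:
            frac = (target - cum) / c
            return edges[i] + frac * (edges[i + 1] - edges[i])
        cum += c
    return edges[-1]


# Hypothetical data split across three workers.
chunks = [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]]
edges = list(range(11))  # bins [0,1), [1,2), ..., [9,10]

# Each summary could be computed on a separate machine; only the small
# summaries need to be brought together.
n, s, ss, counts = reduce(merge, (chunk_summary(c, edges) for c in chunks))
mean = s / n
median_estimate = approx_quantile(counts, edges, 0.5)
```

Because the merge step is plain addition, the combined summary is the same as the one a single pass over all of the data would produce, and the full data set never has to sit in one machine's memory. The quantile estimate is only as accurate as the bin width allows, which is the kind of approximation error a formal analysis would need to quantify.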


