Foundational principles for large scale inference: Illustrations through correlation mining

05/11/2015
by Alfred O. Hero, et al.

When can reliable inference be drawn in the "Big Data" context? This paper presents a framework for answering this fundamental question in the context of correlation mining, with implications for general large scale inference. In large scale data applications like genomics, connectomics, and eco-informatics, the dataset is often variable-rich but sample-starved: a regime where the number n of acquired samples (statistical replicates) is far smaller than the number p of observed variables (genes, neurons, voxels, or chemical constituents). Much recent work has focused on understanding the computational complexity of proposed methods for "Big Data." Sample complexity, however, has received relatively little attention, especially in the setting where the sample size n is fixed and the dimension p grows without bound. To address this gap, we develop a unified statistical framework that explicitly quantifies the sample complexity of various inferential tasks. Sampling regimes can be divided into three categories: 1) the classical asymptotic regime, where the variable dimension is fixed and the sample size goes to infinity; 2) the mixed asymptotic regime, where both variable dimension and sample size go to infinity at comparable rates; 3) the purely high dimensional asymptotic regime, where the variable dimension goes to infinity and the sample size is fixed. Each regime has its niche, but only the last regime applies to exa-scale data dimension. We illustrate this high dimensional framework for the problem of correlation mining, where the matrices of pairwise and partial correlations among the variables are of interest. We demonstrate various regimes of correlation mining based on the unifying perspective of high dimensional learning rates and sample complexity for different structured covariance models and different inference tasks.
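
To make the sample-starved regime concrete, the following is a minimal sketch, not the paper's method: it simulates independent variables and tracks the largest absolute sample correlation as p grows while n stays fixed. The i.i.d. Gaussian data model, the function names `max_abs_offdiag_correlation` and `spurious_correlation_curve`, and the values of n and p are all assumptions chosen for illustration.

```python
import numpy as np


def max_abs_offdiag_correlation(X):
    """Largest |sample correlation| over distinct pairs of columns of X."""
    # Columns are variables, rows are samples (hence rowvar=False).
    R = np.corrcoef(X, rowvar=False)
    # Zero the diagonal so self-correlations (always 1) are ignored.
    np.fill_diagonal(R, 0.0)
    return float(np.max(np.abs(R)))


def spurious_correlation_curve(n, dims, seed=0):
    """Max |correlation| among the first p variables, for each p in dims.

    Assumption (illustrative only): all variables are independent standard
    normals, so every true correlation is zero and any large sample
    correlation is spurious.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, max(dims)))
    # Nested subsets of columns: the maximum can only grow with p.
    return {p: max_abs_offdiag_correlation(X[:, :p]) for p in dims}


# Fixed sample size n = 10, growing dimension p (hypothetical values).
curve = spurious_correlation_curve(n=10, dims=[10, 100, 1000])
for p, r in curve.items():
    print(f"p = {p:5d}   max |sample correlation| = {r:.3f}")
```

Because the variable sets are nested, the reported maximum cannot decrease as p increases. With only n = 10 samples it typically approaches 1 for large p even though every true correlation is zero, which is one way to see why sample complexity matters when n is fixed and p grows.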


research
01/12/2021

A unified framework for correlation mining in ultra-high dimension

An important problem in large scale inference is the identification of v...
research
02/25/2022

On singular values of large dimensional lag-tau sample autocorrelation matrices

We study the limiting behavior of singular values of a lag-τ sample auto...
research
01/25/2021

Diffusion Asymptotics for Sequential Experiments

We propose a new diffusion-asymptotic analysis for sequentially randomiz...
research
11/11/2018

Swift Two-sample Test on High-dimensional Neural Spiking Data

To understand how neural networks process information, it is important t...
research
10/25/2021

Maximum Correntropy Criterion Regression models with tending-to-zero scale parameters

Maximum correntropy criterion regression (MCCR) models have been well st...
research
10/28/2019

Asymptotic Distributions of High-Dimensional Nonparametric Inference with Distance Correlation

Understanding the nonlinear association between a pair of potentially hi...
research
07/19/2020

Hypothesis tests for structured rank correlation matrices

Joint modeling of a large number of variables often requires dimension r...
