The Underlying Scaling Laws and Universal Statistical Structure of Complex Datasets

06/26/2023
by Noam Levi et al.

We study universal traits that emerge both in real-world complex datasets and in artificially generated ones. Our approach is to analogize data to a physical system and to employ tools from statistical physics and Random Matrix Theory (RMT) to reveal their underlying structure. We focus on the feature-feature covariance matrix, analyzing both its local and global eigenvalue statistics. Our main observations are: (i) the power-law scaling exhibited by the bulk of the eigenvalues differs vastly between uncorrelated random data and real-world data; (ii) this scaling behavior can be fully recovered by introducing long-range correlations into the synthetic data in a simple way; (iii) from the RMT perspective, both generated and real-world datasets lie in the same universality class, behaving as chaotic rather than integrable systems; (iv) the expected RMT statistical behavior already manifests in empirical covariance matrices at dataset sizes significantly smaller than those conventionally used for real-world training, and can be related to the number of samples required to approximate the population power-law scaling; (v) the Shannon entropy is correlated with the local RMT structure and the eigenvalue scaling; it is substantially smaller in strongly correlated datasets than in uncorrelated synthetic data, and fewer samples are required for it to reach the distribution entropy. These findings have numerous implications for characterizing the complexity of datasets, including differentiating synthetically generated from natural data, quantifying noise, developing better data-pruning methods, and classifying effective learning models via these scaling laws.
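The diagnostics mentioned in the abstract can be probed numerically. The sketch below is illustrative only and is not the paper's actual pipeline: it assumes Gaussian data, a hypothetical power-law population covariance with exponent alpha as a stand-in for "long-range correlations", and compares an uncorrelated dataset against a correlated one via three quantities the abstract names: the bulk eigenvalue power-law slope, the level-spacing ratio statistic used in RMT, and the Shannon entropy of the normalized spectrum.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 2000, 200

# Uncorrelated synthetic data: i.i.d. Gaussian features.
x_iid = rng.standard_normal((n_samples, n_features))

# Long-range-correlated synthetic data (hypothetical construction):
# draw i.i.d. noise and color it so the population covariance has
# power-law eigenvalues lambda_k ~ k^(-alpha).
alpha = 1.0
lam = np.arange(1, n_features + 1) ** (-alpha)
q, _ = np.linalg.qr(rng.standard_normal((n_features, n_features)))
sqrt_cov = q * np.sqrt(lam)  # columns of q scaled by sqrt(lambda_k)
x_corr = rng.standard_normal((n_samples, n_features)) @ sqrt_cov.T

def bulk_spectrum(x):
    """Eigenvalues of the empirical feature-feature covariance, descending."""
    x = x - x.mean(axis=0)
    cov = x.T @ x / x.shape[0]
    return np.sort(np.linalg.eigvalsh(cov))[::-1]

def powerlaw_slope(eigs, lo=5, hi=100):
    """Least-squares slope of log(lambda_k) vs log(k) over the bulk."""
    k = np.arange(lo, hi)
    return np.polyfit(np.log(k), np.log(eigs[lo:hi]), 1)[0]

def mean_spacing_ratio(eigs):
    """Mean r-statistic of consecutive level spacings; roughly 0.60 is
    expected for GOE (chaotic) and 0.39 for Poisson (integrable) levels."""
    s = np.diff(np.sort(eigs))
    r = np.minimum(s[1:], s[:-1]) / np.maximum(s[1:], s[:-1])
    return r.mean()

def spectral_entropy(eigs):
    """Shannon entropy of the eigenvalue distribution p_k = lambda_k / sum."""
    p = eigs / eigs.sum()
    return -(p * np.log(p)).sum()

results = {}
for name, x in [("iid", x_iid), ("correlated", x_corr)]:
    eigs = bulk_spectrum(x)
    results[name] = (powerlaw_slope(eigs),
                     mean_spacing_ratio(eigs),
                     spectral_entropy(eigs))
    print(name, results[name])
```

Under these assumptions the correlated dataset shows a steeper (more negative) bulk slope and a smaller spectral entropy than the i.i.d. one, mirroring observations (i) and (v); the r-statistic is a standard unfolding-free probe of the local statistics in (iii).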


Related research

10/30/2022 - A Solvable Model of Neural Scaling Laws
Large language models with a huge number of parameters, when trained on ...

02/08/2021 - Learning Curve Theory
Recently a number of empirical "universal" scaling law papers have been ...

09/22/2020 - Limiting laws for extreme eigenvalues of large-dimensional spiked Fisher matrices with a divergent number of spikes
Consider the p × p matrix that is the product of a population covariance ...

05/01/2018 - Intrinsic Complexity And Scaling Laws: From Random Fields to Random Vectors
Random fields are commonly used for modeling of spatially (or timely) de...

11/25/2021 - Cleaning the covariance matrix of strongly nonstationary systems with time-independent eigenvalues
We propose a data-driven way to clean covariance matrices in strongly no...

03/22/2016 - Completely random measures for modeling power laws in sparse graphs
Network data appear in a number of applications, such as online social n...

10/06/2022 - Conditional Feature Importance for Mixed Data
Despite the popularity of feature importance measures in interpretable m...
