Algorithmic Probability of Large Datasets and the Simplicity Bubble Problem in Machine Learning

12/22/2021
by Felipe S. Abrahão, et al.

When mining large datasets in order to predict new data, limitations of the principles behind statistical machine learning pose a serious challenge not only to the Big Data deluge, but also to the traditional assumption that data-generating processes are biased toward low algorithmic complexity. Even when one assumes an underlying algorithmic-informational bias toward simplicity in finite dataset generators, we show that fully automated computable learning algorithms, with or without access to pseudo-random generators, and in particular the statistical methods used in current approaches to machine learning (including deep learning), can always be deceived, naturally or artificially, by sufficiently large datasets. In particular, we demonstrate that, for every finite learning algorithm, there is a dataset size above which the algorithmic probability of an unpredictable deceiver is an upper bound (up to a multiplicative constant that depends only on the learning algorithm) on the algorithmic probability of any other larger dataset. In other words, a very large and complex dataset is as likely to deceive learning algorithms into a "simplicity bubble" as any other particular dataset. These deceiving datasets guarantee that any prediction will diverge from the high-algorithmic-complexity globally optimal solution while converging toward the low-algorithmic-complexity locally optimal one. We discuss the framework and the empirical conditions for circumventing this deceptive phenomenon, moving away from statistical machine learning toward a stronger type of machine learning based on, or motivated by, the intrinsic power of algorithmic information theory and computability theory.
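The deception described above can be mimicked in miniature. The sketch below is purely illustrative and is not the paper's formal construction: it assumes a learner with a hard simplicity bias, namely one that always extends the shortest periodic pattern consistent with the data seen so far (the helper names `simplest_period` and `predict_next` are hypothetical). A dataset whose generator departs from its long low-complexity prefix then deceives this learner at exactly the point of departure.

```python
def simplest_period(s: str) -> int:
    """Return the length of the shortest prefix whose repetition
    is consistent with the whole string s (a crude simplicity measure)."""
    for p in range(1, len(s) + 1):
        if all(s[i] == s[i % p] for i in range(len(s))):
            return p
    return len(s)

def predict_next(s: str) -> str:
    """Simplicity-biased prediction: extend the shortest
    periodic pattern consistent with the observed data."""
    p = simplest_period(s)
    return s[len(s) % p]

# A "deceiving" dataset: a low-complexity periodic prefix,
# followed by a departure the simplicity-biased learner cannot anticipate.
prefix = "01" * 50            # looks like period-2 data
deceiver = prefix + "1"       # the generator breaks the simple pattern here

print(predict_next(prefix))   # the learner predicts the simple continuation "0"
print(deceiver[len(prefix)])  # the actual next symbol is "1"
```

The point of the sketch is only that a fixed computable learner commits to the locally optimal low-complexity hypothesis, so a generator of higher algorithmic complexity can always be arranged to diverge from it.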


