Will we run out of data? An analysis of the limits of scaling datasets in Machine Learning

10/26/2022
by Pablo Villalobos, et al.

We analyze the growth of dataset sizes used in machine learning for natural language processing and computer vision, and extrapolate it using two methods: projecting the historical growth rate, and estimating the compute-optimal dataset size for predicted future compute budgets. We then compare this growth in data usage against an estimate of the total stock of unlabeled data available on the internet over the coming decades. Our analysis indicates that the stock of high-quality language data will be exhausted soon, likely before 2026. By contrast, the stocks of low-quality language data and image data will be exhausted only much later: between 2030 and 2050 for low-quality language data, and between 2030 and 2060 for images. Our work suggests that the current trend of ever-growing ML models that rely on enormous datasets might slow down unless data efficiency is drastically improved or new sources of data become available.
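
The first extrapolation method, projecting a historical growth rate forward until it meets the data stock, can be illustrated with a minimal sketch. This is not the authors' model: it assumes both the dataset size and the stock grow at constant annual rates, and the function name, its parameters, and the numbers in the usage line are hypothetical placeholders rather than figures from the paper.

```python
import math


def years_until_exhaustion(dataset_size, dataset_growth, stock_size, stock_growth):
    """Years until an exponentially growing dataset catches up with the data stock.

    Illustrative assumption: both quantities grow at constant annual rates.
    Growth rates are fractional per year (0.5 means +50% per year).
    """
    if dataset_size >= stock_size:
        return 0.0  # the dataset already uses the whole stock
    if dataset_growth <= stock_growth:
        return math.inf  # the dataset never catches up with the stock
    # Solve dataset_size * (1 + g)^t = stock_size * (1 + s)^t for t.
    gap = math.log(stock_size / dataset_size)
    closing_rate = math.log1p(dataset_growth) - math.log1p(stock_growth)
    return gap / closing_rate


# Hypothetical inputs: a dataset that doubles yearly against a fixed stock
# that is 1024 times larger.
years = years_until_exhaustion(1.0, 1.0, 1024.0, 0.0)
```

Under these assumptions the exhaustion date depends only on the ratio between the stock and the current dataset size and on the difference between the two log growth rates, which is why the projected dates are sensitive to the assumed growth rates.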

research
03/30/2017

Application of a Shallow Neural Network to Short-Term Stock Trading

Machine learning is increasingly prevalent in stock market trading. Thou...
research
08/26/2022

Stock Market Prediction using Natural Language Processing – A Survey

The stock market is a network which provides a platform for almost all m...
research
03/17/2022

Time and the Value of Data

Managers often believe that collecting more data will continually improv...
research
05/25/2023

Scaling Data-Constrained Language Models

The current trend of scaling language models involves increasing both pa...
research
08/07/2023

Towards Machine Learning-based Fish Stock Assessment

The accurate assessment of fish stocks is crucial for sustainable fisher...
research
09/15/2011

Using MOEAs To Outperform Stock Benchmarks In The Presence of Typical Investment Constraints

Portfolio managers are typically constrained by turnover limits, minimum...
research
01/20/2023

Accurately summarizing an outbreak using epidemiological models takes time

Recent outbreaks of monkeypox and Ebola, and worrying waves of COVID-19,...
