Tradeoffs for Space, Time, Data and Risk in Unsupervised Learning

05/02/2016
by Mario Lucic, et al.

Faced with massive data, is it possible to trade off (statistical) risk against (computational) space and time? This challenge lies at the heart of large-scale machine learning. Using k-means clustering as a prototypical unsupervised learning problem, we show how to strategically summarize the data (control space) in order to trade off risk and time when the data is generated by a probabilistic model. Our summarization is based on coreset constructions from computational geometry. We also develop an algorithm, TRAM, to navigate the space/time/data/risk tradeoff in practice. In particular, we show that for a fixed risk, the running time of TRAM decreases as the data size increases, and that for a fixed data size, it decreases as the risk increases. Our extensive experiments on real data sets demonstrate the existence and practical utility of such tradeoffs, not only for k-means but also for Gaussian Mixture Models.
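To make the idea of summarizing data before clustering concrete, here is a minimal sketch in Python with NumPy. It is not TRAM, and it is not necessarily the coreset construction used in the paper. As a stand-in it uses one simple importance-sampling scheme (half uniform, half proportional to squared distance from the data mean), followed by a weighted Lloyd iteration on the summary. The function names, the coreset size m, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def lightweight_coreset(X, m, rng):
    """Sample m weighted points from X (illustrative stand-in, not the paper's method)."""
    n = X.shape[0]
    sq_dist = ((X - X.mean(axis=0)) ** 2).sum(axis=1)
    total = sq_dist.sum()
    if total == 0:
        q = np.full(n, 1.0 / n)  # all points identical: fall back to uniform
    else:
        # Mixture of uniform sampling and sampling by squared distance to the mean.
        q = 0.5 / n + 0.5 * sq_dist / total
    idx = rng.choice(n, size=m, replace=True, p=q)
    weights = 1.0 / (m * q[idx])  # importance weights keep the cost estimate unbiased
    return X[idx], weights

def weighted_kmeans(X, w, k, rng, n_iter=20):
    """Lloyd's algorithm on weighted points, with weight-proportional random init."""
    centers = X[rng.choice(X.shape[0], size=k, replace=False, p=w / w.sum())]
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            mask = labels == j
            if mask.any():  # leave a center unchanged if its cluster is empty
                centers[j] = np.average(X[mask], axis=0, weights=w[mask])
    return centers

def kmeans_cost(X, centers):
    """Sum of squared distances from each point to its nearest center."""
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d.min(axis=1).sum()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic data: three Gaussian blobs (an assumption for the demo).
    X = np.concatenate([rng.normal(loc=c, size=(2000, 2)) for c in (-5.0, 0.0, 5.0)])
    for m in (50, 500):  # a smaller summary means less time but typically more risk
        C, w = lightweight_coreset(X, m, rng)
        centers = weighted_kmeans(C, w, k=3, rng=rng)
        print(m, kmeans_cost(X, centers) / len(X))
```

The demo fits on summaries of two sizes and evaluates the resulting centers on the full data set. Varying m is the "control space" knob in this sketch: clustering the summary costs time proportional to m rather than to the full data size, at the price of a noisier solution.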

research
02/28/2023

Scalable Clustering: Large Scale Unsupervised Learning of Gaussian Mixture Models with Outliers

Clustering is a widely used technique with a long and rich history in a ...
research
11/15/2021

Machine Learning for Genomic Data

This report explores the application of machine learning techniques on s...
research
10/25/2019

Unsupervised Space-Time Clustering using Persistent Homology

This paper presents a new clustering algorithm for space-time data based...
research
11/26/2018

Unsupervised learning with sparse space-and-time autoencoders

We use spatially-sparse two, three and four dimensional convolutional au...
research
03/19/2017

Practical Coreset Constructions for Machine Learning

We investigate coresets - succinct, small summaries of large data sets -...
research
02/20/2019

Stochastic Local Interaction Model with Sparse Precision Matrix for Space-Time Interpolation

The application of geostatistical and machine learning methods based on ...
research
02/27/2017

Scalable and Distributed Clustering via Lightweight Coresets

Coresets are compact representations of data sets such that models train...
