Corgi^2: A Hybrid Offline-Online Approach To Storage-Aware Data Shuffling For SGD

09/04/2023
by Etay Livne, et al.

When using Stochastic Gradient Descent (SGD) to train machine learning models, it is often crucial to provide the model with examples sampled at random from the dataset. For large datasets stored in the cloud, however, random access to individual examples is costly and inefficient. A recent work <cit.> proposed an online shuffling algorithm called CorgiPile, which greatly improves the efficiency of data access at the cost of some performance loss; this loss is particularly apparent for large datasets stored in homogeneous shards (e.g., video datasets). In this paper, we introduce a novel two-step partial data shuffling strategy for SGD that combines an offline iteration of the CorgiPile method with a subsequent online iteration. Our approach enjoys the best of both worlds: it performs comparably to SGD with full random access (even on homogeneous data) without compromising the data access efficiency of CorgiPile. We provide a comprehensive theoretical analysis of the convergence properties of our method and demonstrate its practical advantages through experimental results.
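To make the two-step idea concrete, here is a minimal Python sketch of the scheme the abstract describes: one offline CorgiPile-style block shuffle that rewrites the stored data once (breaking up homogeneous shards), followed by the usual online buffered shuffle at training time. This is an illustrative assumption of the mechanism, not the paper's actual implementation; lists stand in for cloud-stored shards, and all names and parameters (block_size, buffer_blocks) are hypothetical.

```python
import random

def split_into_blocks(dataset, block_size):
    """Split a dataset (list of examples) into contiguous blocks."""
    return [dataset[i:i + block_size] for i in range(0, len(dataset), block_size)]

def corgipile_pass(dataset, block_size, buffer_blocks):
    """One CorgiPile-style pass: visit blocks in random order, keep a
    bounded buffer of blocks in memory, and shuffle within that buffer.
    Only block-granularity reads are needed, never per-example random access."""
    pending = split_into_blocks(dataset, block_size)
    random.shuffle(pending)              # random block order (cheap at block granularity)
    out, buf = [], []
    for block in pending:
        buf.extend(block)
        if len(buf) >= buffer_blocks * block_size:
            random.shuffle(buf)          # shuffle within the in-memory buffer
            out.extend(buf)
            buf = []
    random.shuffle(buf)                  # flush whatever remains in the buffer
    out.extend(buf)
    return out

def corgi2(dataset, block_size, buffer_blocks):
    """Corgi^2 sketch: an offline CorgiPile pass that re-shards the data
    once, then the standard online CorgiPile pass during training."""
    resharded = corgipile_pass(dataset, block_size, buffer_blocks)   # offline, done once
    return corgipile_pass(resharded, block_size, buffer_blocks)      # online, per epoch

if __name__ == "__main__":
    data = list(range(100))              # toy "dataset" of 100 examples
    print(corgi2(data, block_size=10, buffer_blocks=3)[:10])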

