Generating Synthetic Datasets by Interpolating along Generalized Geodesics

06/12/2023
by Jiaojiao Fan, et al.

Data for pretraining machine learning models often consists of collections of heterogeneous datasets. Although training on their union is reasonable in agnostic settings, it might be suboptimal when the target domain – where the model will ultimately be used – is known in advance. In that case, one would ideally pretrain only on the dataset(s) most similar to the target one. Instead of limiting this choice to those datasets already present in the pretraining collection, here we explore extending this search to all datasets that can be synthesized as 'combinations' of them. We define such combinations as multi-dataset interpolations, formalized through the notion of generalized geodesics from optimal transport (OT) theory. We compute these geodesics using a recent notion of distance between labeled datasets, and derive alternative interpolation schemes based on it: using either barycentric projections or optimal transport maps, the latter computed using recent neural OT methods. These methods are scalable, efficient, and – notably – can be used to interpolate even between datasets with distinct and unrelated label sets. Through various experiments in transfer learning in computer vision, we demonstrate this is a promising new approach for targeted on-demand dataset synthesis.
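To make the barycentric-projection variant concrete, here is a minimal sketch of interpolating along a generalized geodesic. It is an illustration under strong simplifying assumptions, not the paper's implementation: it works on features only and ignores labels, it uses a squared Euclidean cost with a hand-written Sinkhorn solver rather than the labeled-dataset distance the abstract refers to, and all function names, parameters (`reg`, `n_iters`), and the toy data are hypothetical. The idea shown is that each point of a base dataset is mapped to its barycentric projection onto every pretraining dataset, and the synthetic dataset is the convex combination of those mapped points.

```python
import numpy as np


def sinkhorn_plan(x, y, reg=0.05, n_iters=500):
    """Entropic OT plan between two uniformly weighted point clouds.

    x has shape (n, d), y has shape (m, d). The squared Euclidean cost is an
    assumption made for this sketch only.
    """
    n, m = len(x), len(y)
    a = np.full(n, 1.0 / n)  # uniform weights on x
    b = np.full(m, 1.0 / m)  # uniform weights on y
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
    cost = cost / max(cost.max(), 1e-12)  # rescale to [0, 1] for stability
    kernel = np.exp(-cost / reg)
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):
        u = a / (kernel @ v)
        v = b / (kernel.T @ u)
    return u[:, None] * kernel * v[None, :]


def barycentric_projection(base, other, **kwargs):
    """Map each base point to the plan-weighted average of points in `other`."""
    plan = sinkhorn_plan(base, other, **kwargs)
    return (plan @ other) / plan.sum(axis=1, keepdims=True)


def generalized_geodesic(base, datasets, weights, **kwargs):
    """Convex combination of the projections of `base` onto each dataset."""
    weights = np.asarray(weights, dtype=float)
    if weights.min() < 0 or not np.isclose(weights.sum(), 1.0):
        raise ValueError("weights must be non-negative and sum to 1")
    maps = [barycentric_projection(base, d, **kwargs) for d in datasets]
    return sum(w * m for w, m in zip(weights, maps))


# Toy data: a base cloud and two "pretraining" clouds of different sizes.
rng = np.random.default_rng(0)
base = rng.normal(size=(50, 2))
d1 = rng.normal(size=(60, 2)) + np.array([4.0, 0.0])
d2 = rng.normal(size=(40, 2)) + np.array([0.0, 4.0])

# An equal-weight interpolation between the two pretraining clouds.
mix = generalized_geodesic(base, [d1, d2], [0.5, 0.5])
```

Because the base dataset fixes the pairing, the interpolated set always has as many points as the base set, regardless of how many samples each pretraining dataset contains. Setting one weight to 1 recovers the projection onto that single dataset.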
