SimEx: Express Prediction of Inter-dataset Similarity by a Fleet of Autoencoders

01/14/2020
by   Inseok Hwang, et al.

Knowing the similarity between datasets has a number of positive implications for training an effective model, such as informing the selection of known datasets favorable for model transfer or data augmentation when facing an unknown dataset. Common practices for estimating the similarity between data include comparing in the original sample space, comparing in the embedding space of a model trained on a certain task, or fine-tuning a pretrained model with different datasets and evaluating the resulting performance changes. However, these practices suffer from shallow comparisons, task-specific biases, or the extensive time and computation required to perform the comparisons. We present SimEx, a new method for early prediction of inter-dataset similarity using a fleet of pretrained autoencoders, each dedicated to reconstructing a specific part of known data. Specifically, our method feeds unknown data samples into these pretrained autoencoders and evaluates the difference between the reconstructed outputs and their original inputs. Our intuition is that the more similar the unknown samples are to the part of known data an autoencoder was trained on, the better the chance that this autoencoder can exploit its learned knowledge and reconstruct outputs close to the originals. We demonstrate that our method achieves more than 10x speed-up in predicting inter-dataset similarity compared to common similarity-estimating practices. We also demonstrate that the inter-dataset similarity estimated by our method correlates well with common practices and outperforms baseline approaches that compare in the sample or embedding space, without training anything new at comparison time.
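The core scoring loop described above can be sketched in a few lines: pass unknown samples through each model trained on a known dataset, measure reconstruction error, and treat lower error as higher predicted similarity. The sketch below is illustrative only and is not the authors' implementation; it substitutes a rank-k PCA projection (`fit_pca_autoencoder`, a hypothetical helper) for a trained neural autoencoder, since both provide an encode/decode pair whose reconstruction quality reflects how well the model's learned structure matches the input data.

```python
import numpy as np


def fit_pca_autoencoder(X, k):
    # Stand-in for a pretrained autoencoder: a rank-k linear
    # encoder/decoder (PCA) fitted to one known dataset.
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    W = Vt[:k].T  # (d, k) projection shared by encoder and decoder
    return mu, W


def reconstruction_error(model, X):
    mu, W = model
    Z = (X - mu) @ W        # encode
    X_hat = Z @ W.T + mu    # decode
    return float(np.mean((X - X_hat) ** 2))


def simex_scores(models, X_unknown):
    # Lower reconstruction error -> higher predicted similarity
    # between the unknown data and that model's known dataset.
    return {name: reconstruction_error(m, X_unknown)
            for name, m in models.items()}


rng = np.random.default_rng(0)
# Two synthetic "known" datasets living on different 2-D subspaces.
mix_a = rng.normal(size=(2, 10))
mix_b = rng.normal(size=(2, 10))
known_a = rng.normal(size=(500, 2)) @ mix_a
known_b = rng.normal(size=(500, 2)) @ mix_b

models = {
    "A": fit_pca_autoencoder(known_a, k=2),
    "B": fit_pca_autoencoder(known_b, k=2),
}

# Unknown data drawn from A's subspace should reconstruct far
# better under A's model than under B's.
unknown = rng.normal(size=(200, 2)) @ mix_a
scores = simex_scores(models, unknown)
print(scores["A"] < scores["B"])  # -> True
```

No training happens at comparison time: scoring an unknown dataset against N known datasets costs only N forward passes, which is the source of the speed-up the abstract claims over fine-tuning-based comparisons.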

