Improving Dataset Distillation

10/06/2019
by Ilia Sucholutsky, et al.

Dataset distillation is a method for reducing dataset sizes: the goal is to learn a small number of synthetic samples containing all the information of a large dataset. This has several benefits: speeding up model training in deep learning, reducing energy consumption, and reducing required storage space. Currently, each synthetic sample is assigned a single 'hard' label, which limits the accuracies that models trained on distilled datasets can achieve. In addition, dataset distillation can currently only be used with image data. We propose to simultaneously distill both images and their labels, and thus to assign each synthetic sample a 'soft' label (a distribution over labels) rather than a single 'hard' label. Our improved algorithm increases accuracy by 2-4% on several classification tasks. For example, training a LeNet model with just 10 distilled images (one per class) results in over 96% accuracy on MNIST data. Using 'soft' labels also enables distilled datasets to consist of fewer samples than there are classes, as each sample can encode information for more than one class. For example, we show that LeNet achieves almost 92% accuracy on MNIST after being trained on just 5 distilled images. We also propose an extension of the dataset distillation algorithm that allows it to distill sequential datasets, including texts. We demonstrate that text distillation outperforms other methods across multiple datasets. For example, we are able to train models to almost their original accuracy on the IMDB sentiment analysis task using just 20 distilled sentences.
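To make the 'soft' label idea concrete, the sketch below (not the authors' code) shows in PyTorch how both the distilled images and their label distributions could be treated as learnable tensors, with a student model trained against the full distribution instead of a hard class index. All names, shapes, and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical setup: 10 distilled MNIST-sized images, 10 classes.
# Both the images and their soft labels are learnable parameters, so a
# distillation procedure can optimize them jointly.
distilled_images = nn.Parameter(torch.randn(10, 1, 28, 28))
soft_label_logits = nn.Parameter(torch.zeros(10, 10))  # one distribution per image

def soft_label_loss(student_logits, label_logits):
    """Cross-entropy against a full label distribution ('soft' labels)
    rather than a single 'hard' class index."""
    target = F.softmax(label_logits, dim=1)
    log_pred = F.log_softmax(student_logits, dim=1)
    return -(target * log_pred).sum(dim=1).mean()

# Inner step: train a (toy) student model on the distilled samples.
student = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
inner_opt = torch.optim.SGD(student.parameters(), lr=0.1)

inner_opt.zero_grad()
loss = soft_label_loss(student(distilled_images), soft_label_logits)
loss.backward()
inner_opt.step()
```

In the full algorithm described by the abstract, an outer loop would additionally backpropagate the student's loss on real training data through such inner steps to update `distilled_images` and `soft_label_logits` themselves; that second-order outer loop is omitted here for brevity.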


