Dataset Distillation: A Comprehensive Review
The recent success of deep learning is largely attributable to the huge amount of data used to train deep neural networks. However, the sheer volume of data significantly increases the burden on storage and transmission, and training models on such large datasets consumes considerable time and computational resources. Moreover, directly publishing raw data inevitably raises privacy and copyright concerns. To address these issues, dataset distillation (DD), also known as dataset condensation (DC), has become a popular research topic in recent years. Given an original large dataset, DD aims to produce a much smaller dataset of synthetic samples, such that models trained on the synthetic dataset achieve performance comparable to those trained on the original real one. This paper presents a comprehensive review of recent advances in DD and its applications. We first formally introduce the task and propose an overall algorithmic framework followed by all existing DD methods. We then provide a systematic taxonomy of current methodologies in this area and discuss their theoretical relationships. We also point out current challenges in DD through extensive experiments and envision possible directions for future work.
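To make the objective concrete, below is a minimal toy sketch of one popular DD formulation, gradient matching: synthetic samples are optimized so that the training gradient they induce matches the gradient computed on the full real dataset, across many randomly sampled model parameters. Everything here is illustrative (a one-parameter linear model and a single synthetic point, rather than the images and deep networks used in actual DD work); all names and values are assumptions for the sketch, not the survey's notation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "real" dataset: 500 points from y = 3x + noise.
x_real = rng.uniform(-1, 1, 500)
y_real = 3.0 * x_real + 0.1 * rng.standard_normal(500)


def grad(w, x, y):
    """d/dw of mean squared error for the linear model y_hat = w * x."""
    return np.mean((w * x - y) * x)


# The distilled dataset: a single learnable synthetic point (x_s, y_s).
x_s, y_s = 0.5, 0.0

# Distillation loop: adjust (x_s, y_s) so the synthetic-set gradient
# matches the real-set gradient at many randomly sampled parameters w.
lr = 0.05
for _ in range(10000):
    w = rng.uniform(-2, 2)                     # random model parameter
    g_syn = (w * x_s - y_s) * x_s              # gradient from synthetic point
    diff = g_syn - grad(w, x_real, y_real)
    # Analytic gradients of diff**2 w.r.t. the synthetic point.
    x_s -= lr * 2 * diff * (2 * w * x_s - y_s)
    y_s -= lr * 2 * diff * (-x_s)

# A model fit on the single distilled point vs. one fit on all 500 real points
# (both closed-form least-squares solutions for y_hat = w * x).
w_distilled = y_s / x_s
w_full = np.mean(x_real * y_real) / np.mean(x_real**2)
```

After distillation, `w_distilled` and `w_full` nearly coincide: one synthetic point suffices to reproduce training on the full dataset for this toy model, which is exactly the comparable-performance goal the abstract describes, at a fraction of the storage cost.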