Selection Via Proxy: Efficient Data Selection For Deep Learning

06/26/2019
by   Cody Coleman, et al.

Data selection methods such as active learning and core-set selection are useful tools for machine learning on large datasets, but they can be prohibitively expensive to apply in deep learning. Unlike in other areas of machine learning, the feature representations that these techniques depend on are learned in deep learning rather than given, which takes a substantial amount of training time. In this work, we show that we can significantly improve the computational efficiency of data selection in deep learning by using a much smaller proxy model to perform data selection for tasks that will eventually require a large target model (e.g., selecting data points to label for active learning). In deep learning, we can scale down models by removing hidden layers or reducing their dimension to create proxies that are an order of magnitude faster. Although these small proxy models have significantly higher error, we find that they empirically provide useful rankings for data selection that have a high correlation with those of larger models. We evaluate this "selection via proxy" (SVP) approach on several data selection tasks. For active learning, applying SVP to Sener and Savarese [2018]'s recent method for active learning in deep learning gives a 4x improvement in execution time while yielding the same model accuracy. For core-set selection, we show that a proxy model that trains 10x faster than a target ResNet164 model on CIFAR10 can be used to remove 50% of the data without significantly impacting the accuracy of the target model, making end-to-end training time improvements via core-set selection possible.
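The ranking step at the heart of SVP lends itself to a short sketch. The following is a minimal, hypothetical illustration of proxy-based core-set selection in PyTorch, not the authors' released code: a small proxy model scores every candidate example by predictive entropy, and only the top-ranked fraction is kept to train the large target model. The SmallConvNet and train helpers are placeholders, and entropy is just one possible uncertainty metric.

import torch
import torch.nn.functional as F

def select_via_proxy(proxy_model, pool_loader, keep_fraction=0.5, device="cpu"):
    """Rank candidate points with a small proxy model and keep the most
    uncertain `keep_fraction` of them for training the target model."""
    proxy_model.eval().to(device)
    scores = []
    with torch.no_grad():
        for inputs, _ in pool_loader:  # labels are unused for ranking
            logits = proxy_model(inputs.to(device))
            probs = F.softmax(logits, dim=1)
            # Predictive entropy: higher means more uncertain / informative.
            entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1)
            scores.append(entropy.cpu())
    scores = torch.cat(scores)
    k = int(keep_fraction * len(scores))
    # Indices of the k highest-entropy examples (assumes the loader
    # preserves dataset order, i.e., shuffle=False).
    return torch.topk(scores, k).indices

# Usage sketch: train a cheap proxy, select 50% of the data, then train
# the expensive target (e.g., ResNet164 on CIFAR10) on the subset only.
# proxy = SmallConvNet()                       # hypothetical small model
# train(proxy, full_train_set)                 # hypothetical helper
# idx = select_via_proxy(proxy, pool_loader, keep_fraction=0.5)
# train(ResNet164(), torch.utils.data.Subset(full_train_set, idx))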


Related Research

Deep Bayesian Active Learning with Image Data (03/08/2017)
Even though active learning forms an important pillar of machine learnin...

Speeding Up BatchBALD: A k-BALD Family of Approximations for Active Learning (01/23/2023)
Active learning is a powerful method for training machine learning model...

Reducing the Long Tail Losses in Scientific Emulations with Active Learning (11/15/2021)
Deep-learning-based models are increasingly used to emulate scientific s...

Composable Core-sets for Diversity Approximation on Multi-Dataset Streams (08/10/2023)
Core-sets refer to subsets of data that maximize some function that is c...

Prioritized Training on Points that are Learnable, Worth Learning, and Not Yet Learned (07/06/2021)
We introduce Goldilocks Selection, a technique for faster model training...

GLISTER: Generalization based Data Subset Selection for Efficient and Robust Learning (12/19/2020)
Large scale machine learning and deep models are extremely data-hungry. ...

Data Efficient Lithography Modeling with Transfer Learning and Active Data Selection (06/27/2018)
Lithography simulation is one of the key steps in physical verification,...
