Data Programming: Creating Large Training Sets, Quickly

05/25/2016
∙
by   Alexander Ratner, et al.
∙
0
∙

Large labeled training sets are the critical building blocks of supervised learning methods and are key enablers of deep learning techniques. For some applications, creating labeled training sets is the most time-consuming and expensive part of applying machine learning. We therefore propose a paradigm for the programmatic creation of training sets called data programming in which users express weak supervision strategies or domain heuristics as labeling functions, which are programs that label subsets of the data, but that are noisy and may conflict. We show that by explicitly representing this training set labeling process as a generative model, we can "denoise" the generated training set, and establish theoretically that we can recover the parameters of these generative models in a handful of settings. We then show how to modify a discriminative loss function to make it noise-aware, and demonstrate our method over a range of discriminative models including logistic regression and LSTMs. Experimentally, on the 2014 TAC-KBP Slot Filling challenge, we show that data programming would have led to a new winning score, and also show that applying data programming to an LSTM model leads to a TAC-KBP score almost 6 F1 points over a state-of-the-art LSTM baseline (and into second place in the competition). Additionally, in initial user studies we observed that data programming may be an easier way for non-experts to create machine learning models when training data is limited or unavailable.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
∙ 10/25/2016

Socratic Learning: Augmenting Generative Models to Incorporate Latent Subsets in Training Data

A challenge in training discriminative models like neural networks is ob...
research
∙ 11/22/2019

Data Programming using Continuous and Quality-Guided Labeling Functions

Scarcity of labeled data is a bottleneck for supervised learning models....
research
∙ 09/03/2020

Data Programming by Demonstration: A Framework for Interactively Learning Labeling Functions

Data programming is a programmatic weak supervision approach to efficien...
research
∙ 06/24/2021

TagRuler: Interactive Tool for Span-Level Data Programming by Demonstration

Despite rapid developments in the field of machine learning research, co...
research
∙ 02/04/2020

Iterative Data Programming for Expanding Text Classification Corpora

Real-world text classification tasks often require many labeled training...
research
∙ 11/28/2017

Snorkel: Rapid Training Data Creation with Weak Supervision

Labeling training data is increasingly the largest bottleneck in deployi...
research
∙ 04/07/2020

Inspector Gadget: A Data Programming-based Labeling System for Industrial Images

As machine learning for images becomes democratized in the Software 2.0 ...

Please sign up or login with your details

Forgot password? Click here to reset