Snorkel: Rapid Training Data Creation with Weak Supervision

11/28/2017 ∙ by Alexander Ratner, et al.

Labeling training data is increasingly the largest bottleneck in deploying machine learning systems. We present Snorkel, a first-of-its-kind system that enables users to train state-of-the-art models without hand labeling any training data. Instead, users write labeling functions that express arbitrary heuristics, which can have unknown accuracies and correlations. Snorkel denoises their outputs without access to ground truth by incorporating the first end-to-end implementation of our recently proposed machine learning paradigm, data programming. We present a flexible interface layer for writing labeling functions based on our experience over the past year collaborating with companies, agencies, and research labs. In a user study, subject matter experts build models 2.8x faster and increase predictive performance an average 45.5% versus seven hours of hand labeling. We study the modeling tradeoffs in this new setting and propose an optimizer for automating tradeoff decisions that gives up to 1.8x speedup per pipeline execution. In two collaborations, with the U.S. Department of Veterans Affairs and the U.S. Food and Drug Administration, and on four open-source text and image data sets representative of other deployments, Snorkel provides 132% average improvements to predictive performance over prior heuristic approaches and comes within an average 3.60% of the predictive performance of large hand-curated training sets.
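
To make the idea of labeling functions concrete, here is a minimal sketch in plain Python. It does not use Snorkel's API, and every name in it (the `lf_*` functions, `apply_labeling_functions`, `majority_vote`, the example sentences) is hypothetical. It also combines votes with a simple majority vote, which is only a baseline stand-in: the data programming approach described in the abstract instead estimates the accuracies and correlations of the labeling functions without ground truth.

```python
from collections import Counter

# Label conventions assumed for this sketch only.
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1


def lf_contains_causes(text):
    """Heuristic: the word 'causes' suggests a causal relation."""
    return POSITIVE if "causes" in text.lower() else ABSTAIN


def lf_contains_negation(text):
    """Heuristic: explicit negation suggests no relation."""
    lowered = text.lower()
    return NEGATIVE if "not " in lowered or "no evidence" in lowered else ABSTAIN


def lf_contains_induced(text):
    """Heuristic: 'induced' (as in 'drug-induced') suggests a relation."""
    return POSITIVE if "induced" in text.lower() else ABSTAIN


LABELING_FUNCTIONS = [lf_contains_causes, lf_contains_negation, lf_contains_induced]


def apply_labeling_functions(texts, lfs=LABELING_FUNCTIONS):
    """Build a label matrix: one row per text, one column per labeling function."""
    return [[lf(text) for lf in lfs] for text in texts]


def majority_vote(votes):
    """Combine one row of votes, ignoring abstentions; ties and empty rows abstain."""
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


texts = [
    "Aspirin causes stomach bleeding in some patients.",
    "There is no evidence that the drug causes headaches.",
    "The weather was pleasant.",
    "Chemotherapy-induced nausea was reported.",
]
label_matrix = apply_labeling_functions(texts)
labels = [majority_vote(row) for row in label_matrix]
print(labels)
```

The second sentence shows why denoising matters: two heuristics disagree, and majority vote can only abstain, whereas a model that has learned which heuristic is more accurate could still assign a label.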


research
∙ 09/03/2020

Data Programming by Demonstration: A Framework for Interactively Learning Labeling Functions

Data programming is a programmatic weak supervision approach to efficien...
research
∙ 03/11/2019

GOGGLES: Automatic Training Data Generation with Affinity Coding

Generating large labeled training data is becoming the biggest bottlenec...
research
∙ 05/29/2023

Alfred: A System for Prompted Weak Supervision

Alfred is the first system for programmatic weak supervision (PWS) that ...
research
∙ 03/02/2022

Nemo: Guiding and Contextualizing Weak Supervision for Interactive Data Programming

Weak Supervision (WS) techniques allow users to efficiently create large...
research
∙ 05/25/2016

Data Programming: Creating Large Training Sets, Quickly

Large labeled training sets are the critical building blocks of supervis...
research
∙ 07/05/2021

End-to-End Weak Supervision

Aggregating multiple sources of weak supervision (WS) can ease the data-...
research
∙ 03/26/2019

Cross-Modal Data Programming Enables Rapid Medical Machine Learning

Labeling training datasets has become a key barrier to building medical ...
