A Benchmark Generative Probabilistic Model for Weak Supervised Learning

03/31/2023
by Georgios Papadopoulos, et al.

Finding relevant, high-quality datasets to train machine learning models is a major bottleneck for practitioners. Moreover, ambitious real-world use cases usually require data labelled with high-quality annotations that can support the training of a supervised model. Manually producing such labels is time-consuming and challenging, and it often becomes the limiting step in a machine learning project. Weak Supervised Learning (WSL) approaches have been developed to alleviate the annotation burden by automatically assigning approximate labels (pseudo-labels) to unlabelled data based on heuristics, distant supervision and knowledge bases. We apply probabilistic generative latent variable models (PLVMs), trained on heuristic labelling representations of the original dataset, as an accurate, fast and cost-effective way to generate pseudo-labels. We show that the PLVMs achieve state-of-the-art performance across four datasets. For example, they achieve 22 [...]. PLVMs are plug-and-play: they can serve as a drop-in replacement for existing WSL frameworks (e.g. Snorkel) or as benchmark models for more complicated algorithms, giving practitioners a compelling accuracy boost.
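To make the idea of a generative latent variable model over heuristic labels concrete, here is a minimal sketch. It is not the authors' PLVM: it assumes a two-component Bernoulli mixture fitted with EM, binary heuristic votes with no abstentions, and heuristics that are better than chance. The function name `fit_bernoulli_mixture` and its parameters are hypothetical and chosen only for illustration.

```python
import numpy as np


def fit_bernoulli_mixture(votes, n_iter=100):
    """Fit a two-component Bernoulli mixture to binary heuristic votes with EM.

    votes: array of shape (n_samples, n_heuristics) with entries in {0, 1}.
    Returns the posterior probability that each sample belongs to class 1.

    Illustrative stand-in for a generative latent variable model; this is an
    assumption of the sketch, not the model described in the paper.
    """
    votes = np.asarray(votes, dtype=float)
    n, m = votes.shape
    prior = np.array([0.5, 0.5])  # p(z): prior over the latent class
    # p(vote_j = 1 | z). The asymmetric start encodes the assumption that
    # heuristics vote 1 more often for the positive class than the negative.
    theta = np.vstack([np.full(m, 0.4), np.full(m, 0.6)])

    for _ in range(n_iter):
        # E-step: posterior over the latent class, computed in log space.
        log_lik = votes @ np.log(theta).T + (1.0 - votes) @ np.log(1.0 - theta).T
        log_post = log_lik + np.log(prior)
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)

        # M-step: re-estimate parameters, clipped to keep the logs finite.
        weight = post.sum(axis=0)
        prior = np.clip(weight / n, 1e-6, None)
        prior /= prior.sum()
        theta = (post.T @ votes) / np.maximum(weight[:, None], 1e-12)
        theta = np.clip(theta, 1e-3, 1.0 - 1e-3)

    return post[:, 1]
```

Thresholding the returned probabilities at 0.5 gives hard pseudo-labels, while keeping them as soft labels preserves the model's uncertainty for a downstream classifier.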


research
11/05/2022

A Comparison of Automatic Labelling Approaches for Sentiment Analysis

Labelling a large quantity of social media data for the task of supervis...
research
03/22/2022

Pseudo Label Is Better Than Human Label

State-of-the-art automatic speech recognition (ASR) systems are trained ...
research
12/17/2021

A data-centric weak supervised learning for highway traffic incident detection

Using the data from loop detector sensors for near-real-time detection o...
research
05/30/2022

Conformal Credal Self-Supervised Learning

In semi-supervised learning, the paradigm of self-training refers to the...
research
03/23/2022

A Framework for Fast Polarity Labelling of Massive Data Streams

Many of the existing sentiment analysis techniques are based on supervis...
research
10/13/2022

ComSearch: Equation Searching with Combinatorial Strategy for Solving Math Word Problems with Weak Supervision

Previous studies have introduced a weakly-supervised paradigm for solvin...
research
07/16/2021

Pseudo-labelling Enhanced Media Bias Detection

Leveraging unlabelled data through weak or distant supervision is a comp...
