Repairing Systematic Outliers by Learning Clean Subspaces in VAEs

07/17/2022
by   Simão Eduardo, et al.

Data cleaning often comprises outlier detection and data repair. Systematic errors result from nearly deterministic transformations that occur repeatedly in the data, e.g. specific image pixels being set to default values, or watermarks. Consequently, models with enough capacity easily overfit to these errors, making detection and repair difficult. Since a systematic outlier combines the patterns of a clean instance with systematic error patterns, our main insight is that inliers can be modelled by a smaller representation (subspace) in a model than outliers. Exploiting this, we propose the Clean Subspace Variational Autoencoder (CLSVAE), a novel semi-supervised model for detection and automated repair of systematic errors. The main idea is to partition the latent space and model inlier and outlier patterns separately. CLSVAE is effective with much less labelled data than previous related models, often with less than 2% [...] image datasets in scenarios with different levels of corruption and labelled set sizes, comparing to relevant baselines. CLSVAE provides superior repairs without human intervention, e.g. with just 0.25% [...] relative error decrease of 58% [...]


