Data Cleansing for Models Trained with SGD

06/20/2019
by   Satoshi Hara, et al.

Data cleansing is a common way to improve the accuracy of machine learning models, but it requires extensive domain knowledge to identify the influential instances that affect the models. In this paper, we propose an algorithm that suggests influential instances without using any domain knowledge. Users only need to inspect the instances the algorithm suggests, so even non-experts can conduct data cleansing and improve the model. Existing methods require the loss function to be convex and an optimal model to be obtained, conditions that do not always hold in modern machine learning. To overcome these limitations, we propose a novel approach designed specifically for models trained with stochastic gradient descent (SGD). The proposed method infers the influential instances by retracing the steps of SGD while incorporating the intermediate models computed at each step. Through experiments, we demonstrate that the proposed method can accurately infer the influential instances. Moreover, using MNIST and CIFAR10, we show that models can be effectively improved by removing the influential instances suggested by the proposed method.
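To make the notion of an "influential instance" concrete, the following is a minimal sketch of a brute-force baseline: replay the same SGD batch sequence with one training instance left out and measure how the validation loss changes. It is not the estimator proposed in the paper; it simply retrains once per removed instance. The toy linear-regression data, the corrupted label at index 7, and the names `run_sgd`, `val_loss`, and `influence` are all assumptions made for illustration.

```python
import numpy as np

def run_sgd(X, y, batches, lr=0.05, skip=None):
    """Replay a fixed batch sequence of SGD on squared loss.

    If `skip` is a training index, that instance's gradient term is
    dropped from every batch it appears in (an assumption of this sketch:
    the step is still divided by the original batch size).
    """
    w = np.zeros(X.shape[1])
    for batch in batches:
        idx = [i for i in batch if i != skip]
        if idx:
            residual = X[idx] @ w - y[idx]
            w = w - lr * (X[idx].T @ residual) / len(batch)
    return w

def val_loss(w, Xv, yv):
    """Mean squared error on held-out data."""
    return float(np.mean((Xv @ w - yv) ** 2))

# Toy data (hypothetical): linear targets with one corrupted label.
rng = np.random.default_rng(0)
n, d = 40, 3
w_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(n, d))
y = X @ w_true + 0.01 * rng.normal(size=n)
y[7] += 20.0  # the instance we hope to flag
Xv = rng.normal(size=(200, d))
yv = Xv @ w_true

# Fix the batch order once so every replay sees the same SGD steps.
batches = []
for _ in range(20):
    perm = rng.permutation(n)
    batches += [perm[k:k + 8].tolist() for k in range(0, n, 8)]

base = val_loss(run_sgd(X, y, batches), Xv, yv)

# influence[j] < 0 means that removing instance j lowers validation loss.
influence = np.array([
    val_loss(run_sgd(X, y, batches, skip=j), Xv, yv) - base
    for j in range(n)
])

print("most harmful training indices:", np.argsort(influence)[:3])
```

This baseline costs one full training run per instance, which is impractical beyond toy problems. That cost is the motivation for inferring influence from the recorded SGD steps instead of retraining.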


research
01/10/2020

Choosing the Sample with Lowest Loss makes SGD Robust

The presence of outliers can potentially significantly skew the paramete...
research
03/12/2020

Machine Learning on Volatile Instances

Due to the massive size of the neural network models and training datase...
research
01/16/2020

Elastic Consistency: A General Consistency Model for Distributed Stochastic Gradient Descent

Machine learning has made tremendous progress in recent years, with mode...
research
03/15/2018

On the insufficiency of existing momentum schemes for Stochastic Optimization

Momentum based stochastic gradient methods such as heavy ball (HB) and N...
research
06/04/2021

Learning Curves for SGD on Structured Features

The generalization performance of a machine learning algorithm such as a...
research
06/27/2012

Incorporating Domain Knowledge in Matching Problems via Harmonic Analysis

Matching one set of objects to another is a ubiquitous task in machine l...
research
12/20/2016

Neuro-symbolic EDA-based Optimisation using ILP-enhanced DBNs

We investigate solving discrete optimisation problems using the estimati...
