Slice Tuner: A Selective Data Collection Framework for Accurate and Fair Machine Learning Models

03/10/2020
by Ki Hyun Tae, et al.

As machine learning becomes democratized in the era of Software 2.0, one of the most serious bottlenecks is collecting enough labeled data to ensure accurate and fair models. Recent techniques, including crowdsourcing, provide cost-effective ways to gather such data. However, simply collecting as much data as possible is not necessarily an effective strategy for optimizing accuracy and fairness. For example, if an online app store has enough training data for certain slices of data (say, American customers) but not for others, collecting more American customer data will only bias the model training. Instead, we contend that one needs to collect data selectively, and we propose Slice Tuner, which collects possibly different amounts of data per slice such that the model accuracy and fairness on all slices are optimized. At its core, Slice Tuner maintains learning curves of slices that estimate the model accuracies given more data, and it uses convex optimization to find the best data collection strategy. The key challenges of estimating learning curves are that they may be inaccurate if there is not enough data, and that there may be dependencies among slices, where collecting data for one slice influences the learning curves of others. We solve these issues by iteratively and efficiently updating the learning curves as more data is collected. We evaluate Slice Tuner on real datasets using crowdsourcing for data collection and show that Slice Tuner significantly outperforms baselines in terms of model accuracy and fairness, even for initially small slices. We believe Slice Tuner is a practical tool for suggesting concrete action items based on model analysis.
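To make the idea of per-slice learning curves and budgeted collection concrete, here is a minimal sketch in Python. It is not the authors' implementation. It assumes each slice's loss follows a power law, loss ≈ b · n^(-a), fitted by least squares in log-log space, and it replaces the convex optimization step with a greedy allocation that repeatedly gives a fixed chunk of the budget to the slice with the largest estimated loss reduction. The fairness term is left out for brevity, and the function names `fit_power_law` and `allocate`, the slice names, and all numbers are hypothetical.

```python
import math


def fit_power_law(sizes, losses):
    """Fit loss ~ b * n**(-a) by ordinary least squares in log-log space.

    Assumption: a power-law learning curve; the abstract does not state
    which curve family is used.
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(l) for l in losses]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    a = -slope                          # decay exponent
    b = math.exp(mean_y - slope * mean_x)  # scale factor
    return a, b


def allocate(curves, current_sizes, budget, step=10):
    """Greedily split `budget` new examples across slices.

    curves: {slice_name: (a, b)} fitted learning-curve parameters.
    current_sizes: {slice_name: number of examples already collected}.
    Each round gives `step` examples to the slice whose estimated loss
    drops the most. This is a stand-in for the convex optimization
    mentioned in the abstract, and it ignores fairness.
    """
    sizes = dict(current_sizes)
    extra = {name: 0 for name in curves}

    def estimated_loss(name, n):
        a, b = curves[name]
        return b * n ** (-a)

    remaining = budget
    while remaining >= step:
        best = max(
            curves,
            key=lambda s: estimated_loss(s, sizes[s])
            - estimated_loss(s, sizes[s] + step),
        )
        sizes[best] += step
        extra[best] += step
        remaining -= step
    return extra


# Hypothetical measurements: validation loss at several training-set sizes.
curves = {
    "small_slice": fit_power_law([10, 20, 40], [0.632, 0.447, 0.316]),
    "large_slice": fit_power_law([100, 200, 400], [0.200, 0.141, 0.100]),
}
plan = allocate(curves, {"small_slice": 40, "large_slice": 400}, budget=100)
print(plan)
```

In this toy setup the slice with little data sits on the steep part of its curve, so the allocation favors it, which mirrors the app-store example above. As the abstract notes, curves fitted on very little data may be inaccurate and slices may influence one another, so a realistic loop would refit the curves after each batch of collected data rather than trust a single fit.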

