Collect, Measure, Repeat: Reliability Factors for Responsible AI Data Collection

08/22/2023
by Oana Inel, et al.

The rapid entry of machine learning approaches into our daily activities and high-stakes domains demands transparency and scrutiny of their fairness and reliability. To help gauge machine learning models' robustness, research typically focuses on the massive datasets used for their deployment, e.g., creating and maintaining documentation for understanding their origin, process of development, and ethical considerations. However, data collection for AI is still typically a one-off practice, and datasets collected for one purpose or application are often reused for a different problem. Additionally, dataset annotations may not remain representative over time, may contain ambiguous or erroneous labels, or may fail to generalize across issues or domains. Recent research has shown that these practices can lead to unfair, biased, or inaccurate outcomes. We argue that data collection for AI should be performed responsibly, with the quality of the data thoroughly scrutinized and measured through a systematic set of appropriate metrics. In this paper, we propose a Responsible AI (RAI) methodology designed to guide data collection with a set of metrics for an iterative, in-depth analysis of the factors influencing the quality and reliability of the generated data. We propose a granular set of measurements to inform on the internal reliability of a dataset and its external stability over time. We validate our approach across nine existing datasets and annotation tasks and four content modalities. This approach impacts the assessment of the robustness of data used for AI applied in the real world, where diversity of users and content is prominent. Furthermore, it addresses fairness and accountability aspects in data collection by providing systematic and transparent quality analysis for data collections.
