Data Smells: Categories, Causes and Consequences, and Detection of Suspicious Data in AI-based Systems

03/19/2022
by Harald Foidl, et al.

High data quality is fundamental for today's AI-based systems. However, although data quality has been an object of research for decades, there is a clear lack of research on potential data quality issues (e.g., ambiguous or extraneous values). These kinds of issues are latent in nature and thus often not obvious. Nevertheless, they can be associated with an increased risk of future problems in AI-based systems (e.g., technical debt, data-induced faults). As a counterpart to code smells in software engineering, we refer to such issues as Data Smells. This article conceptualizes data smells and elaborates on their causes, consequences, detection, and use in the context of AI-based systems. In addition, a catalogue of 36 data smells divided into three categories (i.e., Believability Smells, Understandability Smells, Consistency Smells) is presented. Moreover, the article outlines tool support for detecting data smells and presents the results of an initial smell detection on more than 240 real-world datasets.
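To make the idea of latent, suspicious values concrete, the following is a minimal sketch of what a rule-based check might look like. It is not the article's tooling: the function name, the two check names, and the placeholder token list are all hypothetical and chosen only for illustration.

```python
# Hypothetical placeholder tokens; not taken from the article's catalogue.
PLACEHOLDER_TOKENS = {"", "n/a", "na", "none", "null", "unknown", "-", "?"}


def find_suspicious_values(values):
    """Return {check_name: [row indexes]} for two illustrative checks.

    The values are not necessarily wrong; they are only flagged as
    worth a closer look, in the spirit of a "smell".
    """
    findings = {"placeholder_value": [], "surrounding_whitespace": []}
    for index, value in enumerate(values):
        if not isinstance(value, str):
            continue  # this sketch only inspects string values
        if value != value.strip():
            # Extraneous leading or trailing whitespace.
            findings["surrounding_whitespace"].append(index)
        if value.strip().lower() in PLACEHOLDER_TOKENS:
            # Ambiguous stand-in for a missing value.
            findings["placeholder_value"].append(index)
    return findings


cities = ["Innsbruck", " Vienna", "N/A", "Graz", "unknown ", ""]
report = find_suspicious_values(cities)
print(report)
```

Here the entries at indexes 1 and 4 are flagged for surrounding whitespace, and the entries at indexes 2, 4, and 5 are flagged as placeholder values. A flagged value is a hint for review, not proof of an error.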

research
05/05/2021

Software Engineering for AI-Based Systems: A Survey

AI-based systems are software systems with functionalities enabled by at...
research
02/10/2021

Quality Assurance for AI-based Systems: Overview and Challenges

The number and importance of AI-based systems in all domains is growing....
research
01/12/2022

The Human Factor in AI Safety

AI-based systems have been used widely across various industries for dif...
research
07/19/2023

Towards green AI-based software systems: an architecture-centric approach (GAISSA)

Nowadays, AI-based systems have achieved outstanding results and have ou...
research
11/18/2022

Indexing AI Risks with Incidents, Issues, and Variants

Two years after publicly launching the AI Incident Database (AIID) as a ...
research
05/05/2022

Monitoring AI systems: A Problem Analysis, Framework and Outlook

Knowledge-based systems have been used to monitor machines and processes...
research
07/09/2023

Understanding Persistent-Memory Related Issues in the Linux Kernel

Persistent memory (PM) technologies have inspired a wide range of PM-bas...
