On the experiences of adopting automated data validation in an industrial machine learning project

03/06/2021
by Lucy Ellen Lwakatare, et al.

Background: Data errors are a common challenge in machine learning (ML) projects and generally cause significant performance degradation in ML-enabled software systems. To ensure early detection of erroneous data and avoid training ML models on bad data, research and industrial practice suggest incorporating a data validation process and tool into the ML system development process. Aim: The study investigates the adoption of a data validation process and tool in industrial ML projects. The data validation process demands significant engineering resources for tool development and maintenance. Thus, it is important to identify best practices for its adoption, especially by development teams that are in the early phases of deploying ML-enabled software systems. Method: Action research was conducted at a large software-intensive organization in telecommunications, specifically within the analytics R&D organization, for an ML use case of classifying faults from returned hardware telecommunication devices. Results: Based on the evaluation results and the lessons learned from our action research, we identified three best practices, three benefits, and two barriers to adopting the data validation process and tool in ML projects. We also propose a data validation framework (DVF) for systematizing the adoption of a data validation process. Conclusions: The results show that adopting a data validation process and tool in ML projects is an effective approach to testing ML-enabled software systems. It requires having an overview of the levels of data (feature, dataset, cross-dataset, data stream) at which certain data quality tests can be applied.
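
To make the data levels named in the conclusions more concrete, the following is a minimal sketch of what a data quality test at the feature, dataset, and cross-dataset levels might look like. It is not the DVF or the tool described in the paper; the function names, the feature names ("temp", "volt"), and the thresholds are all hypothetical and chosen only for illustration. The data stream level is left out to keep the sketch short.

```python
from typing import Dict, List, Optional, Set

# A row maps a feature name to a numeric value, or None when the value is missing.
Row = Dict[str, Optional[float]]


def check_feature_range(rows: List[Row], feature: str, low: float, high: float) -> List[int]:
    """Feature-level test: indices of rows whose value is missing or outside [low, high]."""
    bad = []
    for i, row in enumerate(rows):
        value = row.get(feature)
        if value is None or not (low <= value <= high):
            bad.append(i)
    return bad


def check_dataset_completeness(rows: List[Row], required: List[str], max_missing_ratio: float) -> bool:
    """Dataset-level test: True if the share of incomplete rows is within the threshold."""
    if not rows:
        return False  # an empty dataset is treated as failing (an assumption of this sketch)
    incomplete = sum(1 for row in rows if any(row.get(f) is None for f in required))
    return incomplete / len(rows) <= max_missing_ratio


def check_cross_dataset_schema(training: List[Row], serving: List[Row]) -> Set[str]:
    """Cross-dataset test: feature names that appear in only one of the two datasets."""
    train_features = set().union(*(row.keys() for row in training))
    serve_features = set().union(*(row.keys() for row in serving))
    return train_features ^ serve_features


# Hypothetical example data: three device records with two features.
rows = [
    {"temp": 20.0, "volt": 5.0},
    {"temp": 150.0, "volt": 5.1},
    {"temp": None, "volt": 4.9},
]
out_of_range = check_feature_range(rows, "temp", -40.0, 125.0)
complete_enough = check_dataset_completeness(rows, ["temp", "volt"], 0.5)
schema_diff = check_cross_dataset_schema(rows, [{"temp": 1.0}])
```

Each function works at a different scope: a single column, a whole dataset, or a pair of datasets. That separation is the kind of overview the conclusions refer to, although the actual tests used in the study may differ.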


research
09/23/2022

A Preliminary Investigation of MLOps Practices in GitHub

Background. The rapid and growing popularity of machine learning (ML) ap...
research
05/03/2021

Quality Assurance Challenges for Machine Learning Software Applications During Software Development Life Cycle Phases

In the past decades, the revolutionary advances of Machine Learning (ML)...
research
08/04/2022

Development and Validation of ML-DQA – a Machine Learning Data Quality Assurance Framework for Healthcare

The approaches by which the machine learning and clinical research commu...
research
11/23/2020

Resonance: Replacing Software Constants with Context-Aware Models in Real-time Communication

Large software systems tune hundreds of 'constants' to optimize their ru...
research
06/01/2022

Studying the Practices of Deploying Machine Learning Projects on Docker

Docker is a containerization service that allows for convenient deployme...
research
06/30/2021

A Structured Analysis of the Video Degradation Effects on the Performance of a Machine Learning-enabled Pedestrian Detector

ML-enabled software systems have been incorporated in many public demons...
research
01/29/2021

Causal Factors, Benefits and Challenges of Test-Driven Development: Practitioner Perceptions

This report describes the experiences of one organization's adoption of ...
