A Survey on Data Collection for Machine Learning: a Big Data - AI Integration Perspective

11/08/2018
by Yuji Roh, et al.

Data collection is a major bottleneck in machine learning and an active research topic in multiple communities. There are largely two reasons data collection has recently become a critical issue. First, as machine learning becomes more widely used, we are seeing new applications that do not necessarily have enough labeled data. Second, unlike traditional machine learning, where feature engineering is the bottleneck, deep learning techniques generate features automatically but instead require large amounts of labeled data. Interestingly, recent research in data collection comes not only from the machine learning, natural language, and computer vision communities, but also from the data management community, owing to the importance of handling large amounts of data. In this survey, we perform a comprehensive study of data collection from a data management point of view. Data collection largely consists of data acquisition, data labeling, and improvement of existing data or models. We present a research landscape of these operations, give guidelines on which technique to use when, and identify interesting research challenges. The integration of machine learning and data management for data collection is part of a larger trend of Big Data and Artificial Intelligence (AI) integration and opens many opportunities for new research.


