Quality-Efficiency Trade-offs in Machine Learning for Text Processing

11/07/2017
by Ricardo Baeza-Yates, et al.

Data mining, machine learning, and natural language processing are powerful techniques that can be combined to extract information from large text collections. Depending on the task at hand, many different approaches are available. These methods are continuously being optimized, but not all of them have been tested and compared on a common set of problems that can be solved with supervised machine learning algorithms. What happens to the quality of the methods if we increase the training data size from, say, 100 MB to over 1 GB? Are the quality gains worth it when the rate of data processing diminishes? Can we trade quality for time efficiency and recover the quality loss simply by processing more data? We attempt to answer these questions in a general way for text processing tasks, considering the trade-offs among training data size, learning time, and quality obtained. We propose a performance trade-off framework and apply it to three important text processing problems: Named Entity Recognition, Sentiment Analysis, and Document Classification. These problems were also chosen because they have different levels of object granularity: words, paragraphs, and documents. For each problem, we selected several supervised machine learning algorithms and evaluated their trade-offs on large publicly available data sets (news, reviews, patents). To explore these trade-offs, we use data subsets of increasing size, ranging from 50 MB to several GB. We also consider the impact of the data set and the evaluation technique. We find that the results do not change significantly and that most of the time the best algorithm is also the fastest. However, we also show that the results for small data (say, less than 100 MB) differ from the results for big data, and in those cases the best algorithm is much harder to determine.
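The kind of measurement the abstract describes (training on subsets of increasing size while recording learning time and quality) can be illustrated with a minimal sketch. This is not the authors' framework: the synthetic corpus, the small Naive Bayes classifier, and the names `make_toy_corpus`, `train_nb`, `predict_nb`, and `tradeoff_curve` are all hypothetical stand-ins, and subset size is counted in documents rather than in MB as in the paper.

```python
import math
import random
import time
from collections import Counter, defaultdict


def make_toy_corpus(n, seed=0):
    """Hypothetical synthetic data: n short 'documents' labeled pos/neg."""
    rng = random.Random(seed)
    signal = {"pos": ["good", "great", "fine"], "neg": ["bad", "poor", "awful"]}
    neutral = ["the", "a", "movie", "plot"]
    docs = []
    for _ in range(n):
        label = rng.choice(["pos", "neg"])
        # Each of the 8 words is a label-specific word with probability 0.4.
        words = [
            rng.choice(signal[label] if rng.random() < 0.4 else neutral)
            for _ in range(8)
        ]
        docs.append((words, label))
    return docs


def train_nb(docs):
    """Count labels and per-label word frequencies (multinomial Naive Bayes)."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    for words, label in docs:
        label_counts[label] += 1
        word_counts[label].update(words)
    vocab = {w for counts in word_counts.values() for w in counts}
    return label_counts, word_counts, vocab


def predict_nb(model, words):
    """Return the label with the highest add-one-smoothed log-probability."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        denom = sum(word_counts[label].values()) + len(vocab)
        score = math.log(n / total)
        for w in words:
            score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def tradeoff_curve(train, predict, train_docs, test_docs, sizes):
    """For each subset size, record training time and test accuracy."""
    rows = []
    for size in sizes:
        subset = train_docs[:size]
        start = time.perf_counter()
        model = train(subset)
        elapsed = time.perf_counter() - start
        correct = sum(predict(model, words) == label for words, label in test_docs)
        rows.append(
            {
                "size": size,
                "train_seconds": elapsed,
                "accuracy": correct / len(test_docs),
            }
        )
    return rows


if __name__ == "__main__":
    corpus = make_toy_corpus(2200)
    train_docs, test_docs = corpus[:2000], corpus[2000:]
    for row in tradeoff_curve(
        train_nb, predict_nb, train_docs, test_docs, [50, 200, 1000, 2000]
    ):
        print(row)
```

Running the same loop with several `train`/`predict` pairs would give one time-versus-quality curve per algorithm, which is the comparison the abstract refers to when it asks whether the best algorithm is also the fastest.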


