Measuring Spark on AWS: A Case Study on Mining Scientific Publications with Annotation Query

02/02/2018
by Darin McBeath, et al.
Annotation Query (AQ) is a program for querying many different types of NLP annotations on a text, as well as the original content and structure of the text. The query results may provide new annotations, or they may select subsets of the content and annotations for deeper processing. Like GATE's Mimir, AQ is based on region algebras. Our implementation of AQ runs on a Spark cluster. In this paper we look at how AQ's runtimes are affected by the size of the collection, the number of nodes in the cluster, the type of node, and the characteristics of the queries. Cluster size, of course, makes a large difference in performance so long as skew can be avoided. We find that there is minimal difference in performance between persisting annotations serialized to local SSD drives and persisting them deserialized in local memory. We also find that if the number of nodes is kept constant, AWS's storage-optimized instances perform best. But if we factor in total cost, the compute-optimized nodes provide the best performance relative to cost.
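To make the region-algebra idea concrete, here is a minimal Python sketch of two containment operators over annotations represented as character-offset spans. It is only an illustration of the general technique: the `Annotation` class, the `contained_in` and `containing` functions, and the sample offsets are all hypothetical and are not taken from AQ's actual API or from its Spark implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Annotation:
    """A typed region of text (hypothetical structure, not AQ's own)."""
    kind: str   # annotation type, e.g. "sentence" or "gene"
    start: int  # inclusive character offset
    end: int    # exclusive character offset


def _covers(outer: Annotation, inner: Annotation) -> bool:
    # True when `inner` lies entirely within `outer`.
    return outer.start <= inner.start and inner.end <= outer.end


def contained_in(inner: List[Annotation], outer: List[Annotation]) -> List[Annotation]:
    """Keep the `inner` annotations that fall inside at least one `outer` region."""
    return [a for a in inner if any(_covers(o, a) for o in outer)]


def containing(outer: List[Annotation], inner: List[Annotation]) -> List[Annotation]:
    """Keep the `outer` regions that enclose at least one `inner` annotation."""
    return [o for o in outer if any(_covers(o, a) for a in inner)]


# Illustrative data only: two sentences and two gene mentions.
sentences = [Annotation("sentence", 0, 40), Annotation("sentence", 41, 90)]
genes = [Annotation("gene", 10, 15), Annotation("gene", 38, 45)]

# The second gene straddles the sentence boundary, so it is excluded.
genes_in_sentences = contained_in(genes, sentences)
sentences_with_genes = containing(sentences, genes)
```

Each operator takes sets of regions and returns a set of regions, so results can be fed into further queries. This nested-loop version is quadratic and meant only for reading; a distributed implementation would need to partition and sort the annotations rather than compare every pair.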


