A Distributed Automatic Domain-Specific Multi-Word Term Recognition Architecture using Spark Ecosystem

05/24/2023
by Ciprian-Octavian Truică, et al.

Automatic Term Recognition is used to extract terms that belong to a given domain. To be accurate, these corpus- and language-dependent methods require large volumes of textual data, which must be processed to extract candidate terms that are afterward scored according to a given metric. To improve text preprocessing and the extraction and scoring of candidate terms, we propose a distributed Spark-based architecture that automatically extracts domain-specific terms. The main contributions are as follows: (1) we propose a novel distributed automatic domain-specific multi-word term recognition architecture built on top of the Spark ecosystem; (2) we perform an in-depth analysis of our architecture in terms of accuracy and scalability; (3) we design an easy-to-integrate Python implementation that enables the use of Big Data processing in fields such as Computational Linguistics and Natural Language Processing. We empirically demonstrate the feasibility of our architecture through experiments on two real-world datasets.
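The extract-then-score pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it uses only the Python standard library to mimic the map and reduce stages that a Spark job would distribute, treats every word n-gram as a candidate term, and scores candidates with a length-weighted frequency chosen purely for illustration. The function names (`candidate_terms`, `count_partition`, `merge_counts`, `score_terms`) and the scoring formula are assumptions, not taken from the paper.

```python
from collections import Counter
from functools import reduce
from itertools import chain


def candidate_terms(doc, n_min=2, n_max=3):
    """Yield every word n-gram (n_min..n_max words) as a candidate multi-word term."""
    tokens = doc.lower().split()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def count_partition(docs):
    """Map stage: count the candidates found in one partition of documents."""
    return Counter(chain.from_iterable(candidate_terms(d) for d in docs))


def merge_counts(left, right):
    """Reduce stage: merge the partial counts of two partitions."""
    return left + right


def score_terms(partitions):
    """Score candidates by frequency times term length (an illustrative stand-in metric)."""
    partial_counts = map(count_partition, partitions)
    total = reduce(merge_counts, partial_counts, Counter())
    return {term: freq * len(term.split()) for term, freq in total.items()}


if __name__ == "__main__":
    # Two hypothetical partitions, as if the corpus were split across two workers.
    partitions = [
        ["spark cluster manager", "spark cluster"],
        ["spark cluster node"],
    ]
    scores = score_terms(partitions)
    for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(term, score)
```

Because the per-partition counts are merged with an associative operation, the same structure would map onto a distributed aggregation; a real system would also need linguistic filtering of candidates, which this sketch omits.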

research
01/22/2021

Unsupervised Technical Domain Terms Extraction using Term Extractor

Terminology extraction, also known as term extraction, is a subtask of i...
research
10/13/2021

FlexiTerm: A more efficient implementation of flexible multi-word term recognition

Terms are linguistic signifiers of domain-specific concepts. Automated r...
research
11/10/2020

MotePy: A domain specific language for low-overhead machine learning and data processing

A domain specific language (DSL), named MotePy is presented. The DSL off...
research
04/17/2023

Political corpus creation through automatic speech recognition on EU debates

In this paper, we present a transcribed corpus of the LIBE committee of ...
research
03/04/2018

Data Curation with Deep Learning [Vision]: Towards Self Driving Data Curation

Past. Data curation - the process of discovering, integrating, and clean...
research
11/09/2017

SemRe-Rank: Incorporating Semantic Relatedness to Improve Automatic Term Extraction Using Personalized PageRank

Automatic Term Extraction deals with the extraction of terminology from ...
research
11/23/2016

ATR4S: Toolkit with State-of-the-art Automatic Terms Recognition Methods in Scala

Automatically recognized terminology is widely used for various domain-s...
