Advancing Hungarian Text Processing with HuSpaCy: Efficient and Accurate NLP Pipelines

08/24/2023
by György Orosz, et al.

This paper presents a set of industrial-grade text processing models for Hungarian that achieve near state-of-the-art performance while balancing resource efficiency and accuracy. The models are implemented in the spaCy framework and extend the HuSpaCy toolkit with several improvements to its architecture. In contrast to existing NLP tools for Hungarian, every pipeline covers all basic text processing steps with high accuracy and throughput: tokenization, sentence-boundary detection, part-of-speech tagging, morphological feature tagging, lemmatization, dependency parsing, and named entity recognition. We thoroughly evaluated the proposed enhancements, compared the pipelines with state-of-the-art tools, and demonstrated the competitive performance of the new models in all text preprocessing steps. All experiments are reproducible, and the pipelines are freely available under a permissive license.
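As a rough illustration of how the annotation layers listed above surface in spaCy, the sketch below collects the per-token outputs of a loaded pipeline. It is not taken from the paper: the model package name `hu_core_news_lg` is an assumption about how a HuSpaCy model is installed, and `load_pipeline`, `token_rows`, and `demo` are hypothetical helper names. The token attributes themselves (`text`, `lemma_`, `pos_`, `morph`, `dep_`, `ent_type_`) are standard spaCy `Token` attributes.

```python
from typing import Any, Dict, Iterable, List


def load_pipeline(model_name: str = "hu_core_news_lg") -> Any:
    """Load a spaCy pipeline by package name.

    The default name is an assumption; the package must already be installed.
    """
    import spacy  # imported lazily so token_rows() works without spaCy

    return spacy.load(model_name)


def token_rows(doc: Iterable[Any]) -> List[Dict[str, str]]:
    """Collect one row per token with the annotation layers named in the abstract."""
    rows = []
    for token in doc:
        rows.append(
            {
                "text": token.text,            # tokenization
                "lemma": token.lemma_,         # lemmatization
                "pos": token.pos_,             # part-of-speech tag
                "morph": str(token.morph),     # morphological features
                "dep": token.dep_,             # dependency relation
                "ent_type": token.ent_type_,   # named entity label ("" if none)
            }
        )
    return rows


def demo() -> None:
    """Hypothetical usage: annotate two Hungarian sentences and print the result."""
    nlp = load_pipeline()
    doc = nlp("Budapest Magyarország fővárosa. A Duna két partján fekszik.")
    for sent in doc.sents:  # sentence-boundary detection
        print(sent.text)
    for row in token_rows(doc):
        print(row)
```

Calling `demo()` requires spaCy and the assumed model package. `token_rows` only reads attributes, so it works with any spaCy `Doc` regardless of which pipeline produced it.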


research
09/08/2019

Czech Text Processing with Contextual Embeddings: POS Tagging, Lemmatization, Parsing and NER

Contextualized embeddings, which capture appropriate word meaning depend...
research
07/12/2021

DaCy: A Unified Framework for Danish NLP

Danish natural language processing (NLP) has in recent years obtained co...
research
08/01/2017

Improving Part-of-Speech Tagging for NLP Pipelines

This paper outlines the results of sentence level linguistics based rule...
research
01/09/2021

Trankit: A Light-Weight Transformer-based Toolkit for Multilingual Natural Language Processing

We introduce Trankit, a light-weight Transformer-based Toolkit for multi...
research
03/21/2022

Neural Token Segmentation for High Token-Internal Complexity

Tokenizing raw texts into word units is an essential pre-processing step...
research
08/15/2019

What's Wrong with Hebrew NLP? And How to Make it Right

For languages with simple morphology, such as English, automatic annotat...
research
03/02/2018

Autostacker: A Compositional Evolutionary Learning System

We introduce an automatic machine learning (AutoML) modeling architectur...
