HuSpaCy: an industrial-strength Hungarian natural language processing toolkit

01/06/2022
by György Orosz, et al.

Although several open-source language processing pipelines are available for Hungarian, none of them satisfies the requirements of today's NLP applications. A language processing pipeline should provide close to state-of-the-art lemmatization, morphosyntactic analysis, entity recognition and word embeddings. Industrial text processing applications also have to satisfy non-functional software quality requirements, and frameworks that support multiple languages are increasingly favored. This paper introduces HuSpaCy, an industry-ready Hungarian language processing toolkit. The tool provides components for the most important basic linguistic analysis tasks. It is open-source and available under a permissive license. The system is built upon spaCy's NLP components, resulting in an application that is easy to use, fast, and accurate. Experiments confirm that HuSpaCy achieves high accuracy while remaining resource-efficient at prediction time.

research
01/04/2018

VnCoreNLP: A Vietnamese Natural Language Processing Toolkit

We present an easy-to-use and fast toolkit, namely VnCoreNLP---a Java NL...
research
03/16/2020

Stanza: A Python Natural Language Processing Toolkit for Many Human Languages

We introduce Stanza, an open-source Python natural language processing t...
research
06/14/2021

Launching into clinical space with medspaCy: a new clinical text processing toolkit in Python

Despite impressive success of machine learning algorithms in clinical na...
research
10/26/2018

Magnitude: A Fast, Efficient Universal Vector Embedding Utility Package

Vector space embedding models like word2vec, GloVe, fastText, and ELMo a...
research
12/15/2018

Wikipedia2Vec: An Optimized Implementation for Learning Embeddings from Wikipedia

We present Wikipedia2Vec, an open source tool for learning embeddings of...
research
10/14/2020

fugashi, a Tool for Tokenizing Japanese in Python

Recent years have seen an increase in the number of large-scale multilin...
research
03/02/2021

A Data-Centric Framework for Composable NLP Workflows

Empirical natural language processing (NLP) systems in application domai...
