DaCy: A Unified Framework for Danish NLP

07/12/2021
by Kenneth Enevoldsen, et al.

Danish natural language processing (NLP) has improved considerably in recent years with the addition of multiple new datasets and models. However, there is currently no coherent framework for applying state-of-the-art models to Danish. We present DaCy: a unified framework for Danish NLP built on spaCy. DaCy uses efficient multitask models that obtain state-of-the-art performance on named entity recognition, part-of-speech tagging, and dependency parsing. DaCy also contains tools for easy integration of existing models, such as those for polarity, emotion, or subjectivity detection. In addition, we conduct a series of tests for bias and robustness of Danish NLP pipelines by augmenting the test set of DaNE. DaCy large compares favorably and is especially robust to long inputs, spelling variations, and spelling errors. All models except DaCy large display significant biases related to ethnicity, while only Polyglot shows a significant gender bias. We argue that for languages with limited benchmark sets, data augmentation can be particularly useful for obtaining more realistic and fine-grained performance estimates. We provide a series of augmenters as a first step towards a more thorough evaluation of language models for low- and medium-resource languages and encourage further development.
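
To make the idea of test-set augmentation concrete, here is a minimal sketch in plain Python. It is not DaCy's own augmenter API: the function names, the (tokens, BIO tags) example format, and the sample names are all assumptions chosen for illustration. One function replaces person names (the kind of perturbation used to probe gender or ethnicity bias), the other injects simple spelling noise (to probe robustness).

```python
import random
from typing import List, Sequence, Tuple

# Hypothetical example format: parallel lists of tokens and BIO entity tags.
Example = Tuple[List[str], List[str]]


def swap_person_names(
    example: Example, names: Sequence[Sequence[str]], rng: random.Random
) -> Example:
    """Replace every PER span with a name drawn from `names`, re-aligning tags."""
    tokens, tags = example
    new_tokens: List[str] = []
    new_tags: List[str] = []
    i = 0
    while i < len(tokens):
        if tags[i] == "B-PER":
            # Find the end of the current person span.
            j = i + 1
            while j < len(tokens) and tags[j] == "I-PER":
                j += 1
            replacement = list(rng.choice(names))
            new_tokens.extend(replacement)
            new_tags.extend(["B-PER"] + ["I-PER"] * (len(replacement) - 1))
            i = j
        else:
            new_tokens.append(tokens[i])
            new_tags.append(tags[i])
            i += 1
    return new_tokens, new_tags


def swap_adjacent_chars(
    example: Example, rate: float, rng: random.Random
) -> Example:
    """With probability `rate`, swap two adjacent characters in a token.

    Tags are left unchanged because the token count does not change.
    """
    tokens, tags = example
    noisy: List[str] = []
    for tok in tokens:
        if len(tok) > 1 and rng.random() < rate:
            k = rng.randrange(len(tok) - 1)
            tok = tok[:k] + tok[k + 1] + tok[k] + tok[k + 2:]
        noisy.append(tok)
    return noisy, list(tags)


if __name__ == "__main__":
    rng = random.Random(0)
    sentence: Example = (
        ["Peter", "Schmidt", "bor", "i", "Aarhus"],
        ["B-PER", "I-PER", "O", "O", "B-LOC"],
    )
    # Illustrative name list; a real bias test would use curated name lists.
    print(swap_person_names(sentence, [["Fatima", "Al-Sayed"], ["Mohammed"]], rng))
    print(swap_adjacent_chars(sentence, rate=0.5, rng=rng))
```

Running a tagger on both the original and the augmented examples, then comparing scores, gives a rough per-perturbation estimate of robustness or bias. The seeded `random.Random` keeps the augmented test set reproducible across runs.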

research
08/24/2023

Advancing Hungarian Text Processing with HuSpaCy: Efficient and Accurate NLP Pipelines

This paper presents a set of industrial-grade text processing models for...
research
11/10/2019

CamemBERT: a Tasty French Language Model

Pretrained language models are now ubiquitous in Natural Language Proces...
research
06/04/2019

Back Attention Knowledge Transfer for Low-resource Named Entity Recognition

In recent years, great success has been achieved in the field of natural...
research
06/14/2022

An Experimental Investigation of Part-Of-Speech Taggers for Vietnamese

Part-of-speech (POS) tagging plays an important role in Natural Language...
research
11/18/2021

To Augment or Not to Augment? A Comparative Study on Text Augmentation Techniques for Low-Resource NLP

Data-hungry deep neural networks have established themselves as the stan...
research
10/14/2021

Understanding Model Robustness to User-generated Noisy Texts

Sensitivity of deep-neural models to input noise is known to be a challe...
research
02/27/2020

Understanding and Enhancing Mixed Sample Data Augmentation

Mixed Sample Data Augmentation (MSDA) has received increasing attention ...
