Dealing with Abbreviations in the Slovenian Biographical Lexicon

11/04/2022
by Angel Daza, et al.

Abbreviations present a significant challenge for NLP systems because they cause tokenization and out-of-vocabulary errors. They can also make text less readable, especially in printed reference books, where they are used extensively. Abbreviations are especially problematic in low-resource settings, where systems are less robust to begin with. In this paper, we propose a new method for addressing the problems caused by a high density of domain-specific abbreviations in a text. We apply this method to the case of a Slovenian biographical lexicon and evaluate it on a newly developed gold-standard dataset of 51 Slovenian biographies. Our abbreviation identification method performs significantly better than commonly used ad-hoc solutions, especially at identifying unseen abbreviations. We also propose a method for expanding the identified abbreviations in context and present its results.

