A New Approach for Semi-automatic Building and Extending a Multilingual Terminology Thesaurus

03/26/2019
by   Adam Rambousek, et al.
0

This paper describes a new system for semi-automatically building, extending and managing a terminological thesaurus---a multilingual terminology dictionary enriched with relationships between the terms themselves to form a thesaurus. The system allows to radically enhance the workflow of current terminology expert groups, where most of the editing decisions still come from introspection. The presented system supplements the lexicographic process with natural language processing techniques, which are seamlessly integrated to the thesaurus editing environment. The system's methodology and the resulting thesaurus are closely connected to new domain corpora in the six languages involved. They are used for term usage examples as well as for the automatic extraction of new candidate terms. The terminological thesaurus is now accessible via a web-based application, which a) presents rich detailed information on each term, b) visualizes term relations, and c) displays real-life usage examples of the term in the domain-related documents and in the context-based similar terms. Furthermore, the specialized corpora are used to detect candidate translations of terms from the central language (Czech) to the other languages (English, French, German, Russian and Slovak) as well as to detect broader Czech terms, which help to place new terms in the actual thesaurus hierarchy. This project has been realized as a terminological thesaurus of land surveying, but the presented tools and methodology are reusable for other terminology domains.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
12/12/2022

Ensembling Transformers for Cross-domain Automatic Term Extraction

Automatic term extraction plays an essential role in domain language und...
research
03/22/2021

Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets

With the success of large-scale pre-training and multilingual modeling i...
research
10/03/2017

MMCR4NLP: Multilingual Multiway Corpora Repository for Natural Language Processing

Multilinguality is gradually becoming ubiquitous in the sense that more ...
research
10/07/2020

Improving Sentiment Analysis over non-English Tweets using Multilingual Transformers and Automatic Translation for Data-Augmentation

Tweets are specific text data when compared to general text. Although se...
research
09/03/2022

Multilingual ColBERT-X

ColBERT-X is a dense retrieval model for Cross Language Information Retr...
research
02/10/2023

Building cross-language corpora for human understanding of privacy policies

Making sure that users understand privacy policies that impact them is a...

Please sign up or login with your details

Forgot password? Click here to reset