A Novel Challenge Set for Hebrew Morphological Disambiguation and Diacritics Restoration

10/06/2020
by Avi Shmidman, et al.

One of the primary tasks of morphological parsers is the disambiguation of homographs. Particularly difficult are cases of unbalanced ambiguity, where one of the possible analyses is far more frequent than the others. In such cases, there may be too few examples of the minority analyses either to evaluate performance properly or to train effective classifiers. In this paper we address the issue of unbalanced morphological ambiguities in Hebrew. We offer a challenge set for Hebrew homographs, the first of its kind, containing substantial attestation of each analysis of 21 Hebrew homographs. We show that the current state of the art in Hebrew disambiguation performs poorly on cases of unbalanced ambiguity. Leveraging our new dataset, we achieve a new state of the art for all 21 words, improving the overall average F1 score from 0.67 to 0.95. Our resulting annotated datasets are made publicly available for further research.
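
To illustrate why unbalanced ambiguity is hard to evaluate, the following minimal sketch scores a baseline that always picks the majority analysis of a single homograph. Everything in it is an assumption made for illustration: the `noun`/`verb` labels, the 95/5 split, and the toy data are hypothetical, and the abstract does not say how the reported F1 scores are averaged, so the unweighted (macro) average used here is only one plausible choice.

```python
def f1_per_class(gold, pred):
    """Return {label: F1} for every label that appears in gold or pred."""
    scores = {}
    for label in set(gold) | set(pred):
        pairs = list(zip(gold, pred))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is 0.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


# Hypothetical toy data: one homograph whose two analyses occur 95 vs. 5 times.
gold = ["noun"] * 95 + ["verb"] * 5
# Baseline that always predicts the majority analysis.
pred = ["noun"] * 100

accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
scores = f1_per_class(gold, pred)
macro_f1 = sum(scores.values()) / len(scores)

print(f"accuracy = {accuracy:.2f}")
print(f"per-class F1 = {scores}")
print(f"macro F1 = {macro_f1:.2f}")
```

In this toy setting the baseline looks strong on accuracy while its F1 for the minority analysis is zero, and with only five minority examples that score rests on very little evidence. A test set with substantial attestation of every analysis avoids that problem.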

