A Part-of-Speech Tagger for Yiddish: First Steps in Tagging the Yiddish Book Center Corpus

04/03/2022
by Seth Kulick, et al.

We describe the construction and evaluation of a part-of-speech tagger for Yiddish, the first one to the best of our knowledge. This is the first step in a larger project of automatically assigning part-of-speech tags and syntactic structure to Yiddish text for purposes of linguistic research. We combine two resources for the current work: an 80K-word subset of the Penn Parsed Corpus of Historical Yiddish (PPCHY) (Santorini, 2021) and 650 million words of OCR'd Yiddish text from the Yiddish Book Center (YBC). We compute word embeddings on the YBC corpus and use them with a tagger model trained and evaluated on the PPCHY. Yiddish orthography in the YBC corpus has many spelling inconsistencies, and we present some evidence that even simple non-contextualized embeddings are able to capture the relationships among spelling variants without the need to first "standardize" the corpus. We evaluate tagger performance on a 10-fold cross-validation split, with and without the embeddings, and show that the embeddings improve performance. However, a great deal of work remains to be done, and we conclude by discussing some next steps, including the need for additional annotated training and test data.
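As an illustration of the 10-fold cross-validation setup mentioned in the abstract, here is a minimal sketch of how annotated sentences could be divided into ten train/test splits. The abstract does not say how the folds were built, so the round-robin assignment, the function name `k_fold_splits`, and the placeholder sentences are all assumptions made for this example, not the authors' procedure.

```python
from typing import Iterator, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def k_fold_splits(items: Sequence[T], k: int = 10) -> Iterator[Tuple[List[T], List[T]]]:
    """Yield (train, test) pairs so that each item is in exactly one test fold."""
    if k < 2 or k > len(items):
        raise ValueError("k must be between 2 and len(items)")
    # Assumption: round-robin assignment, i.e. item j goes to fold j % k.
    for i in range(k):
        test = [x for j, x in enumerate(items) if j % k == i]
        train = [x for j, x in enumerate(items) if j % k != i]
        yield train, test


# Hypothetical stand-ins for annotated sentences; not data from the paper.
sentences = [f"sentence_{n}" for n in range(25)]
folds = list(k_fold_splits(sentences, k=10))
```

In a setup like this, a tagger would be trained on each `train` list and scored on the matching `test` list, once with and once without the embeddings, and the ten scores averaged per condition.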


