HELFI: a Hebrew-Greek-Finnish Parallel Bible Corpus with Cross-Lingual Morpheme Alignment

03/16/2020
by Anssi Yli-Jyrä, et al.

Twenty-five years ago, morphologically aligned Hebrew-Finnish and Greek-Finnish bitexts (texts accompanied by a translation) were constructed manually in order to create an analytical concordance (Luoto et al., 1997) for a Finnish Bible translation. The creators of the bitexts recently secured the publisher's permission to release the fine-grained alignment, but the alignment still depended on proprietary third-party resources, such as a copyrighted text edition and proprietary morphological analyses of the source texts. In this paper, we describe a nontrivial editorial process that starts from the creation of the original single-purpose database and ends with its reconstruction using only freely available text editions and annotations. This process produced an openly available dataset that contains (i) the source texts and their translations, (ii) the morphological analyses, and (iii) the cross-lingual morpheme alignments.
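
To make the three components of the dataset concrete, the following is a minimal Python sketch of how one aligned verse might be represented in memory. It is not the released file format: the class names (`Morpheme`, `AlignedVerse`), the field names, the tag labels, and the example segmentation of the first word of Genesis 1:1 are all assumptions made for illustration and are not drawn from the dataset itself.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Morpheme:
    """One morpheme with its analysis (hypothetical tag labels)."""
    form: str      # surface form, transliterated here for readability
    analysis: str  # morphological tag; the tag set is an assumption


@dataclass
class AlignedVerse:
    """Source morphemes, target morphemes, and links between them."""
    verse_id: str
    source: List[Morpheme]   # Hebrew or Greek side
    target: List[Morpheme]   # Finnish side
    # Each link pairs an index into `source` with an index into `target`.
    links: List[Tuple[int, int]] = field(default_factory=list)

    def aligned_pairs(self) -> List[Tuple[str, str]]:
        """Return (source form, target form) for every alignment link."""
        return [(self.source[s].form, self.target[t].form)
                for s, t in self.links]


# Illustrative example only: a hand-made segmentation, not dataset content.
verse = AlignedVerse(
    verse_id="Gen 1:1",
    source=[Morpheme("be", "PREP"), Morpheme("reshit", "NOUN")],
    target=[Morpheme("alu", "NOUN"), Morpheme("ssa", "CASE=INE")],
    links=[(0, 1), (1, 0)],  # preposition <-> case ending, noun <-> stem
)
pairs = verse.aligned_pairs()
print(pairs)
```

Storing links as index pairs, rather than embedding target morphemes inside source records, keeps the texts, the analyses, and the alignments separable, which mirrors the three-part description in the abstract.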

11/28/2014

Coarse-grained Cross-lingual Alignment of Comparable Texts with Topic Models and Encyclopedic Knowledge

We present a method for coarse-grained cross-lingual alignment of compar...
09/19/2018

Unsupervised cross-lingual matching of product classifications

Unsupervised cross-lingual embeddings mapping has provided a unique tool...
12/02/2015

Annotating Character Relationships in Literary Texts

We present a dataset of manually annotated relationships between charact...
05/18/2019

Cross-referencing using Fine-grained Topic Modeling

Cross-referencing, which links passages of text to other related passage...
09/13/2021

A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space

In cross-lingual language models, representations for many different lan...
03/12/2019

Bootstrapping Method for Developing Part-of-Speech Tagged Corpus in Low Resource Languages Tagset - A Focus on an African Igbo

Most languages, especially in Africa, have fewer or no established part-...
05/13/2020

Validation and Normalization of DCS corpus using Sanskrit Heritage tools to build a tagged Gold Corpus

The Digital Corpus of Sanskrit records around 650,000 sentences along wi...