MultiWiki: Interlingual Text Passage Alignment in Wikipedia

05/21/2019
by Simon Gottschalk, et al.

In this article we address the problem of text passage alignment across interlingual article pairs in Wikipedia. We develop methods that identify and interlink text passages that are written in different languages and contain overlapping information. Interlingual text passage alignment can help Wikipedia editors and readers better understand the language-specific context of entities, provide valuable insights into cultural differences, and build a basis for qualitative analysis of the articles. An important challenge in this context is the trade-off between the granularity of the extracted text passages and the precision of the alignment: short text passages can be aligned more precisely, whereas longer text passages give a better overview of the differences in an article pair. To better understand these aspects from the user perspective, we conduct a user study on the German, Russian and English Wikipedia and collect a user-annotated benchmark. We then propose MultiWiki, a method that takes an integrated approach to text passage alignment, using semantic similarity measures and greedy algorithms, and achieves precise results with respect to the user-defined alignment. A MultiWiki demonstration is publicly available and currently supports four language pairs.
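The abstract does not say which similarity measures or which greedy procedure MultiWiki uses, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes each passage has already been reduced to a set of language-independent terms (for example, linked entities or dates), uses Jaccard overlap as a stand-in for a semantic similarity measure, and pairs passages greedily one-to-one. The names `jaccard`, `greedy_align` and `threshold` are hypothetical.

```python
from itertools import product


def jaccard(a, b):
    """Overlap of two term sets; a stand-in for a real semantic similarity measure."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def greedy_align(left, right, threshold=0.3):
    """Greedily pair passages one-to-one.

    left, right: lists of term sets, one per passage (assumed input format).
    Returns (left_index, right_index, score) tuples, best pairs first.
    """
    # Score every candidate pair, then sort by descending similarity
    # (ties broken by index so the result is deterministic).
    scored = sorted(
        (
            (jaccard(l, r), i, j)
            for (i, l), (j, r) in product(enumerate(left), enumerate(right))
        ),
        key=lambda t: (-t[0], t[1], t[2]),
    )
    used_left, used_right, pairs = set(), set(), []
    for score, i, j in scored:
        if score < threshold:
            break  # every remaining pair is weaker than the cut-off
        if i in used_left or j in used_right:
            continue  # one of the two passages is already aligned
        used_left.add(i)
        used_right.add(j)
        pairs.append((i, j, score))
    return pairs
```

In this sketch, raising `threshold` trades coverage for precision. The granularity trade-off described above would show up in how the input passages are segmented, which is outside the scope of the sketch.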

