Language comparison via network topology

07/16/2019
by Blaž Škrlj, et al.

Modeling relations between languages can offer insight into language characteristics and uncover similarities and differences between languages. Automated methods applied to large textual corpora open opportunities for novel statistical studies of language development over time, as well as for improving cross-lingual natural language processing techniques. In this work, we first propose how to represent textual data as a directed, weighted network using the text2net algorithm. We next explore how various fast network-topological metrics, such as network community structure, can be used for cross-lingual comparisons. In our experiments, we employ eight different network topology metrics and empirically showcase, on a parallel corpus, how the methods can be used for modeling the relations between nine selected languages. We demonstrate that the proposed method scales to large corpora consisting of hundreds of thousands of aligned sentences on an off-the-shelf laptop. We observe that some properties, such as communities, capture known differences between the languages, while others can be seen as novel opportunities for linguistic studies.
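The idea of turning text into a directed, weighted network and then comparing languages by topology can be illustrated with a minimal sketch. This is not the text2net algorithm itself, whose details are not given here; it assumes whitespace tokenization and that an edge (u, v) is weighted by how often token v directly follows token u. The function names `text_to_network` and `topology_summary`, the toy sentences, and the three summary metrics are all hypothetical choices made for illustration.

```python
from collections import Counter


def text_to_network(sentences):
    """Build a directed, weighted token network.

    Assumption: edge (u, v) counts how often token v directly follows
    token u within a sentence. The real text2net construction may differ.
    """
    edges = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for u, v in zip(tokens, tokens[1:]):
            edges[(u, v)] += 1
    return edges


def topology_summary(edges):
    """Compute a few simple topology metrics for a weighted edge dictionary."""
    nodes = {node for edge in edges for node in edge}
    n, m = len(nodes), len(edges)
    # Directed density; a repeated token would create a self-loop, which
    # this simple formula does not treat specially.
    density = m / (n * (n - 1)) if n > 1 else 0.0
    mean_out_degree = m / n if n else 0.0
    return {
        "nodes": n,
        "edges": m,
        "density": density,
        "mean_out_degree": mean_out_degree,
    }


# Toy aligned sentences, made up for illustration only.
parallel_corpus = {
    "en": ["the cat sat on the mat", "the dog sat on the rug"],
    "de": ["die katze saß auf der matte", "der hund saß auf dem teppich"],
}

for language, sentences in parallel_corpus.items():
    print(language, topology_summary(text_to_network(sentences)))
```

Comparing the resulting metric vectors across languages, for example by a distance between them, is one simple way such summaries could feed a cross-lingual comparison; community structure and the other metrics mentioned above would need a dedicated graph library.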


