Legal document retrieval across languages: topic hierarchies based on synsets

11/28/2019
by   Carlos Badenes-Olmedo, et al.

Cross-lingual annotations of legislative texts enable us to explore the major themes covered in multilingual legal data and are a key facilitator of semantic similarity when searching for similar documents. Multilingual probabilistic topic models have recently emerged as a group of semi-supervised machine learning models that can be used to perform thematic explorations on collections of texts in multiple languages. However, these approaches require theme-aligned training data to create a language-independent space, which limits the number of scenarios where this technique can be used. In this work, we provide an unsupervised document similarity algorithm based on hierarchies of multilingual concepts to describe topics across languages. The algorithm does not require parallel or comparable corpora, or any other type of translation resource. Experiments performed on the English, Spanish, French and Portuguese editions of the JRC-Acquis corpus reveal promising results on classifying and sorting documents by similar content.
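To make the idea of comparing documents through concept hierarchies more concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not give the algorithm's details, so everything below is an illustrative assumption. In particular, it assumes that each language has its own topic model, that every topic is described by a set of language-independent synset identifiers (the `syn:` labels are made up), that a document's topics are grouped into hierarchy levels by rank, and that two hierarchies are compared with a level-wise Jaccard score. The names `concept_hierarchy`, `hierarchy_similarity` and `level_sizes` are hypothetical.

```python
from typing import Dict, List, Sequence, Set

Synsets = Set[str]  # hypothetical language-independent synset identifiers


def concept_hierarchy(topic_weights: Dict[str, float],
                      topic_synsets: Dict[str, Synsets],
                      level_sizes: Sequence[int] = (1, 2)) -> List[Synsets]:
    """Group a document's topics into levels by relevance (assumed scheme)
    and describe each level by the union of its topics' synsets."""
    ranked = sorted(topic_weights, key=topic_weights.get, reverse=True)
    hierarchy: List[Synsets] = []
    start = 0
    for size in level_sizes:
        concepts: Synsets = set()
        for topic in ranked[start:start + size]:
            concepts |= topic_synsets[topic]
        hierarchy.append(concepts)
        start += size
    return hierarchy


def jaccard(a: Synsets, b: Synsets) -> float:
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def hierarchy_similarity(h1: List[Synsets], h2: List[Synsets]) -> float:
    """Average the Jaccard score over corresponding hierarchy levels."""
    return sum(jaccard(a, b) for a, b in zip(h1, h2)) / len(h1)


# Toy topic descriptions for two independently trained models (made-up data).
en_topics = {"en_t0": {"syn:fishing", "syn:vessel"},
             "en_t1": {"syn:tax", "syn:customs"},
             "en_t2": {"syn:agriculture"}}
es_topics = {"es_t0": {"syn:tax", "syn:customs"},
             "es_t1": {"syn:fishing", "syn:vessel"},
             "es_t2": {"syn:energy"}}

# Topic distributions of one English and one Spanish document (made-up data).
h_en = concept_hierarchy({"en_t0": 0.7, "en_t1": 0.2, "en_t2": 0.1}, en_topics)
h_es = concept_hierarchy({"es_t1": 0.6, "es_t0": 0.3, "es_t2": 0.1}, es_topics)

score = hierarchy_similarity(h_en, h_es)
print(score)  # 0.75
```

In this toy setup the two documents never share words or topic identifiers; they are comparable only because their topics map to the same synset labels, which is the role the abstract assigns to multilingual concepts.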


research
12/15/2020

Scalable Cross-lingual Document Similarity through Language-specific Concept Hierarchies

With the ongoing growth in number of digital articles in a wider set of ...
research
11/10/2019

A Massive Collection of Cross-Lingual Web-Document Pairs

Cross-lingual document alignment aims to identify pairs of documents in ...
research
01/31/2020

Massively Multilingual Document Alignment with Cross-lingual Sentence-Mover's Distance

Cross-lingual document alignment aims to identify pairs of documents in ...
research
07/15/2020

A Multilingual Parallel Corpora Collection Effort for Indian Languages

We present sentence aligned parallel corpora across 10 Indian Languages ...
research
06/11/2018

Learning Multilingual Topics from Incomparable Corpus

Multilingual topic models enable crosslingual tasks by extracting consis...
research
11/30/2021

Bilingual Topic Models for Comparable Corpora

Probabilistic topic models like Latent Dirichlet Allocation (LDA) have b...
research
08/02/2021

PyEuroVoc: A Tool for Multilingual Legal Document Classification with EuroVoc Descriptors

EuroVoc is a multilingual thesaurus that was built for organizing the le...
