Bilingual Document Alignment with Latent Semantic Indexing

07/29/2017
by Ulrich Germann, et al.

We apply cross-lingual Latent Semantic Indexing to the Bilingual Document Alignment Task at WMT16. Reduced-rank singular value decomposition of a bilingual term-document matrix, derived from known English/French page pairs in the training data, allows us to map monolingual documents into a joint semantic space. Two variants of cosine similarity between the vectors that place each document in the joint semantic space are combined with a measure of string similarity between the corresponding URLs to produce 1:1 alignments of English/French web pages in a variety of domains. The system achieves a recall of ca. 88% and 93%. Analysing the system's errors on the training data, we argue that evaluating aligners on exact URL matches underestimates their true performance, and we propose an alternative that accounts for duplicates and near-duplicates in the underlying data.
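The final step described in the abstract, combining vector similarity with URL string similarity to obtain 1:1 alignments, can be illustrated with a minimal sketch. It assumes the documents have already been mapped into the joint semantic space; the abstract does not say how the scores are combined or how the 1:1 constraint is enforced, so the linear weighting (`url_weight`), the use of `difflib.SequenceMatcher` for URL similarity, the greedy matching, and the example URLs and vectors are all illustrative assumptions rather than the authors' method.

```python
import math
from difflib import SequenceMatcher


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def url_similarity(url_a, url_b):
    """String similarity of two URLs in [0, 1] (assumed measure, not the paper's)."""
    return SequenceMatcher(None, url_a, url_b).ratio()


def align(en_docs, fr_docs, url_weight=0.3):
    """Greedy 1:1 alignment of English and French pages.

    en_docs, fr_docs: dicts mapping URL -> vector in the joint semantic space.
    url_weight: hypothetical interpolation weight for the URL score.
    Returns a list of (en_url, fr_url, score), best pairs first.
    """
    scored = []
    for en_url, en_vec in en_docs.items():
        for fr_url, fr_vec in fr_docs.items():
            score = ((1.0 - url_weight) * cosine(en_vec, fr_vec)
                     + url_weight * url_similarity(en_url, fr_url))
            scored.append((score, en_url, fr_url))

    # Take candidate pairs in order of decreasing score, using each page once.
    scored.sort(reverse=True)
    used_en, used_fr, pairs = set(), set(), []
    for score, en_url, fr_url in scored:
        if en_url in used_en or fr_url in used_fr:
            continue
        used_en.add(en_url)
        used_fr.add(fr_url)
        pairs.append((en_url, fr_url, score))
    return pairs


# Made-up three-dimensional document vectors for illustration only.
en_docs = {
    "http://example.com/en/about": [0.9, 0.1, 0.0],
    "http://example.com/en/news": [0.1, 0.8, 0.3],
}
fr_docs = {
    "http://example.com/fr/about": [0.8, 0.2, 0.1],
    "http://example.com/fr/news": [0.0, 0.9, 0.2],
}
for en_url, fr_url, score in align(en_docs, fr_docs):
    print(en_url, "<->", fr_url, round(score, 3))
```

Greedy matching is only one way to enforce the 1:1 constraint; an optimal assignment over the full score matrix would be a natural alternative.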

research
01/31/2020

Massively Multilingual Document Alignment with Cross-lingual Sentence-Mover's Distance

Cross-lingual document alignment aims to identify pairs of documents in ...
research
11/10/2019

A Massive Collection of Cross-Lingual Web-Document Pairs

Cross-lingual document alignment aims to identify pairs of documents in ...
research
11/02/2020

Cross-Lingual Document Retrieval with Smooth Learning

Cross-lingual document search is an information retrieval task in which ...
research
04/30/2020

Exploiting Sentence Order in Document Alignment

In this work, we exploit the simple idea that a document and its transla...
research
07/29/2018

Fast derivation of neural network based document vectors with distance constraint and negative sampling

A universal cross-lingual representation of documents is very important ...
research
11/28/2014

Coarse-grained Cross-lingual Alignment of Comparable Texts with Topic Models and Encyclopedic Knowledge

We present a method for coarse-grained cross-lingual alignment of compar...
research
05/22/2023

Towards Unsupervised Recognition of Semantic Differences in Related Documents

Automatically highlighting words that cause semantic differences between...
