Hierarchical Document Encoder for Parallel Corpus Mining

06/20/2019
by   Mandy Guo, et al.
0

We explore using multilingual document embeddings for nearest neighbor mining of parallel data. Three document-level representations are investigated: (i) document embeddings generated by simply averaging multilingual sentence embeddings; (ii) a neural bag-of-words (BoW) document encoding model; (iii) a hierarchical multilingual document encoder (HiDE) that builds on our sentence-level model. The results show document embeddings derived from sentence-level averaging are surprisingly effective for clean datasets, but suggest models trained hierarchically at the document-level are more effective on noisy data. Analysis experiments demonstrate our hierarchical models are very robust to variations in the underlying sentence embedding quality. Using document embeddings trained with HiDE achieves state-of-the-art performance on United Nations (UN) parallel document mining, 94.9 for en-es.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
02/22/2019

Improving Multilingual Sentence Embedding using Bi-directional Dual Encoder with Additive Margin Softmax

In this paper, we present an approach to learn multilingual sentence emb...
research
07/31/2018

Effective Parallel Corpus Mining using Bilingual Sentence Embeddings

This paper presents an effective approach for parallel corpus mining usi...
research
04/28/2023

Are the Best Multilingual Document Embeddings simply Based on Sentence Embeddings?

Dense vector representations for textual data are crucial in modern NLP....
research
05/11/2023

A General-Purpose Multilingual Document Encoder

Massively multilingual pretrained transformers (MMTs) have tremendously ...
research
11/03/2018

Margin-based Parallel Corpus Mining with Multilingual Sentence Embeddings

Machine translation is highly sensitive to the size and quality of the t...
research
05/10/2022

Sentence-level Privacy for Document Embeddings

User language data can contain highly sensitive personal content. As suc...
research
07/19/2016

An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation

Recently, Le and Mikolov (2014) proposed doc2vec as an extension to word...

Please sign up or login with your details

Forgot password? Click here to reset