Fast derivation of neural network based document vectors with distance constraint and negative sampling

07/29/2018
by Wei Li, et al.

A universal cross-lingual representation of documents is important for many natural language processing tasks. In this paper, we present a document vectorization method that creates document vectors via a self-attention mechanism within a neural machine translation (NMT) framework. The model can be trained on parallel corpora that are unrelated to the task at hand. At test time, our method takes a monolingual document and converts it into a "Neural machine Translation framework based crosslingual Document Vector with distance constraint training" (cNTDV). cNTDV follows up on our previous research on the NMT framework based document vector. It produces the document vector from a forward pass of the encoder, which makes it fast. Moreover, it is trained with a distance constraint, so that the document vectors obtained from different language pairs are consistent with each other. In a cross-lingual document classification task, our cNTDV embeddings surpass the published state-of-the-art performance in the English-to-German classification test and, to the best of our knowledge, achieve the second-best performance in the German-to-English classification test. Compared with our previous research, cNTDV does not need a translator at test time, which makes the model faster and more convenient.
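
The abstract names three ingredients: attention-based pooling of encoder states into a document vector, a distance constraint that pulls vectors of parallel documents together, and (in the title) negative sampling. The following is a minimal sketch of how these pieces could fit together, not the authors' implementation. The function names, the dot-product scoring against a single query vector, the squared Euclidean distance, and the hinge form of the negative-sample term are all assumptions made for illustration; the abstract does not specify any of them.

```python
import math
from typing import List, Sequence

Vector = List[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(states: Sequence[Vector], query: Vector) -> Vector:
    """Collapse encoder states into one document vector.

    Assumption: weights come from a dot product between each state and a
    (hypothetical) learned query vector, followed by a softmax.
    """
    scores = [sum(h * q for h, q in zip(state, query)) for state in states]
    weights = softmax(scores)
    dim = len(states[0])
    return [sum(w * state[i] for w, state in zip(weights, states))
            for i in range(dim)]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def distance_constraint_loss(src_vec: Vector, tgt_vec: Vector,
                             negative_vecs: Sequence[Vector],
                             margin: float = 1.0) -> float:
    """Assumed loss: pull parallel documents together, push sampled
    non-parallel documents at least `margin` further away (hinge term)."""
    positive = squared_distance(src_vec, tgt_vec)
    loss = positive
    for neg in negative_vecs:
        loss += max(0.0, margin + positive - squared_distance(src_vec, neg))
    return loss


if __name__ == "__main__":
    english_states = [[1.0, 0.0], [0.0, 1.0]]   # toy encoder outputs
    german_states = [[0.9, 0.1], [0.1, 0.9]]
    query = [0.0, 0.0]                           # hypothetical learned query
    en_vec = attention_pool(english_states, query)
    de_vec = attention_pool(german_states, query)
    unrelated = [[3.0, 4.0]]                     # sampled negative document
    print(distance_constraint_loss(en_vec, de_vec, unrelated))
```

Under these assumptions, test-time inference only needs `attention_pool` applied to the encoder states of a monolingual document, which matches the abstract's point that no translator is required once training is done.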

