Accessing Higher Dimensions for Unsupervised Word Translation

05/23/2023
by Sida I. Wang, et al.

The striking ability of unsupervised word translation has been demonstrated with the help of word vectors and pretraining; however, these methods require large amounts of data and usually fail if the data come from different domains. We propose coocmap, a method that can use either high-dimensional co-occurrence counts or their lower-dimensional approximations. Freed from the limits of low dimensions, we show that relying on low-dimensional vectors and their incidental properties misses out on better denoising methods and useful world knowledge in high dimensions, thus stunting the potential of the data. Our results show that unsupervised translation can be achieved more easily and robustly than previously thought: less than 80MB and minutes of CPU time are required to achieve over 50% accuracy for English to Finnish, Hungarian, and Chinese translations when trained on similar data. Even under domain mismatch, we show that coocmap still works fully unsupervised on English NewsCrawl to Chinese Wikipedia and English Europarl to Spanish Wikipedia, among others. These results challenge prevailing assumptions on the necessity and superiority of low-dimensional vectors, and suggest that similarly processed co-occurrences can outperform dense vectors on other tasks too.
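For readers unfamiliar with the term, the following is a minimal sketch of what "high-dimensional co-occurrence counts" can look like in practice. It is not the coocmap method itself, which the abstract does not spell out. The function names, the window size, and the toy corpus are all assumptions chosen for illustration.

```python
from collections import Counter


def cooccurrence_counts(sentences, window=2):
    """Count how often each ordered word pair appears within `window` tokens.

    Illustrative helper (hypothetical name), not taken from the paper.
    """
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            stop = min(len(tokens), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts


def row_vector(counts, word, vocab):
    """Represent `word` by its counts against every vocabulary word."""
    return [counts[(word, other)] for other in vocab]


# Toy corpus (an assumption): two short tokenized sentences.
toy_corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
counts = cooccurrence_counts(toy_corpus, window=1)
vocab = sorted({w for s in toy_corpus for w in s})

# Each word's vector has one dimension per vocabulary word.
cat_vector = row_vector(counts, "cat", vocab)
dog_vector = row_vector(counts, "dog", vocab)
```

In this toy corpus, "cat" and "dog" occur in identical contexts, so their count vectors coincide. With a realistic vocabulary each vector has as many dimensions as there are words, which is the high-dimensional setting the abstract contrasts with dense low-dimensional word vectors.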


