Visual Grounding in Video for Unsupervised Word Translation

by Gunnar A. Sigurdsson et al.

There are thousands of actively spoken languages on Earth, but a single visual world. Grounding in this visual world has the potential to bridge the gap between all these languages. Our goal is to use visual grounding to improve unsupervised word mapping between languages. The key idea is to establish a common visual representation between two languages by learning embeddings from unpaired instructional videos narrated in the native language. Given this shared embedding we demonstrate that (i) we can map words between the languages, particularly the 'visual' words; (ii) that the shared embedding provides a good initialization for existing unsupervised text-based word translation techniques, forming the basis for our proposed hybrid visual-text mapping algorithm, MUVE; and (iii) our approach achieves superior performance by addressing the shortcomings of text-based methods – it is more robust, handles datasets with less commonality, and is applicable to low-resource languages. We apply these methods to translate words from English to French, Korean, and Japanese – all without any parallel corpora and simply by watching many videos of people speaking while doing things.
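The core mapping step described above can be pictured as nearest-neighbor retrieval in a shared embedding space: once words from both languages are embedded in a common (visually grounded) space, a source word is translated by finding the closest target-language word. The sketch below illustrates only this retrieval step with toy hand-made vectors; the vocabularies, vectors, and the `translate` helper are illustrative assumptions, not the paper's code (in MUVE the embeddings come from training on narrated videos).

```python
import numpy as np

def translate(word, src_emb, tgt_emb):
    """Toy illustration: map a source word to the target word with the
    highest cosine similarity in a shared embedding space."""
    v = src_emb[word]
    v = v / np.linalg.norm(v)
    best, best_sim = None, -1.0
    for w, u in tgt_emb.items():
        sim = float(v @ (u / np.linalg.norm(u)))
        if sim > best_sim:
            best, best_sim = w, sim
    return best

# Hypothetical shared space in which 'visual' words cluster
# near their translations (2-D for readability).
src_emb = {"dog": np.array([1.0, 0.1]), "cat": np.array([0.1, 1.0])}
tgt_emb = {"chien": np.array([0.9, 0.2]), "chat": np.array([0.2, 0.9])}

print(translate("dog", src_emb, tgt_emb))  # -> chien
```

In the paper this retrieval happens over embeddings learned jointly with video, which is what makes the space shared across languages without any parallel text.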




Mapping Supervised Bilingual Word Embeddings from English to low-resource languages

It is very challenging to work with low-resource languages due to the in...

Emergent Translation in Multi-Agent Communication

While most machine translation systems to date are trained on large para...

Unsupervised Separation of Native and Loanwords for Malayalam and Telugu

Quite often, words from one language are adopted within a different lang...

A Visual Attention Grounding Neural Model for Multimodal Machine Translation

We introduce a novel multimodal machine translation model that utilizes ...

Unsupervised Cross-Modal Alignment of Speech and Text Embedding Spaces

Recent research has shown that word embedding spaces learned from text c...

Revisiting Adversarial Autoencoder for Unsupervised Word Translation with Cycle Consistency and Improved Training

Adversarial training has shown impressive success in learning bilingual ...

Joint Transformer/RNN Architecture for Gesture Typing in Indic Languages

Gesture typing is a method of typing words on a touch-based keyboard by ...

Code Repositories


Project page for "Visual Grounding in Video for Unsupervised Word Translation" CVPR 2020
