Learning Cross-lingual Embeddings from Twitter via Distant Supervision

05/17/2019
by Jose Camacho-Collados, et al.

Cross-lingual embeddings represent the meaning of words from different languages in the same vector space. Recent work has shown that such representations can be constructed by aligning independently learned monolingual embedding spaces, and that accurate alignments can be obtained even without external bilingual data. In this paper we explore a research direction that has been surprisingly neglected in the literature: leveraging noisy user-generated text to learn cross-lingual embeddings tailored to social media applications. While the noisiness and informal nature of the social media genre pose additional challenges to cross-lingual embedding methods, we find that they also provide key opportunities, owing to the abundance of code-switching and the existence of a shared vocabulary of emoji and named entities. Our contribution is a very simple post-processing step that exploits these phenomena to significantly improve the performance of state-of-the-art alignment methods.
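To make the idea of aligning two monolingual spaces through a shared vocabulary concrete, here is a minimal sketch. It is not the post-processing step proposed in the paper, which the abstract does not spell out; it only illustrates the general technique of orthogonal (Procrustes) alignment, assuming that tokens spelled identically in both vocabularies (for example emoji or named entities) can serve as anchor pairs. The function names `procrustes_align` and `align_with_shared_tokens` are hypothetical.

```python
import numpy as np


def procrustes_align(src, tgt):
    """Return the orthogonal matrix W minimizing ||src @ W - tgt||_F."""
    # Closed-form solution of the orthogonal Procrustes problem via SVD.
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt


def align_with_shared_tokens(src_vocab, src_emb, tgt_vocab, tgt_emb):
    """Map src_emb into the target space using identically spelled tokens.

    Assumption (illustrative only): tokens that appear in both vocabularies,
    such as emoji or named entities, mean the same thing in both languages.
    """
    tgt_index = {word: i for i, word in enumerate(tgt_vocab)}
    pairs = [(i, tgt_index[word])
             for i, word in enumerate(src_vocab) if word in tgt_index]
    if not pairs:
        raise ValueError("no shared tokens to use as anchors")
    src_idx, tgt_idx = (list(t) for t in zip(*pairs))
    w = procrustes_align(src_emb[src_idx], tgt_emb[tgt_idx])
    # Apply the learned rotation to every source-language vector.
    return src_emb @ w, w
```

Restricting `W` to be orthogonal preserves distances and angles within the source space, so the monolingual structure is left intact while the two spaces are brought into a common orientation.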


