Cross-lingual Data Transformation and Combination for Text Classification

06/23/2019
by Jun Jiang, et al.

Text classification is a fundamental task in text data mining. Training a generalizable model requires a large volume of text, and when data in one language is insufficient, cross-lingual data may be needed to fill the gap. Cross-lingual data sources may, however, suffer from data incompatibility, because text written in different languages can have distinct word sequences and semantic patterns. Machine translation and word embedding alignment provide an effective way to transform and combine data for cross-lingual training. To the best of our knowledge, little work has evaluated how the methodology used for semantic space transformation and data combination affects the performance of classification models trained on cross-lingual resources. In this paper, we systematically evaluate two commonly used text classifiers, a CNN (Convolutional Neural Network) and an RNN (Recurrent Neural Network), under different data transformation and combination strategies. Monolingual models were trained on English and French data, alongside their translated and aligned embeddings. Our results suggest that semantic space transformation may conditionally improve the performance of monolingual models. Bilingual models were trained on a combination of English and French data. Our results indicate that a cross-lingual classification model can benefit significantly from cross-lingual data by learning from translated or aligned embedding spaces.
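As an illustration of what word embedding alignment followed by data combination can look like, here is a minimal sketch using orthogonal Procrustes, a commonly used alignment technique. The abstract does not say which alignment method the authors used, so this choice is an assumption. The function name `procrustes_align`, the toy vectors, and the 4-dimensional embedding size are all hypothetical.

```python
import numpy as np


def procrustes_align(src: np.ndarray, tgt: np.ndarray) -> np.ndarray:
    """Return the orthogonal matrix W minimizing ||src @ W - tgt||_F.

    src and tgt hold embeddings of dictionary word pairs, one pair per row.
    """
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt


# Toy data (hypothetical): 50 "French" vectors and their "English"
# counterparts, built so that an exact rotation relates the two spaces.
rng = np.random.default_rng(0)
fr_vecs = rng.normal(size=(50, 4))
true_rotation, _ = np.linalg.qr(rng.normal(size=(4, 4)))
en_vecs = fr_vecs @ true_rotation

# Learn the mapping from the French space into the English space.
w = procrustes_align(fr_vecs, en_vecs)
fr_aligned = fr_vecs @ w

# Combine: stack English and aligned French embeddings into one matrix
# that a single classifier could be trained on.
combined = np.vstack([en_vecs, fr_aligned])
```

With real embeddings the two spaces are only approximately related by a rotation, so the recovered mapping reduces the mismatch between languages rather than removing it.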

