Exploiting Cross-Lingual Subword Similarities in Low-Resource Document Classification

12/22/2018
by Mozhi Zhang, et al.

Text classification must sometimes be applied in situations with no training data in a target language. However, training data may be available in a related language. We introduce CACO, a cross-lingual document classification framework for related language pairs. To best use limited training data, our transfer learning scheme exploits cross-lingual subword similarity by jointly training a character-based embedder and a word-based classifier. The embedder derives vector representations for input words from their written forms, and the classifier makes predictions based on the word vectors. We use a joint character representation for both the source language and the target language, which allows the embedder to generalize knowledge about source language words to target language words with similar forms. We propose a multi-task objective that can further improve the model if additional cross-lingual or monolingual resources are available. CACO models trained under low-resource settings rival cross-lingual word embedding models trained under high-resource settings on related language pairs.
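To make the architecture concrete, here is a minimal PyTorch sketch of the two components the abstract describes. This is an illustrative reconstruction, not the authors' released code: the character-level BiLSTM embedder, the averaging feed-forward classifier, and all module names, dimensions, and toy data are assumptions.

```python
import torch
import torch.nn as nn

class CharEmbedder(nn.Module):
    """Derives a word vector from the word's written form.

    The character vocabulary is shared by the source and target
    languages, so similarly spelled words receive similar vectors.
    """
    def __init__(self, n_chars, char_dim=32, word_dim=128):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.lstm = nn.LSTM(char_dim, word_dim // 2,
                            bidirectional=True, batch_first=True)

    def forward(self, char_ids):              # (n_words, max_chars)
        hidden, _ = self.lstm(self.char_emb(char_ids))
        return hidden.mean(dim=1)             # (n_words, word_dim)

class DocClassifier(nn.Module):
    """Averages a document's word vectors and predicts its label."""
    def __init__(self, word_dim, n_labels):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(word_dim, word_dim), nn.ReLU(),
                                nn.Linear(word_dim, n_labels))

    def forward(self, word_vecs):             # (n_words, word_dim)
        return self.ff(word_vecs.mean(dim=0, keepdim=True))

# Joint training on source-language documents; at test time the same
# embedder maps unseen target-language words into the shared space.
embedder = CharEmbedder(n_chars=100)
classifier = DocClassifier(word_dim=128, n_labels=4)
opt = torch.optim.Adam(list(embedder.parameters()) +
                       list(classifier.parameters()), lr=1e-3)

doc_chars = torch.randint(1, 100, (20, 12))   # toy doc: 20 words x 12 chars
label = torch.tensor([2])
loss = nn.functional.cross_entropy(classifier(embedder(doc_chars)), label)
loss.backward()
opt.step()
```

Because the character embedding table is shared across languages, a target-language word whose spelling resembles a source-language word lands near it in embedding space, which is the generalization the abstract relies on. The multi-task objective would add further loss terms to this same optimization step, for example pulling dictionary-aligned word pairs together, when such resources exist.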


Related research

06/08/2019 · Improving Low-Resource Cross-lingual Document Retrieval by Reranking with Deep Bilingual Representations
In this paper, we propose to boost low-resource cross-lingual document r...

07/22/2020 · Effects of Language Relatedness for Cross-lingual Transfer Learning in Character-Based Language Models
Character-based Neural Network Language Models (NNLM) have the advantage...

12/19/2017 · Cross-language Framework for Word Recognition and Spotting of Indic Scripts
Handwritten word recognition and spotting of low-resource scripts are di...

06/23/2019 · Cross-lingual Data Transformation and Combination for Text Classification
Text classification is a fundamental task for text data mining. In order...

03/03/2023 · Team Hitachi at SemEval-2023 Task 3: Exploring Cross-lingual Multi-task Strategies for Genre and Framing Detection in Online News
This paper explains the participation of team Hitachi to SemEval-2023 Ta...

11/12/2018 · Unseen Word Representation by Aligning Heterogeneous Lexical Semantic Spaces
Word embedding techniques heavily rely on the abundance of training data...

03/28/2022 · Isomorphic Cross-lingual Embeddings for Low-Resource Languages
Cross-Lingual Word Embeddings (CLWEs) are a key component to transfer li...
