Funnelling: A New Ensemble Method for Heterogeneous Transfer Learning and its Application to Cross-Lingual Text Classification

01/31/2019
by Andrea Esuli, et al.

Cross-lingual Text Classification (CLC) consists of automatically classifying, according to a common set C of classes, documents each written in one of a set of languages L, and doing so more accurately than when naively classifying each document via its corresponding language-specific classifier. To increase the classification accuracy for a given language, the system thus also needs to leverage the training examples written in the other languages. We tackle multilabel CLC via funnelling, a new ensemble learning method that we propose here. Funnelling consists of generating a two-tier classification system in which all documents, irrespective of language, are classified by the same (2nd-tier) classifier. For this classifier, all documents are represented in a common, language-independent feature space consisting of the posterior probabilities generated by 1st-tier, language-dependent classifiers. This allows the classification of all test documents, of any language, to benefit from the information present in all training documents, of any language. We present extensive experiments, run on publicly available multilingual text collections, in which funnelling is shown to significantly outperform a number of state-of-the-art baselines. All code and datasets (in vector form) are made publicly available.
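
The two-tier idea described above can be illustrated with a minimal sketch. This is not the authors' released code: the tiny logistic-regression learner, the toy data, and the names `BinaryLogReg`, `MultilabelClassifier`, `train_funnelling` and `classify` are all assumptions made for illustration. The sketch also maps training documents into the posterior space using the very classifiers trained on them, which is a simplification; the abstract does not say how the training posteriors are obtained.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

class BinaryLogReg:
    """Tiny logistic regression trained by batch gradient descent (stand-in learner)."""
    def __init__(self, n_features, lr=0.5, epochs=500):
        self.w, self.b = [0.0] * n_features, 0.0
        self.lr, self.epochs = lr, epochs

    def predict_proba(self, x):
        return sigmoid(sum(w * xj for w, xj in zip(self.w, x)) + self.b)

    def fit(self, X, y):
        n = len(X)
        for _ in range(self.epochs):
            # Errors are computed once per epoch, before any weight is updated.
            errs = [self.predict_proba(x) - t for x, t in zip(X, y)]
            for j in range(len(self.w)):
                self.w[j] -= self.lr * sum(e * x[j] for e, x in zip(errs, X)) / n
            self.b -= self.lr * sum(errs) / n
        return self

class MultilabelClassifier:
    """One independent binary classifier per class (multilabel setting)."""
    def __init__(self, n_features, n_classes):
        self.models = [BinaryLogReg(n_features) for _ in range(n_classes)]

    def fit(self, X, Y):
        for c, model in enumerate(self.models):
            model.fit(X, [labels[c] for labels in Y])
        return self

    def posteriors(self, x):
        # Vector of per-class posterior probabilities: always n_classes long.
        return [model.predict_proba(x) for model in self.models]

def train_funnelling(train_by_lang, n_classes):
    """train_by_lang maps a language to (X, Y); feature spaces may differ per language."""
    first_tier, meta_X, meta_Y = {}, [], []
    for lang, (X, Y) in train_by_lang.items():
        clf = MultilabelClassifier(len(X[0]), n_classes).fit(X, Y)
        first_tier[lang] = clf
        # Simplification: project training documents with the classifier trained on them.
        meta_X.extend(clf.posteriors(x) for x in X)
        meta_Y.extend(Y)
    # The 2nd-tier classifier sees documents of every language in the same space.
    meta = MultilabelClassifier(n_classes, n_classes).fit(meta_X, meta_Y)
    return first_tier, meta

def classify(first_tier, meta, lang, x, threshold=0.5):
    z = first_tier[lang].posteriors(x)  # language-independent representation
    return [int(p >= threshold) for p in meta.posteriors(z)]

# Toy data (hypothetical): English documents have 2 features, Italian ones have 3.
train = {
    "en": ([[1, 0], [0, 1], [1, 1], [0, 0]],
           [[1, 0], [0, 1], [1, 1], [0, 0]]),
    "it": ([[1, 0, 0], [0, 0, 1], [1, 0, 1], [0, 1, 0]],
           [[1, 0], [0, 1], [1, 1], [0, 0]]),
}
first_tier, meta = train_funnelling(train, n_classes=2)
print(classify(first_tier, meta, "en", [1, 0]))
print(classify(first_tier, meta, "it", [0, 0, 1]))
```

The point of the sketch is that the English and Italian documents start in feature spaces of different dimensionality, yet both reach the 2nd-tier classifier as vectors of length |C|, so that classifier is trained on the union of all training documents.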

