Funnelling: A New Ensemble Method for Heterogeneous Transfer Learning and its Application to Polylingual Text Classification

01/31/2019
by Andrea Esuli, et al.

Polylingual Text Classification (PLC) consists of automatically classifying, according to a common set C of classes, documents each written in one of a set of languages L, and doing so more accurately than when naively classifying each document via its corresponding language-specific classifier. To increase classification accuracy for a given language, the system thus needs to also leverage the training examples written in the other languages. We tackle multilabel PLC via funnelling, a new ensemble learning method that we propose here. Funnelling consists of generating a two-tier classification system in which all documents, irrespective of language, are classified by the same (2nd-tier) classifier. For this classifier, all documents are represented in a common, language-independent feature space consisting of the posterior probabilities generated by 1st-tier, language-dependent classifiers. This allows the classification of all test documents, of any language, to benefit from the information present in all training documents, of any language. We present substantial experiments, run on publicly available polylingual text collections, in which funnelling is shown to significantly outperform a number of state-of-the-art baselines. All code and datasets (in vector form) are made publicly available.
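The two-tier idea described above can be illustrated with a minimal sketch. It is not the authors' released code: the toy corpus, the class names, the `NaiveBayes` and `NearestCentroid` helpers, and the `classify` function are all hypothetical and chosen only to keep the example dependency-free. It also simplifies the setting to single-label classification with two classes, and it computes the training posteriors on the same documents the 1st-tier classifiers were fitted on.

```python
import math
from collections import Counter

CLASSES = ["sport", "finance"]  # hypothetical common class set C

class NaiveBayes:
    """Tiny multinomial Naive Bayes, standing in for a 1st-tier classifier."""
    def fit(self, docs, labels):
        self.vocab = {w for d in docs for w in d.split()}
        self.prior = Counter(labels)
        self.counts = {c: Counter() for c in CLASSES}
        for d, y in zip(docs, labels):
            self.counts[y].update(d.split())
        return self

    def predict_proba(self, doc):
        """Return a vector of posterior probabilities, one per class."""
        n_docs = sum(self.prior.values())
        logs = []
        for c in CLASSES:
            total = sum(self.counts[c].values()) + len(self.vocab)
            lp = math.log((self.prior[c] + 1) / (n_docs + len(CLASSES)))
            for w in doc.split():
                if w in self.vocab:  # ignore words unseen in training
                    lp += math.log((self.counts[c][w] + 1) / total)
            logs.append(lp)
        m = max(logs)
        exps = [math.exp(l - m) for l in logs]
        z = sum(exps)
        return [e / z for e in exps]

class NearestCentroid:
    """Simple 2nd-tier classifier over posterior-probability vectors."""
    def fit(self, vectors, labels):
        self.centroids = {}
        for c in CLASSES:
            rows = [v for v, y in zip(vectors, labels) if y == c]
            self.centroids[c] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict(self, v):
        def dist(c):
            return sum((a - b) ** 2 for a, b in zip(v, self.centroids[c]))
        return min(CLASSES, key=dist)

# Toy training data in two languages (illustrative only).
train = {
    "en": [("the team won the match", "sport"),
           ("goal scored in the final match", "sport"),
           ("the bank raised interest rates", "finance"),
           ("stock market and bank shares fell", "finance")],
    "it": [("la squadra ha vinto la partita", "sport"),
           ("gol segnato nella partita finale", "sport"),
           ("la banca ha alzato i tassi", "finance"),
           ("la borsa e le azioni della banca", "finance")],
}

# 1st tier: one classifier per language, each over its own vocabulary.
first_tier = {}
meta_X, meta_y = [], []
for lang, examples in train.items():
    docs = [d for d, _ in examples]
    labels = [y for _, y in examples]
    clf = NaiveBayes().fit(docs, labels)
    first_tier[lang] = clf
    # Map every training document into the shared posterior space.
    meta_X.extend(clf.predict_proba(d) for d in docs)
    meta_y.extend(labels)

# 2nd tier: a single classifier trained on documents of all languages.
meta = NearestCentroid().fit(meta_X, meta_y)

def classify(doc, lang):
    """Route a document through its language's 1st tier, then the shared 2nd tier."""
    return meta.predict(first_tier[lang].predict_proba(doc))
```

In this sketch, calling `classify("la banca e i tassi", "it")` first turns the Italian text into a two-dimensional posterior vector and then hands that vector to the same 2nd-tier classifier that also saw the English training documents, which is the sense in which every test document can benefit from training data in every language.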

