TF-CR: Weighting Embeddings for Text Classification

by Arkaitz Zubiaga, et al.

Text classification, the task of assigning categories to textual instances, is a very common task in information science. Methods that learn distributed representations of words, such as word embeddings, have become popular in recent years as features for text classification. Despite their increasing use, word embeddings are generally applied in an unsupervised manner, i.e. information derived from class labels in the training data is not exploited. While word embeddings inherently capture the distributional characteristics of words and of the contexts observed around them in a large dataset, they are not optimised to account for the distribution of words across categories in the classification dataset at hand. To optimise embedding-based text representations by incorporating class distributions from the training data, we propose weighting schemes that assign a weight to the embedding of each word based on its saliency in each class. To this end, we introduce a novel weighting scheme, Term Frequency-Category Ratio (TF-CR), which assigns higher weights to high-frequency, category-exclusive words when computing word embeddings. Our experiments on 16 classification datasets show the effectiveness of TF-CR, which leads to improved performance scores over existing weighting schemes, with a performance gap that increases as the size of the training data grows.
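
The abstract does not state the TF-CR formula, so the following is only a minimal sketch of how such a weight could be computed. It assumes the weight of a word in a category is the product of its term frequency within that category (count in the category divided by all tokens in the category) and its category ratio (count in the category divided by its count across all categories). The function name `tf_cr_weights` and the toy data are hypothetical and used purely for illustration.

```python
from collections import Counter, defaultdict


def tf_cr_weights(docs, labels):
    """Return {category: {word: weight}}.

    Assumption (not taken from the abstract): weight = TF * CR, where
    TF = count of the word in the category / tokens in the category and
    CR = count of the word in the category / count across all categories.
    """
    per_cat = defaultdict(Counter)  # word counts within each category
    total = Counter()               # word counts across all categories
    for tokens, label in zip(docs, labels):
        per_cat[label].update(tokens)
        total.update(tokens)

    weights = {}
    for cat, counts in per_cat.items():
        n_cat = sum(counts.values())  # number of tokens in this category
        weights[cat] = {
            word: (count / n_cat) * (count / total[word])
            for word, count in counts.items()
        }
    return weights


# Toy, hypothetical training data: tokenised documents with class labels.
docs = [["goal", "match", "goal"], ["vote", "match"]]
labels = ["sports", "politics"]
weights = tf_cr_weights(docs, labels)
```

Under this assumed definition, "goal" is both frequent in and exclusive to the "sports" category, so it receives a higher weight there than "match", which is shared with "politics". The resulting per-category weights could then be used to scale each word's embedding before the embeddings are aggregated into a document representation.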

