N-gram-Based Low-Dimensional Representation for Document Classification

12/19/2014
by Rémi Lebret, et al.

The bag-of-words (BOW) model is the common approach for classifying documents, where words are used as features for training a classifier. This generally involves a huge number of features. Some techniques, such as Latent Semantic Analysis (LSA) or Latent Dirichlet Allocation (LDA), have been designed to summarize documents in a lower dimension with the least semantic information loss. Some semantic information is nevertheless always lost, since only single words are considered. Instead, we aim to use information coming from n-grams to overcome this limitation, while remaining in a low-dimensional space. Many approaches, such as the Skip-gram model, provide good word vector representations very quickly. We propose to average these representations to obtain representations of n-grams. All n-grams are thus embedded in the same semantic space. K-means clustering can then group them into semantic concepts. The number of features is therefore dramatically reduced, and documents can be represented as bags of semantic concepts. We show that this model outperforms LSA and LDA on a sentiment classification task, and yields results similar to a traditional BOW model with far fewer features.
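The pipeline the abstract describes can be sketched in a few steps: average word vectors to embed n-grams, cluster the n-grams with K-means, then count cluster assignments to get a bag-of-concepts document representation. The following is a minimal illustration, not the authors' implementation; the toy random vectors stand in for pretrained Skip-gram embeddings, and the tiny corpus, vocabulary, and choice of bigrams with K=2 are all assumptions for demonstration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy word embeddings standing in for pretrained Skip-gram vectors
# (in practice these would come from word2vec; values here are illustrative).
rng = np.random.default_rng(0)
vocab = ["good", "great", "bad", "awful", "movie", "plot"]
word_vec = {w: rng.normal(size=4) for w in vocab}

def ngram_vector(ngram):
    """Embed an n-gram as the mean of its word vectors."""
    return np.mean([word_vec[w] for w in ngram], axis=0)

# Collect all bigrams from a toy two-document corpus.
docs = [["good", "movie", "great", "plot"],
        ["awful", "movie", "bad", "plot"]]
ngrams = sorted({tuple(d[i:i + 2]) for d in docs for i in range(len(d) - 1)})
X = np.stack([ngram_vector(g) for g in ngrams])

# Group the n-gram embeddings into K semantic concepts.
K = 2
km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X)

def bag_of_concepts(doc):
    """Represent a document by counting the concept (cluster)
    each of its bigrams falls into, instead of raw n-gram features."""
    counts = np.zeros(K)
    for i in range(len(doc) - 1):
        vec = ngram_vector(tuple(doc[i:i + 2]))[None, :]
        counts[km.predict(vec)[0]] += 1
    return counts

features = np.stack([bag_of_concepts(d) for d in docs])
print(features.shape)  # feature dimension is K, not the n-gram vocabulary size
```

The resulting classifier input has only K columns, which is the dimensionality reduction the abstract claims over the raw bag-of-n-grams.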


