Words are not Equal: Graded Weighting Model for building Composite Document Vectors

12/11/2015
by Pranjal Singh, et al.

Despite the success of distributional semantics, composing phrases from word vectors remains an important challenge. Several methods have been tried for benchmark tasks such as sentiment classification, including word vector averaging, matrix-vector approaches based on parsing, and on-the-fly learning of paragraph vectors. Most models omit stop words from the composition. Instead of such a yes-no decision, we consider several graded schemes where words are weighted according to their discriminatory relevance with respect to their use in the document (e.g., idf). Some of these methods (particularly tf-idf) are seen to result in a significant improvement in performance over the prior state of the art. Further, combining such approaches into an ensemble based on alternate classifiers such as the RNN model yields a 1.6% performance improvement on the standard IMDB movie review dataset, and a 7.01% improvement on Amazon product reviews. Since these are language-free models and can be obtained in an unsupervised manner, they are also of interest for under-resourced languages such as Hindi and many others. We demonstrate the language-free aspects by showing a gain of 12% over earlier results on the datasets tested, and also release a new larger dataset for future testing (Singh, 2015).
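
The graded weighting idea can be illustrated with a minimal sketch of a tf-idf weighted average of word vectors. This is not the authors' implementation: the function names, the toy two-dimensional word vectors, and the smoothed idf formula (a common variant, chosen here as an assumption) are all hypothetical and serve only to show how stop words are down-weighted rather than dropped.

```python
import math
from collections import Counter


def idf_weights(docs):
    """Smoothed inverse document frequency per token (assumed variant)."""
    n_docs = len(docs)
    # Count the number of documents each token appears in.
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    return {
        tok: math.log((1 + n_docs) / (1 + df)) + 1.0
        for tok, df in doc_freq.items()
    }


def weighted_doc_vector(doc, word_vecs, idf):
    """tf-idf weighted average of the word vectors of one tokenized document."""
    dim = len(next(iter(word_vecs.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for tok, tf in Counter(doc).items():
        if tok not in word_vecs or tok not in idf:
            continue  # skip tokens without a vector or an idf weight
        w = tf * idf[tok]
        for i, x in enumerate(word_vecs[tok]):
            total[i] += w * x
        weight_sum += w
    if weight_sum == 0.0:
        return total
    return [x / weight_sum for x in total]


# Toy corpus and hypothetical 2-d word vectors, for illustration only.
docs = [
    ["the", "movie", "was", "great"],
    ["the", "movie", "was", "awful"],
    ["the", "plot", "was", "great"],
]
word_vecs = {
    "the": [0.0, 0.0],
    "was": [0.0, 0.0],
    "movie": [1.0, 0.0],
    "plot": [1.0, 0.0],
    "great": [0.0, 1.0],
    "awful": [0.0, -1.0],
}

idf = idf_weights(docs)
vec = weighted_doc_vector(docs[0], word_vecs, idf)
```

In this toy corpus, "the" and "was" occur in every document and receive the lowest weight, so the content words contribute more to the document vector than they would under a plain unweighted average.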

research
12/27/2015

Learning Document Embeddings by Predicting N-grams for Sentiment Classification of Long Movie Reviews

Despite the loss of semantic information, bag-of-ngram based methods sti...
research
08/02/2015

Class Vectors: Embedding representation of Document Classes

Distributed representations of words and paragraphs as semantic embeddin...
research
09/12/2020

Improving Bi-LSTM Performance for Indonesian Sentiment Analysis Using Paragraph Vector

Bidirectional Long Short-Term Memory Network (Bi-LSTM) has shown promisi...
research
07/11/2019

No Word is an Island -- A Transformation Weighting Model for Semantic Composition

Composition models of distributional semantics are used to construct phr...
research
04/08/2019

Crosslingual Document Embedding as Reduced-Rank Ridge Regression

There has recently been much interest in extending vector-based word rep...
research
07/29/2015

Document Embedding with Paragraph Vectors

Paragraph Vectors has been recently proposed as an unsupervised method f...
research
01/05/2020

Generating Word and Document Embeddings for Sentiment Analysis

Sentiments of words differ from one corpus to another. Inducing general ...
