Recurrent Neural Network Language Model Adaptation Derived Document Vector

11/01/2016
by Wei Li, et al.

In many natural language processing (NLP) tasks, a document is commonly modeled as a bag of words using the term frequency-inverse document frequency (TF-IDF) vector. One major shortcoming of the frequency-based TF-IDF feature vector is that it ignores word order, which carries syntactic and semantic relationships among the words in a document and can be important in some NLP tasks such as genre classification. This paper proposes a novel distributed vector representation of a document: a simple recurrent-neural-network language model (RNN-LM) or a long short-term memory RNN language model (LSTM-LM) is first trained on all documents in a task; some of the LM parameters are then adapted to each document, and the adapted parameters are vectorized to represent that document. The new document vectors are labeled DV-RNN and DV-LSTM, respectively. We believe that these document vectors can capture high-level sequential information in the documents that other current document representations fail to capture. The new document vectors were evaluated on genre classification of documents in three corpora: the Brown Corpus, the BNC Baby Corpus and an artificially created Penn Treebank dataset. Their classification performance is compared with that of the TF-IDF vector and the state-of-the-art distributed memory model of paragraph vectors (PV-DM). The results show that DV-LSTM significantly outperforms TF-IDF and PV-DM in most cases, and that combining the proposed document vectors with TF-IDF or PV-DM may further improve performance.
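To make the adaptation idea concrete, below is a minimal sketch in PyTorch: a small LSTM language model is assumed to have been trained on all documents, a copy of it is briefly fine-tuned on a single document, and the adapted parameters are flattened into that document's vector. Which parameters are adapted (here, only the output projection), the model sizes, and the training schedule are illustrative assumptions, not the authors' exact recipe.

import copy
import torch
import torch.nn as nn

class LSTMLM(nn.Module):
    def __init__(self, vocab_size, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, ids):
        hidden, _ = self.lstm(self.embed(ids))
        return self.out(hidden)

def lm_loss(model, ids):
    # Next-word prediction loss over one token sequence of shape (1, T).
    logits = model(ids[:, :-1])
    return nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), ids[:, 1:].reshape(-1)
    )

def document_vector(base_model, doc_ids, steps=5, lr=0.1):
    # Copy the background LM, adapt only the output layer on this single
    # document, and return the flattened adapted weights as its vector.
    model = copy.deepcopy(base_model)
    adapted = list(model.out.parameters())
    optimizer = torch.optim.SGD(adapted, lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        lm_loss(model, doc_ids).backward()
        optimizer.step()
    return torch.cat([p.detach().flatten() for p in adapted])

# Usage (hypothetical data): after training `base` on the whole corpus,
# each tokenized document yields a fixed-length DV-LSTM-style vector,
# which can then feed a genre classifier, alone or concatenated with
# TF-IDF or PV-DM features.
base = LSTMLM(vocab_size=1000)
doc = torch.randint(0, 1000, (1, 50))
vec = document_vector(base, doc)

Because every document adapts the same parameter subset, all documents map to vectors of the same dimensionality, which is what makes the adapted weights usable as a fixed-length document representation.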
