A Survey of Neural Network Techniques for Feature Extraction from Text

04/27/2017 ∙ by Vineet John, et al. ∙ University of Waterloo

This paper aims to catalyze the discussions about text feature extraction techniques using neural network architectures. The research questions discussed in the paper focus on the state-of-the-art neural network techniques that have proven to be useful tools for language processing, language generation, text classification and other computational linguistics tasks.




1 Motivation

A majority of the methods currently in use for text-based feature extraction rely on relatively simple statistical techniques, for instance a word co-occurrence model like n-grams or a bag-of-words model like TF-IDF.

The motivation of this research project is to identify and survey the techniques that use neural networks and study them in juxtaposition with the traditional text feature extraction models to show their differences in approach.

Feature extraction of text can be used for a multitude of applications including - but not limited to - unsupervised semantic similarity detection, article classification and sentiment analysis.

The goal of this project is to document the differences, advantages and drawbacks in the domain of feature extraction from text data using neural networks. It also sketches the evolution of such techniques over time.

This report could serve as a quick cheat-sheet for engineers looking to build a text classification or regression pipeline, as the discussion (Section 15) maps use-cases to feature extraction implementation specifics.

2 Research Questions

  • What are the relatively simple statistical techniques to extract features from text?

  • Is there any inherent benefit to using neural networks as opposed to the simple methods?

  • What are the trade-offs that neural networks incur as opposed to the simple methods?

  • How do the different techniques compare to each other in terms of performance and accuracy?

  • In what use-cases do the trade-offs outweigh the benefits of neural networks?

3 Methodology

The research questions listed in Section 2 will be tackled by surveying a few of the important overview papers on the topic (Goldberg, 2016; Bengio et al., 2003; Morin and Bengio, 2005). A few of the groundbreaking research papers in this area will also be studied, including those on word embeddings (Mikolov et al., 2013a; Mikolov et al., 2013b; Mikolov et al., 2013c).

In addition to this, other less-obvious methods of feature extraction will be surveyed, including tasks like part-of-speech tagging, chunking, named entity recognition, and semantic role labeling (Socher et al., 2011; Luong et al., 2013; Maas et al., 2015; Li et al., 2015; Collobert et al., 2011; Pennington et al., 2014).

4 Background

This section provides a high level background of the tasks within Computational Linguistics.

4.1 Part-of-Speech Tagging

  • POS tagging aims to label each word with a unique tag that indicates its syntactic role, like noun, verb, adjective etc.

  • The best POS taggers are based on classifiers trained on windows of text, which are then fed to a bidirectional decoding algorithm during inference.

  • In general, models resemble a bi-directional dependency network, and can be trained using a variety of methods including support vector machines and bi-directional Viterbi decoders.

4.2 Chunking

  • Chunking aims to label segments of a sentence with syntactic constituents such as noun or verb phrases. It is also called shallow parsing and can be viewed as a generalization of part-of-speech tagging to phrases instead of words.

  • Implementations of chunking usually require an underlying POS implementation, after which the words are compounded or chunked by concatenation.

4.3 Named Entity Recognition

  • NER labels atomic elements in a sentence into categories such as “PERSON” or “LOCATION”.

  • Features to train NER classifiers include POS tags, CHUNK tags, prefixes and suffixes, and large lexicons of the labeled entities.

4.4 Semantic Role Labeling

  • SRL aims to assign a semantic role to a syntactic constituent of a sentence.

  • State-of-the-art SRL systems consist of several stages: producing a parse tree, identifying which parse tree nodes represent the arguments of a given verb, and finally classifying these nodes to compute the corresponding SRL tags.

  • SRL systems usually entail numerous features like the parts of speech and syntactic labels of words and nodes in the tree, the syntactic path to the verb in the parse tree, whether a node in the parse tree is part of a noun or verb phrase etc.

5 Document Vectorization

Document vectorization converts text content into a numeric vector representation that can be used as features to train a machine learning model. This section talks about a few different statistical methods for computing this feature vector (John and Vechtomova, 2017).

5.1 N-gram Model

N-grams are contiguous sequences of ‘n’ items from a given sequence of text or speech. Given a complete corpus of documents, each distinct n-gram, of either characters or words, is represented by a unique bit in a bit vector. When aggregated over a body of text, these bits form a sparse vectorized representation of the text in the form of n-gram occurrences.
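The occurrence vector described above can be sketched in a few lines. This is a minimal illustration, assuming whitespace tokenization and word-level bigrams; the function names `word_ngrams`, `build_vocabulary` and `vectorize` are hypothetical and not taken from the surveyed papers.

```python
def word_ngrams(text, n):
    """Return the contiguous word n-grams of `text` as tuples."""
    tokens = text.lower().split()  # assumption: naive whitespace tokenization
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_vocabulary(corpus, n):
    """Assign each distinct n-gram in the corpus a fixed position in the vector."""
    vocab = {}
    for doc in corpus:
        for gram in word_ngrams(doc, n):
            vocab.setdefault(gram, len(vocab))
    return vocab


def vectorize(doc, vocab, n):
    """Binary occurrence vector: 1 where the n-gram occurs in `doc`, else 0."""
    vec = [0] * len(vocab)
    for gram in word_ngrams(doc, n):
        if gram in vocab:
            vec[vocab[gram]] = 1
    return vec


corpus = ["the cat sat on the mat", "the dog sat on the log"]
vocab = build_vocabulary(corpus, 2)
vectors = [vectorize(doc, vocab, 2) for doc in corpus]
```

The two documents share only the bigrams "sat on" and "on the", so their vectors overlap in exactly two positions. With a realistic corpus the vocabulary grows quickly and most entries are zero, which is why these vectors are usually stored in a sparse format.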

5.2 TF-IDF Model

Term frequency - inverse document frequency (TF-IDF) is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus (Sparck Jones, 1972). The TF-IDF value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general. It is a bag-of-words model, and doesn’t preserve word ordering.
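As a rough sketch of this weighting, the snippet below uses one common variant: term frequency as the relative count within a document, and inverse document frequency as log(N / df). Other variants (smoothing, sublinear scaling) exist, and the choice here is an assumption made for illustration; `tf_idf` is a hypothetical helper name.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """Return one {term: weight} dictionary per document."""
    docs = [doc.lower().split() for doc in corpus]  # assumption: whitespace tokens
    n_docs = len(docs)

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


corpus = ["the cat sat", "the dog barked", "the cat purred"]
weights = tf_idf(corpus)
```

In this toy corpus "the" occurs in every document and receives a weight of zero, while "sat", which occurs in a single document, outweighs "cat", which occurs in two.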

5.3 Paragraph Vector Model

A Paragraph Vector model is an unsupervised learning algorithm that learns fixed-size vector representations for variable-length pieces of text such as sentences and documents (Le and Mikolov, 2014). The vector representations are learned to predict the surrounding words in contexts sampled from the paragraph.

Two distinct implementations have gained prominence in the community.

6 A Primer of Neural Net Models for NLP (Goldberg, 2016)

  • Fully connected feed-forward neural networks are non-linear learners that can be used as a drop-in replacement wherever a linear learner is used.

  • The high accuracy observed in experimental results is a consequence of this non-linearity along with the availability of pre-trained word embeddings.

  • Multi-layer feed-forward networks can provide competitive results on sentiment classification and factoid question answering.

  • Convolutional and pooling architectures show promising results on many tasks, including document classification, short-text categorization, sentiment classification, relation type classification between entities, event detection, paraphrase identification, semantic role labeling, question answering, predicting box-office revenues of movies based on critic reviews, modeling text interestingness, and modeling the relation between character-sequences and part-of-speech tags.

  • Convolutional and pooling architectures allow us to encode arbitrarily large items as fixed-size vectors capturing their most salient features, but they do so by sacrificing most of the structural information.

  • Recurrent and recursive networks allow working with sequences and trees while preserving the structural information.

  • Recurrent models have been shown to produce very strong results for language modeling as well as for sequence tagging, machine translation, dependency parsing, sentiment analysis, noisy text normalization, dialog state tracking, response generation, and modeling the relation between character sequences and part-of-speech tags.

  • Recursive models were shown to produce state-of-the-art or near state-of-the-art results for constituency and dependency parse re-ranking, discourse parsing, semantic relation classification, political ideology detection based on parse trees, sentiment classification, target-dependent sentiment classification and question answering.

  • Convolutional nets are observed to work well for summarization related tasks, just as recurrent/recursive nets work well for language modeling tasks.
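The point about convolution and pooling encoding inputs of any length as fixed-size vectors can be illustrated with a small sketch. It assumes a one-dimensional convolution over word vectors, a ReLU non-linearity and max-pooling over time; the filter values and the helper name `conv_max_pool` are hypothetical and chosen only for illustration.

```python
def conv_max_pool(sequence, filters, width):
    """Encode a variable-length sequence of word vectors as one fixed-size vector.

    sequence: list of word vectors (lists of floats), at least `width` long.
    filters:  list of filters, each a flat list of length width * dim.
    Returns one max-pooled activation per filter.
    """
    pooled = []
    for filt in filters:
        best = float("-inf")
        for start in range(len(sequence) - width + 1):
            # Concatenate the word vectors inside the sliding window.
            window = [x for vec in sequence[start:start + width] for x in vec]
            # Filter response followed by a ReLU non-linearity.
            activation = max(0.0, sum(w * x for w, x in zip(filt, window)))
            # Max-pooling over time keeps only the strongest response.
            best = max(best, activation)
        pooled.append(best)
    return pooled


filters = [[1, 0, 0, 1], [0, 1, 1, 0], [-1, -1, -1, -1]]  # toy filters, dim = 2
short_text = [[1, 0], [0, 1], [1, 1]]
long_text = short_text + [[2, 0], [0, 2]]

short_encoding = conv_max_pool(short_text, filters, width=2)
long_encoding = conv_max_pool(long_text, filters, width=2)
```

Both encodings have one entry per filter regardless of input length. The position at which each maximum occurred is discarded, which is the loss of structural information mentioned in the list above.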

7 A Neural Probabilistic Language Model

Goal: Knowing the basic structure of a sentence, one should be able to create a new sentence by replacing parts of the old sentence with interchangeable entities (Bengio et al., 2003).

Challenge: The main bottleneck is computing the activations of the output layer, since it is a fully-connected softmax activation layer.


  • One of the major contributions of this paper in terms of optimizations was data parallel processing (different processors working on different subsets of data) and asynchronous processor usage of shared memory.

  • The authors propose to fight the curse of dimensionality by learning a distributed representation for words which allows each training sentence to inform the model about an exponential number of semantically neighboring sentences.

  • A fundamental problem that makes language modeling and other learning problems difficult is the curse of dimensionality. It is particularly obvious in the case when one wants to model the joint distribution between many discrete random variables (such as words in a sentence, or discrete attributes in a data-mining task).

  • Among n-gram models, state-of-the-art results are typically obtained using trigrams.

  • Language generation via substitution of semantically similar language constructs of existing sentences can be done via shared-parameter multi-layer neural networks.

  • The objective of this paper is to obtain real-valued vector sequences of words and learn a joint probability function for those sequences of words alongside the feature vector, and hence, jointly learn both the real-valued vector representation and the parameters of the probability distribution.

  • This probability function can be tuned in order to maximize the log-likelihood of the training data, while adding a penalty to the cost function, similar to the penalty term used in ridge regression.

  • This will ensure that semantically similar words end up with almost equivalent feature vectors, called learned distributed feature vectors.

  • A challenge with modeling discrete variables like a sentence structure as opposed to a continuous value is that the continuous valued function can be assumed to have some form of locality, but the same assumption cannot be made in case of discrete functions.

  • N-gram models try to achieve a statistical modeling of languages by calculating the conditional probabilities of each possible word that can follow a set of preceding words.

  • New sequences of words can be generated by effectively gluing together the popular combinations i.e. n-grams with very high frequency counts.
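The n-gram baseline mentioned in the last two points can be sketched as a table of conditional probabilities. The snippet below is a minimal maximum-likelihood bigram model, assuming whitespace tokenization and hypothetical sentence boundary markers `<s>` and `</s>`; it is not the neural model of Bengio et al. (2003).

```python
from collections import Counter, defaultdict


def bigram_model(sentences):
    """Estimate P(next word | previous word) from raw bigram counts."""
    context_counts = Counter()
    pair_counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            pair_counts[prev][nxt] += 1
    return {
        prev: {word: count / context_counts[prev] for word, count in nxts.items()}
        for prev, nxts in pair_counts.items()
    }


model = bigram_model(["the cat sat", "the cat ran", "the dog sat"])
p_cat_given_the = model["the"]["cat"]          # 2 of the 3 "the" contexts
p_bird_given_the = model["the"].get("bird", 0.0)  # never observed
```

A word sequence that never occurred in training gets probability zero here, no matter how similar "bird" is to "cat" or "dog". Distributed word representations are meant to address exactly this lack of generalization.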

8 Hierarchical Probabilistic Neural Network Language Model


The paper implements a hierarchical decomposition of the conditional probabilities that yields a speed-up of about 200 times both during training and recognition. The hierarchical decomposition is a binary hierarchical clustering constrained by the prior knowledge extracted from the WordNet (https://wordnet.princeton.edu/) semantic hierarchy (Morin and Bengio, 2005).


  • Similar to the previous paper, this work attempts to tackle the ‘curse of dimensionality’ (Section 7) and to produce a much faster variant.

  • Back-off n-grams are used to learn a real-valued vector representation of each word.

  • The word embeddings learned are shared across all the participating nodes in the distributed architecture.

  • A very important component of the whole model is the choice of the words’ binary encoding, i.e. of the hierarchical word clustering. In this paper the authors combine empirical statistics with prior knowledge from the WordNet resource.
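To make the role of the binary encoding concrete, the following sketch computes a word probability as a product of binary decisions along the word's path in the tree. The four-word tree, the node vectors and the hidden state are all hypothetical values; in the paper the tree is derived from WordNet and corpus statistics, and the node classifiers are learned.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def word_probability(code, node_vectors, hidden):
    """Multiply the binary decision probabilities along a word's tree path.

    code: list of (internal node id, bit) pairs from the root to the leaf.
    """
    prob = 1.0
    for node_id, bit in code:
        score = sum(n * h for n, h in zip(node_vectors[node_id], hidden))
        p_one = sigmoid(score)  # probability of taking the branch labeled 1
        prob *= p_one if bit == 1 else 1.0 - p_one
    return prob


# Hypothetical balanced tree: node 0 is the root, nodes 1 and 2 are its children.
node_vectors = {0: [0.5, -0.2], 1: [0.1, 0.3], 2: [-0.4, 0.8]}
codes = {
    "cat": [(0, 0), (1, 0)],
    "dog": [(0, 0), (1, 1)],
    "car": [(0, 1), (2, 0)],
    "bus": [(0, 1), (2, 1)],
}
hidden = [1.0, 2.0]  # stand-in for the network's hidden state

probs = {word: word_probability(code, node_vectors, hidden)
         for word, code in codes.items()}
```

Each word needs only as many sigmoid evaluations as its code is long (about log2 of the vocabulary size for a balanced tree), and the leaf probabilities still sum to one without an explicit normalization over the whole vocabulary.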

9 A Hierarchical Neural Autoencoder for Paragraphs and Documents

Goal: The paper attempts to build a paragraph embedding from the underlying word and sentence embeddings, and then decodes the paragraph embedding in an attempt to reconstruct the original paragraph (Li et al., 2015).


  • The implementation uses an LSTM layer to convert words into a vector representation of a sentence. A subsequent LSTM layer converts multiple sentences into a paragraph.

  • For this to happen, we need to preserve syntactic, semantic and discourse related properties while creating the embedded representation.

  • A hierarchical LSTM is utilized to preserve sentence structure.

  • Parameters are estimated by maximizing likelihood of outputs given inputs, similar to standard sequence-to-sequence models.

  • Estimates are calculated using softmax functions to maximize the likelihood of the constituent words.

  • Attention models using the hierarchical autoencoder could be utilized for dialog systems, since the autoencoder explicitly models discourse.

10 Linguistic Regularities in Continuous Space Word Representations

Goal: In this paper, the authors examine the vector-space word representations that are implicitly learned by the input-layer weights. These representations are surprisingly good at capturing syntactic and semantic regularities in language, and each relationship is characterized by a relation-specific vector offset. This allows vector-oriented reasoning based on the offsets between words (Mikolov et al., 2013c). This is one of the seminal papers that led to the creation of Word2Vec, which is a state-of-the-art word embedding tool (Mikolov et al., 2013a).


  • A defining feature of neural network language models is their representation of words as high dimensional real-valued vectors.

  • In this model, words are converted via a learned lookup-table into real valued vectors which are used as the inputs to a neural network.

  • One of the main advantages of these models is that the distributed representation achieves a level of generalization that is not possible with classical n-gram language models.

  • The word representations in this paper are learned by a recurrent neural network language model.

  • The input vector $w(t)$ represents the input word at time $t$, encoded using 1-of-N coding, and the output layer $y(t)$ produces a probability distribution over words. The hidden layer $s(t)$ maintains a representation of the sentence history. The input vector $w(t)$ and the output vector $y(t)$ have the dimensionality of the vocabulary.

  • The values in the hidden and output layers are computed as follows:

    $$s(t) = f(U w(t) + W s(t-1))$$

    $$y(t) = g(V s(t))$$

    where $f(z) = \frac{1}{1 + e^{-z}}$ and $g(z_m) = \frac{e^{z_m}}{\sum_k e^{z_k}}$

    Figure 1: RNN Language Model

  • One of the biggest features of having real-valued feature representations is the ability to compute the answer to an analogy question $a : b :: c : d$ where $d$ is unknown. With continuous space word representations, this becomes as simple as calculating

    $$y = x_b - x_a + x_c$$

    where $y$ is the best estimate of $d$ that the model could compute. If there is no vector amongst the trained words such that $x_w = y$, the nearest vector representation can be estimated using cosine similarity:

    $$w^* = \arg\max_w \frac{x_w \cdot y}{\lVert x_w \rVert \, \lVert y \rVert}$$
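The analogy computation can be sketched with toy vectors: the estimate is formed as the vector of b minus the vector of a plus the vector of c, and the vocabulary word with the highest cosine similarity to that estimate is returned. The three-dimensional vectors below are made up for illustration, and excluding the three query words from the candidates is a common convention assumed here rather than something stated in this survey.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def solve_analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' using vector offsets."""
    y = [xb - xa + xc for xa, xb, xc in zip(vectors[a], vectors[b], vectors[c])]
    # Assumption: the query words themselves are not valid answers.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], y))


# Hypothetical embeddings, hand-built so the second dimension acts as a gender offset.
vectors = {
    "man": [1.0, 0.0, 0.2],
    "woman": [1.0, 1.0, 0.2],
    "king": [3.0, 0.0, 1.0],
    "queen": [3.0, 1.0, 1.0],
    "apple": [0.0, 0.2, 3.0],
}
answer = solve_analogy("man", "woman", "king", vectors)
```

With trained embeddings the estimate rarely coincides exactly with a word vector, which is why the nearest-neighbor search is needed.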

11 Better Word Representations with Recursive Neural Networks for Morphology

Goal: The paper aims to address the inaccuracy in vector representations of complex and rare words, supposedly caused by the lack of relation between morphologically related words (Luong et al., 2013).


  • The authors treat each morpheme as a basic unit in the RNNs and construct representations for morphologically complex words on the fly from their morphemes. By training a neural language model (NLM) and integrating RNN structures for complex words, they utilize contextual information to learn morphemic semantics and their compositional properties.

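  • The paper discusses the problem that Word2Vec syntactic relations like

    $$x_{apple} - x_{apples} \approx x_{car} - x_{cars}$$

    might not hold true if the vector representation of a rare word is inaccurate to begin with.

  • morphoRNN operates at the morpheme level rather than the word level. An example of this is illustrated in Figure 2.

    Figure 2: morphoRNN

  • Parent words are created by combining a stem vector and an affix vector, as shown in Equation 1.

    $$p = f(W_m [x_{stem}; x_{affix}] + b_m) \qquad (1)$$

    where $W_m$ is a weight matrix, $b_m$ is a bias vector and $f$ is an element-wise activation function.

  • The cost function is expressed in terms of the squared Euclidean loss between the newly constructed representation $p_c(x_i)$ and the reference representation $p_r(x_i)$. The cost function is given in Equation 2.

    $$s(x_i) = \lVert p_c(x_i) - p_r(x_i) \rVert_2^2 \qquad (2)$$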

  • The paper describes both context sensitive and insensitive versions of the Morphological RNN.

  • Similar to a typical RNN, the network is trained by computing the activation functions and propagating the errors backward in a forward-backward pass architecture.

  • This RNN model performs better than most of the other neural language models, and could be used to supplement word vectors.
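The composition and loss used by the morphological model can be sketched as follows: a parent vector is computed as p = f(W_m [x_stem; x_affix] + b_m), and the cost is the squared Euclidean distance to a reference vector. The weights, the two-dimensional vectors and the choice of tanh for f are assumptions made for this illustration, and the helper names are hypothetical.

```python
import math


def compose(stem, affix, weights, bias):
    """Build a parent word vector from a stem vector and an affix vector."""
    concat = stem + affix  # [x_stem; x_affix]
    return [
        math.tanh(sum(w * x for w, x in zip(row, concat)) + b)  # assumption: f = tanh
        for row, b in zip(weights, bias)
    ]


def squared_euclidean(constructed, reference):
    """Squared Euclidean loss between constructed and reference representations."""
    return sum((c - r) ** 2 for c, r in zip(constructed, reference))


# Hypothetical parameters: vectors of size 2, so the weight matrix is 2 x 4.
weights = [[0.5, 0.0, 0.5, 0.0],
           [0.0, 0.5, 0.0, 0.5]]
bias = [0.0, 0.0]

stem = [0.2, 0.4]     # e.g. a vector for the stem "fortunate"
affix = [0.6, -0.4]   # e.g. a vector for the prefix "un"
parent = compose(stem, affix, weights, bias)

reference = [0.5, 0.1]  # stand-in for an existing embedding of the full word
loss = squared_euclidean(parent, reference)
```

Because the parent has the same dimensionality as its parts, it can itself be combined with another affix, which is what lets the network build representations for words with several morphemes.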

12 Efficient Estimation of Word Representations in Vector Space

Goal: The main goal of this paper is to introduce techniques that can be used for learning high-quality word vectors from huge data sets with billions of words, and with millions of words in the vocabulary (Mikolov et al., 2013a).

Challenge: The complexity that arises at the fully-connected output layer of the neural network is the dominant part of the computation. Two methods suggested to mitigate this are to use hierarchical versions of the softmax output activation units, or to refrain from performing normalization at the final layer altogether.


  • The ideas presented in this paper build on the previous ideas presented by Bengio et al. (2003).

  • The objective was to obtain high-quality word embeddings that capture the syntactic and semantic characteristics of words in a manner that allows algebraic operations to proxy the distances in vector space.


  • The training time here scales with the dimensionality of the learned feature vectors rather than with the volume of training data.

  • The approach attempts to find a distributed vector representation of values as opposed to a continuous representation of values as computed by methods like LSA and LDA.

  • The models are trained using stochastic gradient descent and backpropagation.

  • The RNN models are touted to have an inherently better representation of sentence structure for complex patterns, without the need to specify context length.

  • To allow for the distributed training of the data, the framework DistBelief was used with multiple replicas of the model. Adagrad was utilized for asynchronous gradient descent.

  • Two distinct models were conceptualized for the training of the word vectors based on context, both of which are continuous and distributed representations of words. These are illustrated in Figure 3.

    Figure 3: CBOW and Skip-gram models
    • Continuous Bag-of-Words model: This model uses the context of a word i.e. the words that precede and follow it, to predict the current word.

    • Skip-gram model: This model uses the current word to predict the context it appeared in.
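The difference between the two models is easiest to see in the training examples each one derives from the same text. The sketch below only generates those examples and does not train anything; the symmetric window and the helper names are assumptions for illustration.

```python
def context_of(tokens, position, window):
    """Words within `window` positions before and after the given position."""
    left = tokens[max(0, position - window):position]
    right = tokens[position + 1:position + 1 + window]
    return left + right


def cbow_examples(tokens, window):
    """CBOW: (context words) -> current word."""
    return [(context_of(tokens, i, window), tokens[i]) for i in range(len(tokens))]


def skipgram_examples(tokens, window):
    """Skip-gram: current word -> each context word, one pair at a time."""
    return [
        (tokens[i], ctx)
        for i in range(len(tokens))
        for ctx in context_of(tokens, i, window)
    ]


tokens = "the quick brown fox".split()
cbow = cbow_examples(tokens, window=1)
skip = skipgram_examples(tokens, window=1)
```

For the word "quick", CBOW produces a single example that predicts "quick" from the context ["the", "brown"], whereas skip-gram produces two examples that predict "the" and "brown" from "quick".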

The experimental results show that the CBOW and skip-gram models consistently out-perform the then state-of-the-art models. It was also observed that after a point, increasing the dimensions and the size of the data began providing diminishing returns.

13 Distributed Representations of Words and Phrases and their Compositionality

Goal: This paper builds upon the idea of the Word2Vec skip-gram model, and presents optimizations in terms of quality of the word embeddings as well as speed-ups while training. It also proposes an alternative to the hierarchical softmax final layer, called negative sampling (Mikolov et al., 2013b).


  • One of the optimizations suggested is to sub-sample the training set words to achieve a speed-up in model training.

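  • Given a sequence of training words $w_1, w_2, \ldots, w_T$, the objective of the skip-gram model is to maximize the average log probability shown in Equation 3

    $$\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\, j \ne 0} \log p(w_{t+j} \mid w_t) \qquad (3)$$

    where $c$ is the size of the window or context surrounding the current word $w_t$ being trained on.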

  • As introduced by Morin and Bengio (2005), a computationally efficient approximation of the full softmax is the hierarchical softmax. The hierarchical softmax uses a binary tree representation of the output layer with the W words as its leaves and, for each node, explicitly represents the relative probabilities of its child nodes. These define a random walk that assigns probabilities to words.

  • The authors use a binary Huffman tree, as it assigns short codes to the frequent words which results in fast training. It has been observed before that grouping words together by their frequency works well as a very simple speedup technique for the neural network based language models.

  • Noise Contrastive Estimation (NCE), which is an alternative to hierarchical softmax, posits that a good model should be able to differentiate data from noise by means of logistic regression.

  • To counter the imbalance between the rare and frequent words, the authors use a simple sub-sampling approach: each word $w_i$ within the training set is discarded with a probability computed by the formula below.

    $$P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}}$$

    where $f(w_i)$ is the frequency of word $w_i$ and $t$ is a chosen threshold. This is similar to a dropout of neurons from the network, except that it is statistically more likely that frequent words are removed from the corpus by virtue of this method.

  • Discarding the frequently occurring words allows for a reduction in computational and memory cost.

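  • The individual words can easily be coalesced into phrases using unigram and bigram frequency counts, as shown below.

    $$\text{score}(w_i, w_j) = \frac{\text{count}(w_i w_j) - \delta}{\text{count}(w_i) \times \text{count}(w_j)}$$

    where $\delta$ is a discounting coefficient that prevents phrases from being formed out of very infrequent words.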

  • Another interesting property of learning these distributed representations is that the word and phrase representations learned by the skip-gram model exhibit a linear structure that makes it possible to perform precise analogical reasoning using simple vector arithmetic.
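Two of the heuristics from Mikolov et al. (2013b) are simple enough to sketch directly: the sub-sampling rule, which discards word w_i with probability 1 - sqrt(t / f(w_i)), and the phrase score (count(w_i w_j) - delta) / (count(w_i) * count(w_j)). The default threshold of 1e-5, the clamping at zero and the example counts are assumptions made for this illustration.

```python
import math


def discard_probability(word_frequency, threshold=1e-5):
    """Probability of dropping one occurrence of a word during sub-sampling.

    word_frequency: relative frequency f(w) of the word in the corpus.
    threshold:      the parameter t (assumed default, tune per corpus).
    """
    # Words rarer than the threshold would give a negative value, so clamp at 0.
    return max(0.0, 1.0 - math.sqrt(threshold / word_frequency))


def phrase_score(pair_count, first_count, second_count, delta):
    """Score a candidate bigram; higher scores suggest a genuine phrase."""
    return (pair_count - delta) / (first_count * second_count)


p_frequent = discard_probability(0.01)     # a very common word
p_moderate = discard_probability(0.0001)   # a moderately common word
p_rare = discard_probability(0.00001)      # exactly at the threshold

score = phrase_score(pair_count=50, first_count=100, second_count=100, delta=5)
```

The more frequent a word is, the more aggressively it is discarded, while words at or below the threshold are always kept. For phrase detection, bigrams whose score exceeds a chosen cut-off are merged into a single token.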

14 Glove: Global Vectors for Word Representation

Goal: This paper proposes a global log-bilinear regression model that combines the advantages of the two major model families in the literature: global matrix factorization and local context window methods (Pennington et al., 2014).


  • While methods like LSA efficiently leverage statistical information, they do relatively poorly on word analogy tasks, indicating a sub-optimal vector space structure. Methods like skip-gram may do better on analogy tasks, but they poorly utilize the statistics of the corpus since they train on separate local context windows instead of on global co-occurrence counts.

  • The relationship between any arbitrary words can be examined by studying the ratio of their co-occurrence probabilities with various probe words.

  • The authors suggest that the appropriate starting point for word vector learning should be with ratios of co-occurrence probabilities rather than the probabilities themselves.

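  • We can express this co-occurrence relation as shown below

    $$w_i^T \tilde{w}_k + b_i + \tilde{b}_k = \log(X_{ik})$$

    where $X_{ik}$ is the number of times word $k$ occurs in the context of word $i$, $w_i$ and $\tilde{w}_k$ are the word and context word vectors, and $b_i$ and $\tilde{b}_k$ are their biases. This makes the feature matrix interchangeable with its transpose.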

  • An additive shift is included in the logarithm,

    $$\log(X_{ik}) \rightarrow \log(1 + X_{ik})$$

    which maintains the sparsity of $X$ while avoiding the divergences while computing the co-occurrence matrix.

  • The model obtained in the paper could be compared to a global skip-gram model as opposed to a fixed window-size skip-gram model as proposed by Mikolov et al. (2013a).

  • The performance seems to increase monotonically with an increase in training data.
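The idea of studying ratios of co-occurrence probabilities with probe words can be sketched on a toy corpus. The two-word "sentences" below are fabricated purely so that the counts are easy to verify by hand; they mirror the ice/steam example often used to explain GloVe, and the helper names are hypothetical.

```python
from collections import defaultdict


def cooccurrence_counts(sentences, window):
    """Count how often each word appears within `window` positions of another."""
    counts = defaultdict(lambda: defaultdict(float))
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    counts[word][tokens[j]] += 1.0
    return counts


def cooccurrence_probability(counts, word, probe):
    """P(probe | word): share of the word's co-occurrences that involve the probe."""
    total = sum(counts[word].values())
    return counts[word][probe] / total if total else 0.0


def probability_ratio(counts, word_i, word_j, probe):
    """Ratio P(probe | word_i) / P(probe | word_j)."""
    p_i = cooccurrence_probability(counts, word_i, probe)
    p_j = cooccurrence_probability(counts, word_j, probe)
    return p_i / p_j if p_j else float("inf")


sentences = ["ice solid"] * 3 + ["ice gas"] + ["steam gas"] * 3 + ["steam solid"]
counts = cooccurrence_counts(sentences, window=1)

ratio_solid = probability_ratio(counts, "ice", "steam", "solid")
ratio_gas = probability_ratio(counts, "ice", "steam", "gas")
```

A ratio well above one ("solid") marks a probe word related to "ice" but not "steam", a ratio well below one ("gas") marks the opposite, and a ratio near one would indicate a probe that is related to both or to neither.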

15 Discussion

Following the literature survey, this section re-visits the original research questions and provides a succinct summary that can be inferred from the experimental results and conclusions drawn from the original papers.

  • What are the relatively simple statistical techniques to extract features from text?
    Word count frequency models like n-gram and simple bag-of-words models such as TF-IDF are still the easiest tools to obtain a numeric vector representation of text.

  • Is there any inherent benefit to using neural networks as opposed to the simple methods?
    The benefit of using neural nets primarily is their ability to identify obscure patterns, and remain flexible enough for a varied set of application areas from topic classification to syntax parse-tree generation.

  • What are the trade-offs that neural networks incur as opposed to the simple methods?
    The trade-offs are typically expressed in terms of computational cost and memory usage, although model complexity is a factor too, given that neural nets can be trained to learn arbitrarily complex generative models.

  • How do the different techniques compare to each other in terms of performance and accuracy?

    This question can only be answered subjectively as it varies from application to application. Typically, document similarity can be tackled with a simple statistical approach like TF-IDF. CNNs inherently model input data in a manner that iteratively reduces the dimensionality, making them a great fit for topic classification and document summarization. RNNs are great at modeling sequences of text, which makes them apt for language syntax modeling. Amongst the frameworks, GloVe’s pre-trained word embeddings perform better than vanilla Word2Vec, which is considered state-of-the-art.

  • In what use-cases do the trade-offs outweigh the benefits of neural networks?
    As explained for the previous question, for a simple information retrieval use case such as document ranking, models such as TF-IDF and word PMI (pointwise mutual information) are sufficient, and neural networks would be overkill.

16 Conclusion

This paper has summarized the important aspects of the state-of-the-art neural network techniques that have emerged in recent years. The fields of machine translation, natural language understanding and natural language generation are important areas of research when it comes to developing a range of applications, from a simple chatbot to the conceptualization of a general AI entity.

The discussion section aggregates the results of the surveyed papers and offers a ready reference for new-comers to the field.

For future work, it is intended to experimentally compare different word-embedding approaches to act as a bootstrapping method to iteratively build high quality datasets for future machine learning model usage.

17 Acknowledgments

The author would like to thank Dr. Pascal Poupart for his constructive feedback on the survey proposal.