Vector representations of text data in deep learning

01/07/2019
by Karol Grzegorczyk, et al.

In this dissertation we report the results of our research on dense distributed representations of text data. We propose two novel neural models for learning such representations: the first learns representations at the document level, while the second learns word-level representations.

For document-level representations we propose Binary Paragraph Vector: a neural network model for learning binary representations of text documents, which can be used for fast document retrieval. We provide a thorough evaluation of these models and demonstrate that they outperform the seminal method in the field on the information retrieval task. We also report strong results in transfer learning settings, where our models are trained on a generic text corpus and then used to infer codes for documents from a domain-specific dataset. In contrast to previously proposed approaches, Binary Paragraph Vector models learn embeddings directly from raw text data.

For word-level representations we propose Disambiguated Skip-gram: a neural network model for learning multi-sense word embeddings. Representations learned by this model can be used in downstream tasks, such as part-of-speech tagging or identification of semantic relations. In the word sense induction task, Disambiguated Skip-gram outperforms state-of-the-art models on three out of four benchmark datasets. Our model has an elegant probabilistic interpretation. Furthermore, unlike previous models of this kind, it is differentiable with respect to all its parameters and can be trained with backpropagation. In addition to quantitative results, we present a qualitative evaluation of Disambiguated Skip-gram, including two-dimensional visualisations of selected word-sense embeddings.
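The fast-retrieval claim rests on a standard property of binary codes: once each document is mapped to a short bit string, nearest neighbours can be found with cheap bitwise operations. The following minimal sketch illustrates that retrieval step only; the codes here are hard-coded stand-ins, not output of a trained Binary Paragraph Vector model, and the document ids are hypothetical.

```python
# Hypothetical illustration: retrieving documents by Hamming distance
# over short binary codes. In Binary Paragraph Vector models the codes
# would be inferred by a trained network; here they are toy constants.

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two binary codes."""
    return bin(a ^ b).count("1")

def nearest(query: int, index: dict) -> str:
    """Return the document id whose code is closest to the query code."""
    return min(index, key=lambda doc_id: hamming(query, index[doc_id]))

# Toy index: document id -> 8-bit binary code (stand-in values).
index = {
    "doc_a": 0b10110010,
    "doc_b": 0b10110011,
    "doc_c": 0b01001100,
}

query_code = 0b10110110
print(nearest(query_code, index))  # prints "doc_a" (one differing bit)
```

Because the distance computation is a single XOR plus a popcount, scanning even large indexes of such codes is far cheaper than comparing dense real-valued vectors, which is what makes binary document codes attractive for retrieval.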


Related research:

- Binary Paragraph Vectors (11/03/2016): Recently Le & Mikolov described two log-linear models, called Paragraph ...
- KeyVec: Key-semantics Preserving Document Representations (09/27/2017): Previous studies have demonstrated the empirical success of word embeddi...
- Neural Embedding Allocation: Distributed Representations of Topic Models (09/10/2019): Word embedding models such as the skip-gram learn vector representations...
- LTSG: Latent Topical Skip-Gram for Mutually Learning Topic Model and Vector Representations (02/23/2017): Topic models have been widely used in discovering latent topics which ar...
- KATE: K-Competitive Autoencoder for Text (05/04/2017): Autoencoders have been successful in learning meaningful representations...
- Toward Incorporation of Relevant Documents in word2vec (07/20/2017): Recent advances in neural word embedding provide significant benefit to ...
- Improving Document Classification with Multi-Sense Embeddings (11/18/2019): Efficient representation of text documents is an important building bloc...
