exploring Quora duplicate question classification task with Gensim
Recently, Le and Mikolov (2014) proposed doc2vec as an extension to word2vec (Mikolov et al., 2013a) to learn document-level embeddings. Despite promising results in the original paper, others have struggled to reproduce those results. This paper presents a rigorous empirical evaluation of doc2vec over two tasks. We compare doc2vec to two baselines and two state-of-the-art document embedding methodologies. We found that doc2vec performs robustly when using models trained on large external corpora, and can be further improved by using pre-trained word embeddings. We also provide recommendations on hyper-parameter settings for general-purpose applications, and release source code to induce document embeddings using our trained doc2vec models.
Neural embeddings were first proposed by Bengio et al. (2003), in the form of a feed-forward neural network language model. Modern methods use a simpler and more efficient neural architecture to learn word vectors (word2vec: Mikolov et al., 2013c; GloVe: Pennington et al., 2014), based on objective functions that are designed specifically to produce high-quality vectors.
Neural embeddings learnt by these methods have been applied in a myriad of NLP applications, including initialising neural network models for visual object recognition [Frome et al. 2013] or machine translation [Zhang et al. 2014, Li et al. 2014], as well as directly modelling word-to-word relationships [Mikolov et al. 2013a, Zhao et al. 2015, Salehi et al. 2015, Vylomova et al. to appear].
Paragraph vectors, or doc2vec, were proposed by Le and Mikolov (2014) as a simple extension to word2vec to extend the learning of embeddings from words to word sequences. (The term doc2vec was popularised by Gensim [Řehůřek and Sojka 2010], a widely-used implementation of paragraph vectors: https://radimrehurek.com/gensim/) doc2vec is agnostic to the granularity of the word sequence: it can equally be a word n-gram, sentence, paragraph or document. In this paper, we use the term “document embedding” to refer to the embedding of a word sequence, irrespective of its granularity.
doc2vec was proposed in two forms: dbow and dmpv. dbow is a simpler model and ignores word order, while dmpv is a more complex model with more parameters (see Section 2 for details). Although Le and Mikolov (2014) found that as a standalone method dmpv is a better model, others have reported contradictory results. (The authors of Gensim found that dbow outperforms dmpv: https://github.com/piskvorky/gensim/blob/develop/docs/notebooks/doc2vec-IMDB.ipynb) doc2vec has also been reported to produce sub-par performance compared to vector averaging methods, based on informal experiments (https://groups.google.com/forum/#!topic/gensim/bEskaT45fXQ).
Additionally, while Le and Mikolov (2014) report state-of-the-art results over a sentiment analysis task using doc2vec, others (including the second author of the original paper in follow-up work) have struggled to replicate this result. (For a detailed discussion on replicating the results of Le and Mikolov (2014), see: https://groups.google.com/forum/#!topic/word2vec-toolkit/Q49FIrNOQRo)
Given this background of uncertainty regarding the true effectiveness of doc2vec and confusion about performance differences between dbow and dmpv, we aim to shed light on a number of empirical questions: (1) how effective is doc2vec in different task settings?; (2) which is better out of dmpv and dbow?; (3) is it possible to improve doc2vec through careful hyper-parameter optimisation or with pre-trained word embeddings?; and (4) can doc2vec be used as an off-the-shelf model like word2vec? To this end, we present a formal and rigorous evaluation of doc2vec over two extrinsic tasks. Our findings reveal that dbow, despite being the simpler model, is superior to dmpv. When trained over large external corpora, with pre-trained word embeddings and hyper-parameter tuning, we find that doc2vec performs very strongly compared to both a simple word embedding averaging baseline and an n-gram baseline, as well as two state-of-the-art document embedding approaches, and that doc2vec performs particularly strongly over longer documents. We additionally release source code for replicating our experiments, and for inducing document embeddings using our trained models.
word2vec was proposed as an efficient neural approach to learning high-quality embeddings for words [Mikolov et al.2013a]. Negative sampling was subsequently introduced as an alternative to the more complex hierarchical softmax step at the output layer, with the authors finding that not only is it more efficient, but actually produces better word vectors on average [Mikolov et al.2013b].
The objective function of word2vec is to maximise the log probability of a context word (w_O) given its input word (w_I), i.e. log P(w_O | w_I). With negative sampling, the objective is to maximise the dot product of w_I and w_O while minimising the dot product of w_I and randomly sampled “negative” words. Formally, log P(w_O | w_I) is given as follows:

log σ(v′_{w_O}ᵀ v_{w_I}) + Σ_{i=1}^{k} E_{w_i ∼ P_n(w)} [log σ(−v′_{w_i}ᵀ v_{w_I})]

where σ is the sigmoid function, k is the number of negative samples, P_n(w) is the noise distribution, v_w is the vector of word w, and v′_w is the negative sample vector of word w.
There are two approaches within word2vec: skip-gram (“sg”) and cbow. In skip-gram, the input is a word (i.e. w_I is the vector of a single word) and the output is a context word. For each input word, the number of left or right context words to predict is defined by the window size hyper-parameter. cbow is different to skip-gram in one aspect: the input consists of multiple words that are combined via vector addition to predict the context word (i.e. w_I is a summed vector of several words).
doc2vec is an extension to word2vec for learning document embeddings [Le and Mikolov2014]. There are two approaches within doc2vec: dbow and dmpv.
dbow works in the same way as skip-gram, except that the input is replaced by a special token representing the document (i.e. w_I is a vector representing the document). In this architecture, the order of words in the document is ignored; hence the name distributed bag of words.
dmpv works in a similar way to cbow. For the input, dmpv introduces an additional document token alongside multiple target words. Unlike cbow, however, these vectors are not summed but concatenated (i.e. w_I is a concatenated vector containing the document token and several target words). The objective is again to predict a context word given the concatenated document and word vectors.
More recently, Kiros+:2015 proposed skip-thought as a means of learning document embeddings. skip-thought
vectors are inspired by abstracting the distributional hypothesis from the word level to the sentence level. Using an encoder-decoder neural network architecture, the encoder learns a dense vector representation of a sentence, and the decoder takes this encoding and decodes it by predicting words of the next (or previous) sentence. Both the encoder and decoder use a gated recurrent neural network language model. Evaluating over a range of tasks, the authors found that skip-thought vectors perform very well against state-of-the-art task-optimised methods.
Wieting et al. (2016) proposed a more direct way of learning document embeddings, based on a large-scale training set of paraphrase pairs from the Paraphrase Database (ppdb: Ganitkevitch et al., 2013). Given a paraphrase pair, word embeddings and a method to compose the word embeddings for a sentence embedding, the objective function of the neural network model is to optimise the word embeddings such that the cosine similarity of the sentence embeddings for the pair is maximised. The authors explore several methods of combining word embeddings, and found that simple averaging produces the best performance.
We evaluate doc2vec in two task settings, specifically chosen to highlight the impact of document length on model performance.
For all tasks, we split the dataset into 2 partitions: development and test. The development set is used to optimise the hyper-parameters of doc2vec, and results are reported on the test set. We use all documents in the development and test set (and potentially more background documents, where explicitly mentioned) to train doc2vec. Our rationale for this is that the doc2vec training is completely unsupervised, i.e. the model takes only raw text and uses no supervised or annotated information, and thus there is no need to hold out the test data, as it is unlabelled. We ultimately relax this assumption in the next section (Section 4), when we train doc2vec using large external corpora.
After training doc2vec, document embeddings are generated by the model. For the word2vec baseline, we compute a document embedding by taking the component-wise mean of the embeddings of its words. We experiment with both variants of doc2vec (dbow and dmpv) and word2vec (skip-gram and cbow) for all tasks.
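The averaging baseline can be sketched in a few lines of plain Python (the function name and the tiny placeholder vectors are ours; in practice the vectors would come from a trained word2vec model):

```python
def average_embedding(tokens, word_vectors):
    """Component-wise mean of the vectors of the tokens found in the model."""
    vectors = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vectors:
        return None  # no known words, so no document embedding
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

# Placeholder 3-dimensional "embeddings" for illustration only.
word_vectors = {"cat": [1.0, 0.0, 2.0], "sat": [3.0, 2.0, 0.0]}
doc = average_embedding(["the", "cat", "sat"], word_vectors)
print(doc)  # [2.0, 1.0, 1.0]
```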
In addition to word2vec, we experiment with another baseline model that converts a document into a distribution over words via maximum likelihood estimation, and compute pairwise document similarity using the Jensen-Shannon divergence. (We multiply the divergence value by −1.0 to invert it, so that a higher value indicates greater similarity.) For word types we explore n-grams of order 1 to 3 and find that a combination of unigrams, bigrams and trigrams achieves the best results; that is, the probability distribution is computed over the union of unigrams, bigrams and trigrams in the paired documents. Henceforth, this second baseline will be referred to as ngram.
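A minimal self-contained sketch of this ngram baseline (the function names are ours):

```python
import math
from collections import Counter

def ngrams(tokens, max_n=3):
    """All n-grams of order 1 to max_n, as tuples."""
    return [tuple(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

def js_similarity(doc1, doc2, max_n=3):
    """Negated Jensen-Shannon divergence between the n-gram distributions
    of two tokenised documents (higher means more similar)."""
    p, q = Counter(ngrams(doc1, max_n)), Counter(ngrams(doc2, max_n))
    types = set(p) | set(q)  # union of n-gram types in the paired documents
    def prob(counts):
        total = sum(counts.values())
        return {t: counts[t] / total for t in types}
    P, Q = prob(p), prob(q)
    M = {t: 0.5 * (P[t] + Q[t]) for t in types}
    def kl(a, b):
        return sum(a[t] * math.log2(a[t] / b[t]) for t in types if a[t] > 0)
    return -(0.5 * kl(P, M) + 0.5 * kl(Q, M))

# Identical documents score 0 (maximum similarity); disjoint documents score -1.
print(js_similarity(["the", "cat", "sat"], ["the", "cat", "sat"]))
```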
We first evaluate doc2vec
over the task of duplicate question detection in a web forum setting, using the dataset of Hoogeveen et al. (2015). The dataset has 12 subforums extracted from StackExchange, and provides training and test splits in two experimental settings: retrieval and classification. We use the classification setting, where the goal is to classify whether a given question pair is a duplicate.
The dataset is separated into the 12 subforums, with a pre-compiled training–test split per subforum; the total number of instances (question pairs) ranges from 50M to 1B pairs for the training partitions, and 30M to 300M pairs for the test partitions, depending on the subforum. The proportion of true duplicate pairs is very small in each subforum, but the setup is intended to respect the distribution of true duplicate pairs in a real-world setting.
We sub-sample the test partition to create a smaller test partition that has 10M document pairs. (Uniform random sampling is used so as to respect the original distribution.) On average across all twelve subforums, there are 22 true positive pairs per 10M question pairs. We also create a smaller development partition from the training partition by randomly selecting 300 positive and 3000 negative pairs. We optimise the hyper-parameters of doc2vec and word2vec using the development partition of the tex subforum, and apply the same hyper-parameter settings to all subforums when evaluating over the test pairs. We use both the question title and body as document content: on average the test document length is approximately 130 words. We use the default tokenised and lowercased words given by the dataset. All test, development and un-sampled documents are pooled together during model training, and each subforum is trained separately.
We compute cosine similarity between documents using the vectors produced by doc2vec and word2vec
to score a document pair. We then sort the document pairs in descending order of similarity score, and evaluate using the area under the curve (AUC) of the receiver operating characteristic (ROC) curve. The ROC curve tracks the true positive rate against the false positive rate at each point of the ranking, and as such works well for heavily-skewed datasets. An AUC score of 1.0 implies that all true positive pairs are ranked before true negative pairs, while an AUC score of 0.5 indicates a random ranking. We present the full results for each subforum in Table 1.
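The scoring and evaluation step can be sketched as follows (a self-contained AUC computation with toy vectors of our own; in practice a library implementation such as scikit-learn's roc_auc_score would typically be used):

```python
import math

def cosine(u, v):
    """Cosine similarity between two document vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def roc_auc(scores, labels):
    """AUC = probability that a random positive pair is scored above a
    random negative pair (ties count as half a win)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy document-vector pairs; label 1 = duplicate, 0 = non-duplicate.
pairs = [(([1.0, 0.0], [0.9, 0.1]), 1),
         (([1.0, 0.0], [0.0, 1.0]), 0),
         (([0.0, 1.0], [0.1, 1.0]), 1)]
scores = [cosine(u, v) for (u, v), _ in pairs]
labels = [l for _, l in pairs]
print(roc_auc(scores, labels))  # 1.0: both duplicates outrank the non-duplicate
```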
Comparing doc2vec and word2vec to ngram, both embedding methods perform substantially better in most domains, with two exceptions (english and gis), where ngram has comparable performance.
doc2vec outperforms word2vec embeddings in all subforums except for gis. Despite the skewed distribution, simple cosine similarity based on doc2vec embeddings is able to detect these duplicate document pairs with a high degree of accuracy. dbow performs better than or as well as dmpv in 9 out of the 12 subforums, showing that the simpler dbow is superior to dmpv.
One interesting exception is the english subforum, where dmpv is substantially better, and ngram (which uses only surface word forms) also performs very well. We hypothesise that the order and surface form of words play a stronger role in this subforum, as questions are often about grammar problems, and as such the position and semantics of words are less predictable (e.g. Where does “for the same” come from?).
Table 3: Description of doc2vec hyper-parameters.

Vector Size      Dimension of word vectors
Window Size      Left/right context window size
Min Count        Minimum frequency threshold for word types
Sub-sampling     Threshold to downsample high-frequency words
Negative Sample  Number of negative word samples
Epoch            Number of training epochs
The Semantic Textual Similarity (STS) task is a shared task held as part of *SEM and SemEval over a number of iterations [Agirre et al. 2013, Agirre et al. 2014, Agirre et al. 2015]. In STS, the goal is to automatically predict the similarity of a pair of sentences as a score in the range [0, 5], where 0 indicates no similarity whatsoever and 5 indicates semantic equivalence.
The top systems utilise word alignment, and further optimise their scores using supervised learning [Agirre et al. 2015]. Word embeddings are employed, although sentence embeddings are often taken as the average of word embeddings (e.g. Sultan et al., 2015).
We evaluate doc2vec and word2vec embeddings over the English STS sub-task of SemEval-2015 [Agirre et al. 2015]. The dataset has 5 domains, and each domain has 375–750 annotated pairs. Sentences are much shorter than the documents in our previous task, at an average of only 13 words per test sentence.
As the dataset is also much smaller, we combine sentences from all 5 domains and also sentences from previous years (2012–2014) to form the training data. We use the headlines domain from 2014 as development, and test on all 2015 domains. For pre-processing, we tokenise and lowercase the words using Stanford CoreNLP [Manning et al.2014].
As a benchmark, we include results from the overall top-performing system in the competition, referred to as “DLS” [Sultan et al.2015]. Note, however, that this system is supervised and highly customised to the task, whereas our methods are completely unsupervised. Results are presented in Table 2.
Unsurprisingly, we do not exceed the overall performance of the supervised benchmark system DLS, although doc2vec outperforms DLS over the domain of belief. ngram performs substantially worse than all methods (with an exception in ans-students where it outperforms dmpv and cbow).
Comparing doc2vec and word2vec, doc2vec performs better. However, the performance gap is smaller than in the previous task, suggesting that the benefit of using doc2vec is diminished for shorter documents. Comparing dbow and dmpv, the difference is marginal, although dbow as a whole is slightly stronger, consistent with the observations of the previous task.
Across the two tasks, we found that the optimal settings for the hyper-parameters described in Table 3 are fairly consistent for dbow and dmpv, as detailed in Table 4 (task abbreviations: Q-Dup = forum question duplication (Section 3.1); STS = semantic textual similarity (Section 3.2)). Note that we did not tune the initial and minimum learning rates (α and α_min, respectively), and use the same fixed values for all experiments. The learning rate decreases linearly per epoch from the initial rate to the minimum rate.
In general, dbow favours longer windows for context words than dmpv. Possibly the most important hyper-parameter is the sub-sampling threshold for high-frequency words: in our experiments we find that task performance dips considerably when a sub-optimal value is used. dmpv also requires more training epochs than dbow; as a rule of thumb, for dmpv to reach convergence, the number of epochs needs to be one order of magnitude larger than for dbow. Given that dmpv has more parameters in the model, this is perhaps not a surprising finding.
In Section 3, all tasks were trained using small in-domain document collections. doc2vec is designed to scale to large data, and we explore the effectiveness of doc2vec by training it on large external corpora in this section.
We experiment with two external corpora: (1) wiki, the full collection of English Wikipedia (using the dump dated 2015-12-01, cleaned using WikiExtractor: https://github.com/attardi/wikiextractor); and (2) ap-news, a collection of Associated Press English news articles from 2009 to 2015. We tokenise and lowercase the documents using Stanford CoreNLP [Manning et al. 2014], and treat each natural paragraph of an article as a document for doc2vec. After pre-processing, we have approximately 35M documents and 2B tokens for wiki, and 25M documents and 0.9B tokens for ap-news. Seeing that dbow trains faster and is a better model than dmpv in Section 3, we experiment with only dbow here. (We use the following hyper-parameter values for wiki (ap-news): vector size 300 (300), window size 15 (15), min count 20 (10), sub-sampling threshold 10⁻⁵ (10⁻⁵), negative sample 5, epoch 20 (30). After removing low-frequency words, the vocabulary size is approximately 670K for wiki and 300K for ap-news.)
To test if doc2vec can be used as an off-the-shelf model, we take a pre-trained model and infer an embedding for a new document without updating the hidden-layer word weights. (That is, the test data is held out and not included as part of doc2vec training.) We have three hyper-parameters for test inference: initial learning rate (α), minimum learning rate (α_min), and number of inference epochs. We optimise these parameters using the development partitions in each task; in general a small initial α (= 0.01) with low α_min (= 0.0001) and a large number of epochs (= 500–1000) works well.
For word2vec, we train skip-gram on the same corpora. (Hyper-parameter values for wiki (ap-news): vector size 300 (300), window size 5 (5), min count 20 (10), sub-sampling threshold 10⁻⁵ (10⁻⁵), negative sample 5, epoch 100 (150).) We also include the word vectors trained on the larger Google News corpus by Mikolov et al. (2013c), which has 100B words (https://code.google.com/archive/p/word2vec/). The Google News skip-gram vectors will henceforth be referred to as gl-news.
dbow, skip-gram and ngram results for both tasks are presented in Table 5. Between the baselines ngram and skip-gram, ngram appears to do better over Q-Dup, while skip-gram works better over STS.
As before, doc2vec outperforms word2vec and ngram across almost all tasks. For tasks with longer documents (Q-Dup), the performance gap between doc2vec and word2vec is more pronounced, while for STS, which has shorter documents, the gap is smaller. In some STS domains (e.g. ans-students) word2vec performs just as well as doc2vec. Interestingly, we see that gl-news word2vec embeddings perform worse than our wiki and ap-news word2vec embeddings, even though the Google News corpus is orders of magnitude larger.
Comparing these doc2vec results with the in-domain results (Tables 1 and 2), the performance is in general lower. As a whole, the performance difference between the dbow models trained using wiki and ap-news is not very large, indicating the robustness of these large external corpora for general-purpose applications. To facilitate applications using off-the-shelf doc2vec models, we have publicly released code and trained models to induce document embeddings using the wiki and ap-news dbow models (https://github.com/jhlau/doc2vec).
We next calibrate the results for doc2vec against skip-thought [Kiros et al. 2015] and paragram-phrase (pp: Wieting et al., 2016), two recently-proposed competitor document embedding methods. For skip-thought, we use the pre-trained model made available by the authors, based on the book-corpus dataset [Zhu et al. 2015]; for pp, once again we use the pre-trained model from the authors, based on ppdb [Ganitkevitch et al. 2013]. We compare these two models against dbow trained on each of wiki and ap-news. The results are presented in Table 5, along with results for the baseline methods of skip-gram and ngram.
skip-thought performs poorly: its performance is worse than the simpler methods of word2vec vector averaging and ngram. dbow outperforms pp over most Q-Dup subforums, although the situation is reversed for STS. Given that pp is based on word vector averaging, these observations support the conclusion that vector averaging methods work best for shorter documents, while dbow handles longer documents better.
Although not explicitly mentioned in the original paper [Le and Mikolov2014], dbow does not learn embeddings for words in the default configuration. In its implementation (e.g. Gensim), dbow has an option to turn on word embedding learning, by running a step of skip-gram to update word embeddings before running dbow. With the option turned off, word embeddings are randomly initialised and kept at these randomised values.
Even though dbow can in theory work with randomised word embeddings, we found that performance degrades severely under this setting. An intuitive explanation can be traced back to its objective function, which is to maximise the dot product between the document embedding and its constituent word embeddings: if word embeddings are randomly distributed, it becomes more difficult to optimise the document embedding to be close to its more critical content words.
To illustrate this, consider the two-dimensional t-SNE plot [Van der Maaten and Hinton2008] of doc2vec document and word embeddings in Figure 1(a). In this case, the word learning option is turned on, and related words form clusters, allowing the document embedding to selectively position itself closer to a particular word cluster (e.g. content words) and distance itself from other clusters (e.g. function words). If word embeddings are randomly distributed on the plane, it would be harder to optimise the document embedding.
Seeing that word vectors are essentially learnt via skip-gram in dbow, we explore the possibility of using externally trained skip-gram word embeddings to initialise the word embeddings in dbow. We repeat the experiments described in Section 3, training the dbow model using the smaller in-domain document collections in each task, but this time initialise the word vectors using pre-trained word2vec embeddings from wiki and ap-news. The motivation is that with better initialisation, the model could converge faster and improve the quality of embeddings.
Results using pre-trained wiki and ap-news skip-gram embeddings are presented in Table 6. Encouragingly, we see that using pre-trained word embeddings helps the training of dbow on the smaller in-domain document collection. Across all tasks, we see an increase in performance. More importantly, using pre-trained word embeddings never harms the performance. Although not detailed in the table, we also find that the number of epochs to achieve optimal performance (based on development data) is fewer than before.
We also experimented with using pre-trained cbow word embeddings for dbow, and found similar observations. This suggests that the initialisation of word embeddings of dbow is not sensitive to a particular word embedding implementation.
To date, we have focused on quantitative evaluation of doc2vec and word2vec. The qualitative difference between doc2vec and word2vec document embeddings, however, remains unclear. To shed light on what is being learned, we select a random document from STS (tech capital bangalore costliest indian city to live in: survey) and plot the document and word embeddings induced by dbow and skip-gram using t-SNE in Figure 1. (We plotted a larger set of sentences as part of this analysis, and found that the general trend was the same across all sentences.)
For word2vec, the document embedding is a centroid of the word embeddings, given the simple word averaging method. With doc2vec, on the other hand, the document embedding is clearly biased towards the content words such as tech, costliest and bangalore, and away from the function words. doc2vec learns this from its objective function with negative sampling: high frequency function words are likely to be selected as negative samples, and so the document embedding will tend to align itself with lower frequency content words.
We used two tasks to empirically evaluate the quality of document embeddings learnt by doc2vec, as compared to two baseline methods (word2vec word vector averaging and an n-gram model) and two competitor document embedding methodologies. Overall, we found that doc2vec performs well, and that dbow is a better model than dmpv. We empirically arrived at recommendations on optimal doc2vec hyper-parameter settings for general-purpose applications, and found that doc2vec performs robustly even when trained using large external corpora, and benefits from pre-trained word embeddings. To facilitate the use of doc2vec and enable replication of these results, we release our code and pre-trained models.
Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. The Journal of Machine Learning Research, 3:1137–1155.
Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60, Baltimore, USA.