Evaluating the Utility of Document Embedding Vector Difference for Relation Learning

07/18/2019 ∙ by Jingyuan Zhang, et al. ∙ The University of Melbourne

Recent work has demonstrated that vector offsets obtained by subtracting pretrained word embedding vectors can be used to predict lexical relations with surprising accuracy. Inspired by this finding, in this paper we extend the idea to the document level: we generate document-level embeddings, calculate the vector difference between them, and use a linear classifier to classify the relation between the documents. In the context of duplicate detection and dialogue act tagging tasks, we show that document-level difference vectors have utility in assessing document-level similarity, but perform less well in multi-relational classification.


1 Introduction

Document-level relation learning has long played a significant role in Natural Language Processing (“NLP”), in tasks including semantic textual similarity (“STS”), natural language inference, question answering, and link prediction.

Word and document embeddings have become ubiquitous in NLP, whereby words or documents are mapped to vectors of real numbers. Building on this, the work of mikolov2013distributed and P16-1158 demonstrated the ability of word embedding offsets (“DiffVecs”) to complete word analogies (e.g. A:B :: C:-?-). For example, the DiffVec between king and queen is roughly identical to that between man and woman, indicating that the DiffVec may encode an OPPOSITE-GENDER relation in its magnitude and direction, which supports analogy prediction tasks of the form king:queen :: man:-?-. In this paper, we evaluate the utility of document embedding methods in solving analogies in the form of document relation prediction. That is, we evaluate the use of document embedding differences (“DocDiffVecs”) to model document relations, in the context of two tasks: document duplicate detection, and post-level dialogue act tagging. In doing so, we perform a contrastive evaluation of off-the-shelf document embedding models.

We select a range of document embedding methods that are trained in either an unsupervised or supervised manner, and have been reported in recent work to perform well across a range of NLP transfer tasks. In line with P16-1158, we keep the classifier set-up used to perform the relation classification deliberately simple, applying a linear-kernel support vector machine (“SVM”) to the DocDiffVecs. Our results show that DocDiffVec has remarkable utility in binary classification tasks despite the simplicity of the model. However, for multi-relational classification tasks, it only marginally surpasses the baseline model, and unsupervised averaging models are superior to more complex supervised models.

2 Related Work

Recently, advanced word embeddings and neural network architectures have been increasingly used to model word sequences, achieving impressive results in contexts including machine translation, text classification, and sentiment analysis.

2.1 Word Embeddings

In work that revolutionised NLP, mikolov2013distributed proposed word2vec as a means of “pre-training” word embeddings from a large unannotated corpus of text, based on language modelling, i.e. predicting contexts from a word, or words from context. Subsequently, others have proposed alternative pretraining methods, including GloVe Pennington et al. (2014) and Paragram Wieting et al. (2015a).

word2vec is a prediction-based model, in the form of either the skip-gram or cbow (continuous bag-of-words) model. The skip-gram model aims to predict context words from a target word, while cbow predicts the target word from its context words. The general idea can be explained as follows: given a predicted word vector $\hat{v}$ and a target word vector $v_t$, the probability of the target word conditional on the predicted word is calculated by a softmax function:

$$P(v_t \mid \hat{v}) = \frac{\exp(v_t^\top \hat{v})}{\sum_{v' \in V} \exp(v'^\top \hat{v})}$$

where $V$ is the set of all target word vectors. word2vec is trained to minimise the negative log-likelihood of the target word vector given its corresponding predicted word.
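As an illustrative sketch (not the word2vec implementation itself), the softmax over dot products can be computed with numpy as follows; the toy two-word vocabulary and its vectors are invented for the example:

```python
import numpy as np

def target_word_probs(predicted_vec, all_target_vecs):
    """Softmax over the dot products of a predicted word vector with
    every target word vector, as in the word2vec objective."""
    scores = all_target_vecs @ predicted_vec
    scores = scores - scores.max()   # shift for numerical stability
    e = np.exp(scores)
    return e / e.sum()

# toy 2-word vocabulary: one target vector per row
V = np.array([[1.0, 0.0],
              [0.0, 1.0]])
probs = target_word_probs(np.array([2.0, 0.0]), V)
```

Training then minimises the negative log of the probability assigned to the observed target word.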

GloVe performs a low-rank decomposition of the corpus co-occurrence frequency matrix based on the following objective function:

$$J = \sum_{i,j=1}^{|V|} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$

where $w_i$ is a vector for the left context, $\tilde{w}_j$ is a vector for the right context, $X_{ij}$ is the relative frequency of word $j$ in the context of word $i$, $b_i$ and $\tilde{b}_j$ are bias terms, and $f$ is a heuristic weighting function to balance the influence of high versus low term frequencies.

Paragram is trained with supervision over the Paraphrase Database (PPDB) Ganitkevitch et al. (2013), such that embeddings for expressions which are paraphrases of one another have high cosine similarity, and non-paraphrase pairs have low similarity. The embedding for an expression is generated by simple averaging over the word embeddings, with the ultimate result of training the model being word embeddings.

2.2 Document Embeddings

Recently, a lot of work has focused on obtaining “universal” representations for documents. Such models vary vastly in complexity and training approaches. Two typical categories of training methodologies are unsupervised and supervised. Unsupervised document embedding methods like SkipThought Kiros et al. (2015) and FastSent Hill et al. (2016) are trained without supervision using neural networks on large corpora. Some simpler ones are based on pure arithmetic operations over word embedding vectors without training. Such models include averaging, weighted averaging, and weighted averaging with PCA projection Arora et al. (2017) (“WR”, hereafter). The WR model performs weighted averaging over word vectors according to word frequencies in the corpus, and adds an additional layer which modifies the final representation of the sentences using PCA projection.
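To make the unsupervised composition concrete, here is a minimal numpy sketch of plain averaging and the frequency-weighted averaging of Arora et al. (2017); the vectors and unigram probabilities are toy values, and the PCA projection step of the full WR model is omitted:

```python
import numpy as np

def average_embedding(tokens, word_vecs):
    """Plain word-vector averaging for a document."""
    return np.mean([word_vecs[t] for t in tokens], axis=0)

def sif_weighted_embedding(tokens, word_vecs, word_freq, a=1e-3):
    """Frequency-weighted averaging from Arora et al. (2017):
    each word is weighted by a / (a + p(w)), down-weighting frequent
    words.  The full WR model additionally removes the first principal
    component across all documents, which is omitted here."""
    vecs = [word_vecs[t] * (a / (a + word_freq[t])) for t in tokens]
    return np.mean(vecs, axis=0)

# toy vectors and unigram probabilities (invented for illustration)
word_vecs = {"the": np.array([1.0, 0.0]), "cat": np.array([0.0, 1.0])}
word_freq = {"the": 0.05, "cat": 0.001}
avg = average_embedding(["the", "cat"], word_vecs)
wr = sif_weighted_embedding(["the", "cat"], word_vecs, word_freq)
```

Note how the frequent word “the” contributes much less to the weighted representation than the rare word “cat”.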

On the other hand, supervised models are based on richer compositional architectures, including deep feed-forward neural networks Iyyer et al. (2015), convolutional neural networks Kim (2014), attention-based networks Yang et al. (2016), and bidirectional recurrent neural networks, among which the most popular approaches use Bi-LSTMs Palangi et al. (2016); Tang et al. (2015); Wieting et al. (2015b). For a sentence consisting of words $w_1, \dots, w_T$, a bidirectional LSTM computes a set of $T$ vectors $h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$ for $t \in \{1, \dots, T\}$, formed through the concatenation of a forward and a backward LSTM Conneau et al. (2017). To combine values from each dimension of the LSTM hidden states, two common pooling methods are max pooling Collobert and Weston (2008) and average pooling.
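The two pooling strategies can be illustrated with a small numpy sketch; the hidden-state matrix below is an invented stand-in for real BiLSTM outputs:

```python
import numpy as np

def pool_hidden_states(H, method="max"):
    """Combine a (T, d) matrix of BiLSTM hidden states into a single
    d-dimensional sentence vector by pooling over the time axis."""
    if method == "max":
        return H.max(axis=0)    # per-dimension maximum over time steps
    return H.mean(axis=0)       # per-dimension average over time steps

# stand-in hidden states for a 3-word sentence, hidden size d = 2
H = np.array([[0.1, 0.9],
              [0.5, -0.2],
              [0.3, 0.4]])
v_max = pool_hidden_states(H, "max")
v_avg = pool_hidden_states(H, "mean")
```

Max pooling keeps the sharpest activation per dimension, while average pooling blends all time steps equally.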

Such models require training data, generally in the form of paraphrastic text corpora or inference datasets.

Infersent Conneau et al. (2017) is one state-of-the-art supervised method, which is trained over natural language inference datasets SNLI Bowman et al. (2015) and MultiNLI Williams et al. (2018). The authors suggest a BiLSTM with max pooling as the best configuration for Infersent, and provide a pre-trained model which generates 4096-dimensional document embeddings.

Another method, proposed by DBLP:conf/acl/GimpelW18, is also based on a BiLSTM architecture with either average pooling or max pooling, but trained on a back-translated corpus called PARANMT-50M, containing over 51 million English–English sentential paraphrase pairs based on CzEng 1.6 Bojar et al. (2016). We refer to this model as “BiLSTM-NMT-avg” or “BiLSTM-NMT-max”, for the average and max pooling variants, respectively.

3 Methodology and Resources

3.1 Relation Learning

Lexical relations for words take the form of a directed binary relation between a word pair. For example, (take, took) has the relation PAST-TENSE, and (person, people) the PLURAL relation. Recent approaches to lexical relation learning based on representation learning have made noteworthy contributions to NLP tasks including relation extraction, relation classification, and relation analogy Vylomova et al. (2016). A considerable amount of research has been dedicated to explaining the success of word embedding models in lexical relation learning, including work focused on vector differences (DiffVec). Research by P16-1158 was the first to systematically test both the effectiveness and generalizability of DiffVec across a broad range of lexical relations. The authors performed both clustering and classification experiments on a dataset consisting of over 12,000 triples covering 15 relation types. Clustering of DiffVecs revealed that many relations formed tight clusters with clear boundaries. Based on this finding, they trained supervised models on DiffVecs to assess their potential in classifying lexical relations. With the addition of negative sampling, they were able to achieve impressive performance in capturing semantic and syntactic differences between words, using a simple SVM model trained on DiffVecs for both open- and closed-world classification experiments. The paper also conducted a cross-comparison of different word embedding methods for generating DiffVecs.

Correspondingly, document relations are relations between document pairs. Document relations can also be described in the form of binary relations, similar to word-level ones. For example, (I walk to school, I go to school on foot) can be viewed as having a synonym or paraphrastic relation. With research having shifted focus from word embeddings to document embeddings, more efforts have been put into learning document relations. In recent years, an increasing number of shared datasets have been generated to improve certain types of sentential relation learning, including SemEval Cer et al. (2017) and the work of Lee and Welsh (2005) for STS; SNLI Bowman et al. (2015) and MultiNLI Williams et al. (2018) for natural language inference; and WikiQA Yang et al. (2015) and QAsent Wang et al. (2007) for question-answering.

The majority of models reported to perform well on document relation learning tasks depend on a joint model to aggregate two sentence vectors in a relation tuple for relation representation. This includes the joint model from Infersent that learns entailment relations, CNN-based joint models for learning paraphrastic and question-answering relations Yin and Schütze (2015), and the widely used cosine similarity between sentence vectors. However, no systematic evaluation has been performed on using document vector offsets (DocDiffVec) alone for document relation learning. Taking our lead from DiffVec, this paper is the first to naturally extend the DiffVec evaluation paradigm to the document level.

3.2 Learning Scheme

For document relation learning, an aggregation mechanism is required to combine two sentence vectors $\vec{a}$ and $\vec{b}$ into a single relation vector $\vec{r}$. In this paper, the aggregation model is as simple as calculating $\vec{r} = \vec{b} - \vec{a}$ for a given triple $(\vec{a}, \vec{b}, y)$ with relation label $y$. That is, classification is performed over instances of the form $(\vec{b} - \vec{a}, y)$.

The objective is to train a model over all training instances that best predicts the missing label $y$ from within the task domain for test relation tuples $(\vec{a}, \vec{b}, ?)$. Following Vylomova et al. (2016), the learner in our experiments is a linear-kernel SVM.

In this paper, we assess the utility of DocDiffVecs in learning document relations in two scenarios: (1) document-level similarity modelling, and (2) multi-relational classification.

In a document-level similarity modelling context over unordered document pairs, there is no well-defined way of ordering the documents when calculating the vector offset. To take an (overly) simplified example, for the sentence pair (The man put the box down, The man dropped the box) encoded into the 3-d vector pairing $(\vec{a}, \vec{b})$, and the unordered lexical relation of paraphrastic similarity, it is impossible to define a priori which of the two sentences should be the subtrahend or the minuend. Directly taking the offset will result in two possible DocDiffVecs, $\vec{a} - \vec{b}$ or $\vec{b} - \vec{a}$, depending on the ordering of the two sentences. We make the simplifying assumption that for similarity modelling, the relation depends only on the magnitude and not the direction of the DocDiffVec, and therefore calculate $|\vec{a} - \vec{b}|$ using the element-wise absolute value of the offset, eliminating the impact of directionality in each dimension.
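A minimal sketch of this directionless offset, with invented 3-d vectors:

```python
import numpy as np

def doc_diff_vec(a, b, directed=True):
    """DocDiffVec between two document vectors.  For unordered pairs
    (similarity modelling), the element-wise absolute value removes
    the arbitrary choice of minuend and subtrahend."""
    d = a - b
    return d if directed else np.abs(d)

# toy document vectors
a = np.array([0.2, -0.5, 0.1])
b = np.array([0.4, 0.3, 0.1])
sym_ab = doc_diff_vec(a, b, directed=False)
sym_ba = doc_diff_vec(b, a, directed=False)
```

Either ordering of the pair now yields the same relation vector.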

4 Datasets

4.1 CQAdupstack Dataset

The cqadupstack dataset Hoogeveen et al. (2015) is focused on the tasks of question answering and thread duplicate detection. In this paper, we only consider the thread duplicate detection setting (“Q-DUP” hereafter). The dataset consists of question threads crawled from 12 StackExchange (https://stackexchange.com/) subforums. Nowadays, with the number of questions asked on Q&A forums growing dramatically, many newly-posted questions overlap in content with previously posted (and answered) questions. Q-DUP is the task of automatically identifying questions which pre-exist in the forum, and prompting the question asker with possible solutions from the duplicate threads, in addition to reducing duplicated workload for the forum community Hoogeveen et al. (2018). Each question includes a title, body content and complementary information such as the date of posting and number of votes. The dataset also lists the thread IDs of all duplicates of each thread. The data distribution is highly skewed, because only a small fraction (ranging from 1.52% to 9.31%, depending on the forum) of threads have one or more duplicates.

4.2 CNET Forum Dataset

To evaluate the multi-relational classification utility of DocDiffVec, we use the cnet forum dataset Kim et al. (2010), and the dialogue act tagging task (“DA” hereafter). The dataset is made up of 320 threads comprising 1332 posts from four different subforums of the cnet website (http://forums.cnet.com/?tag=TOCleftColumn.0). Apart from textual features including post title and body, each post contains structural features such as author name and position of the post in the thread. Each post is manually labelled with one or more parent posts that it relates to, and a unique dialogue act for each link. As the forum is troubleshooting-oriented, the 12 dialogue acts present in the dataset capture the nature of the dialogue interaction, including: Question-Question (a newly posed question), Answer-Answer (a solution to a question), and Question-Correction (correction of an error in a question). In our experiments, we assume knowledge of the parent post(s) of each post, and perform only the DA tagging task. The cnet dataset has a characteristically skewed class distribution, with the majority DA label (Answer-Answer) accounting for 40.3% of post pairs in the dataset.

4.3 Data Preprocessing

All textual data in the two datasets is cleaned and tokenized using the script provided with cqadupstack (https://github.com/D1Doris/CQADupStack). Document embedding models are then treated as black-box tools that take as input a sentence $s$ and output a vector representation $\vec{v} \in \mathbb{R}^d$, where $d$ is the dimensionality of the embedding, determined by the sentence encoder.

In order to compare different sentence encoders, which generate different document vectors and thus different DocDiffVecs, we use four representative models, two unsupervised and two supervised: the word averaging model (unsupervised), the WR model (unsupervised), Infersent (supervised), and BiLSTM-NMT (supervised). For the unsupervised models, we further enrich the model variety by using different pretrained word embeddings as inputs, including word2vec (the 300-dimensional version pre-trained on Google News), GloVe, and Paragram (the Paragram-SL999 version). In the supervised setting, we use the publicly available Theano implementation (https://github.com/jwieting/para-nmt-50m) to train BiLSTM-NMT, and slightly modify the code to convert it into a general-purpose sentence encoder that can vectorize arbitrary text by loading trained models. We preserve all hyperparameters and settings, and use Paragram-SL999 word embeddings (https://www.cs.cmu.edu/~jwieting/) to initialize the input sentences, following the original paper Gimpel and Wieting (2018). We train two BiLSTM-NMT models that both output 4096-dimensional document embeddings, with max pooling and mean pooling respectively, for comparison. We also keep the native settings for Infersent using the original implementation (https://github.com/facebookresearch/InferSent), which uses GloVe word embeddings and a dimensionality of 4096 for output sentence vectors.

For Q-DUP, the label $y$ is a binary variable indicating whether a pair of questions are duplicates or not. Generating all possible question pairings for this task leads to an intractable number of triples (in the billions), with only a tiny fraction being duplicates. For efficiency, we abandon the natural data distribution and choose to keep all duplicates but subsample the non-duplicates to a feasible number, in line with earlier work on the dataset Lau and Baldwin (2016). The number of duplicate pairs ranges from around 1,000 to 4,000, depending on the subforum. We randomly allocate 90% of the duplicates to the training set and the other 10% to the test set for each subforum. We then subsample 5000 times more non-duplicates than duplicates for both training and testing in each subforum.
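The subsampling step can be sketched as follows, using a hypothetical `subsample_pairs` helper and toy thread IDs (the real pipeline operates over the cqadupstack pairings):

```python
import random

def subsample_pairs(positives, negatives, ratio, seed=0):
    """Keep every duplicate pair, but only `ratio` times as many
    randomly-drawn non-duplicate pairs, abandoning the natural
    (highly skewed) pair distribution."""
    rng = random.Random(seed)
    k = min(len(negatives), ratio * len(positives))
    return positives + rng.sample(negatives, k)

# toy thread-ID pairs
pos = [("q1", "q2")]
neg = [("q1", "q%d" % i) for i in range(3, 100)]
sample = subsample_pairs(pos, neg, ratio=5)
```

With one positive pair and a ratio of 5, the sample contains six pairs in total.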

For DA tagging over cnet, the label $y$ belongs to one of the 12 dialogue act tags, e.g. Answer-Answer. All 1332 labelled instances are used, and where a post has multiple parent posts, each link is treated as a separate instance. We randomly split the data into 10 folds for cross-validation.

We use the scikit-learn package in Python to implement the SVM models, with default parameters.
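A minimal sketch of this set-up with scikit-learn, on synthetic DocDiffVecs rather than real embeddings (duplicates are simulated as small-magnitude offsets, non-duplicates as large ones):

```python
import numpy as np
from sklearn.svm import SVC

# synthetic absolute DocDiffVecs, invented for illustration:
# duplicates cluster near 0, non-duplicates further away
rng = np.random.RandomState(0)
X_dup = np.abs(rng.normal(0.0, 0.1, size=(50, 4)))
X_non = np.abs(rng.normal(1.0, 0.1, size=(50, 4)))
X = np.vstack([X_dup, X_non])
y = np.array([1] * 50 + [0] * 50)

clf = SVC(kernel="linear")   # linear-kernel SVM, default parameters
clf.fit(X, y)
acc = clf.score(X, y)
```

On this trivially separable toy data the classifier fits the training set almost perfectly; real DocDiffVecs are, of course, far noisier.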

Model AUC
Average (word2vec) 0.75
Average (GloVe) 0.78
Average (Paragram) 0.75
WR (word2vec) 0.74
WR (GloVe) 0.79
WR (Paragram) 0.75
Infersent 0.91
BiLSTM-NMT-max 0.90
BiLSTM-NMT-avg 0.85
dbow (WIKI) 0.91
dbow (AP) 0.90
Table 1: AUC Scores for cqadupstack

5 Evaluation and Discussion

We conduct evaluation from two aspects for the two datasets. In order to evaluate how well DocDiffVecs capture relational differences across the different tasks, we conduct absolute performance comparisons between results produced by DocDiffVec models and the state-of-the-art models. Our intention here is to determine whether the highly simplistic and general-purpose DocDiffVec approach is competitive with methods that are customized to the task/dataset. Additionally, we are interested in the cross-comparison between document embedding models, to determine whether there are substantial empirical differences between them, and the possible causes of any differences.

5.1 Duplication Detection

For the Q-DUP task, we train an SVM model for each document embedding method over each of the 12 subforums, and evaluate using the ROC AUC score due to the extremely skewed data distribution. The ROC AUC score indicates the probability that the model ranks a randomly-chosen positive sample above a randomly-chosen negative sample. An AUC score of 1.0 indicates that the model is perfect at ranking true duplicates ahead of false duplicates, while 0.5 signifies a completely random ranking (and any value less than that, a worse-than-random ranking). As the SVM classifier does not provide an explicit probability to use for ranking, we calculate a similarity score based on min-max normalisation of the decision distances:

$$\text{sim} = \frac{d - d_{\min}}{d_{\max} - d_{\min}}$$

where $d$ is the signed distance from the instance to the positive decision boundary of the SVM, and $d_{\min}$ and $d_{\max}$ correspond to the minimum and maximum distances among the test instances. We present the AUC results in Table 1.
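A sketch of this ranking-score construction, assuming min-max normalisation of the decision distances, together with the resulting AUC on invented distances:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def similarity_scores(distances):
    """Min-max normalise signed distances to the SVM decision
    boundary so they can be used as ranking scores in [0, 1]."""
    d = np.asarray(distances, dtype=float)
    return (d - d.min()) / (d.max() - d.min())

# invented decision_function outputs and gold duplicate labels
dist = [-2.0, -0.5, 0.3, 1.7]
y_true = [0, 0, 1, 1]
scores = similarity_scores(dist)
auc = roc_auc_score(y_true, scores)
```

Because min-max normalisation is monotone, the AUC is unchanged from ranking on the raw distances; the rescaling only maps scores into [0, 1].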

For the unsupervised approaches (the top block in the table), the variance between models is slight, but there is a clear pattern of models built on GloVe performing slightly better than those built on the other two word embedding models, with the WR compositional model showing a tiny advantage. While GloVe benefits from the WR transformation, the other two models do not.

In terms of the more complex supervised models (the middle block), Infersent outperforms BiLSTM-NMT by a very small margin, and beats the unsupervised models by a large one. While we only present aggregate numbers in the paper, across all of the individual subforums, max pooling beats mean pooling for the BiLSTM-NMT model, despite all other settings being identical. This could potentially be explained by the phenomenon discussed by D17-1070, that mean pooling does not make sharp enough choices about which parts of the sentence are more important. Apart from using the widely adopted BiLSTM architecture for sentence encoding, the success of the BiLSTM-NMT model on this task might also derive from its paraphrastic training objective (optimizing a cosine similarity margin loss). The success of Infersent is not surprising because, in its joint model, it computes $(u, v, |u - v|, u \ast v)$ as features to predict relations, where $u$ and $v$ are the sentence vectors in a relation pair; that is, it explicitly models $|u - v|$, which is identical to what we use for DocDiffVec. Though this is not strictly “cheating”, as the model is trained on the related but non-identical NLI task, Infersent certainly has the advantage of explicitly capturing DocDiffVec as a subspace of its larger feature space. Infersent’s small advantage over BiLSTM-NMT on this task may also be attributable to the different word embeddings it uses (GloVe vs. Paragram), given that GloVe was the pick of the unsupervised methods.
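The InferSent-style joint feature construction can be sketched as follows (toy 2-d vectors stand in for the real 4096-d sentence embeddings):

```python
import numpy as np

def infersent_joint_features(u, v):
    """The InferSent relation features (u, v, |u - v|, u * v);
    the |u - v| component is exactly the absolute DocDiffVec
    used in this paper."""
    return np.concatenate([u, v, np.abs(u - v), u * v])

u = np.array([1.0, 2.0])
v = np.array([3.0, -1.0])
feats = infersent_joint_features(u, v)
```

The classifier over these features therefore sees the DocDiffVec explicitly, alongside the raw vectors and their element-wise product.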

To calibrate these results against the state of the art for the dataset, we compare ourselves against the best AUC results reported by W16-1609, who fine-tuned doc2vec for Q-DUP. From the different doc2vec models they propose, we compare against the best of the dbow models, which were trained on either English Wikipedia (the dump dated 2015-12-01, cleaned using WikiExtractor: https://github.com/attardi/wikiextractor; “WIKI”) or AP-NEWS (a collection of Associated Press English news articles from 2009 to 2015; “AP”), using pretrained word vectors from word2vec. Note that W16-1609 simply rank question pairs based on cosine similarity over the dbow representations. DocDiffVecs obtained from both the BiLSTM-NMT and Infersent embedding models are highly competitive with the doc2vec approach, with the Infersent model equalling dbow trained on English Wikipedia with an AUC score of 0.91.

Model Score
Average (word2vec) 0.65
Average (GloVe) 0.63
Average (Paragram) 0.60
WR (word2vec) 0.41
WR (GloVe) 0.63
WR (Paragram) 0.59
Infersent 0.56
BiLSTM-NMT-max 0.57
BiLSTM-NMT-avg 0.58
SVM-HMM 0.57
baseline 0.64
Table 2: Results for cnet (F1-score)

5.2 Dialogue Act Classification

We test the ability of DocDiffVec to recognise more complex and diverse relations beyond the binary duplicate detection domain, in the form of the post-to-post dialogue act (DA) tagging task. Here, we evaluate in terms of micro-averaged F1-score. According to Table 2, the unsupervised models (once again, the top block in the table) surprisingly turn the tide, outperforming the supervised embedding models, with the simplest averaging model built on word2vec attaining an F1-score of 0.65.

In reality, the dialogue act tags depend heavily on structural and contextual features of the post pairs. For example, if we want to detect an answer to a question, the answer is certainly located after where the question is posted, and tends to have a different author to the question requester, as rarely does the requester propose a solution to his/her own question. Similarly, in terms of a Question-Correction relation, it is likely that the two posts have the same author. Previous approaches using CRF, SVM-HMM, ME Kim et al. (2010) and the improved versions using CRFSGD Wang et al. (2011) all make use of such features in addition to words from the post title and body of the post. As DocDiffVec models do not include those features, we compare them with the SVM-HMM models in Kim et al. (2010) that are based solely on lexical features, including lexical unigrams and bigrams, and POS tags. We also compare DocDiffVec models with the heuristic baseline of Kim et al. (2010), where the first post is always classified as a Question-Question, and all subsequent posts are classified as an Answer-Answer. This baseline achieved a reasonably high F1-score of 0.64, due to the high proportion of Answer-Answer and Question-Question tags, and the utility of positional information. Our best DocDiffVec model passes the baseline by a mere 0.01, but comfortably beats the SVM-HMM model. It was unexpected that all supervised models would perform poorly, below the baseline by quite a margin, at a similar level to the SVM-HMM model.

Note that the state-of-the-art result for this dataset is that of Liu et al. (2017), based on a memory-augmented CRF with structural and post-author features as side information. They achieve an F1-score of 0.78 with a much more complex supervised model, clearly above our best result, but given the simplicity and flexibility of our approach, an F1-score of 0.65 is plausibly competitive.

By analyzing the confusion matrix for the Average (word2vec) model, we found that only the Answer-Answer, Question-Question and Resolution relations are correctly recognized at an acceptable level; these are the three most common tags in the data. For rarer tags, the F1-scores approach 0, indicating that the model has limited ability to further distinguish DocDiffVecs into more specific subclasses. Also, the three well-classified tags have recall greater than precision, which suggests a tendency for the model to classify unknown or uncertain relations into dominant classes. As most Question-Question relations are associated with reentrant links (the link from the parent node in the thread is to the post itself), the DocDiffVec will always be $\vec{0}$, which is easy for the classifiers to detect, leading to a particularly high F1-score of 0.9 for the class.

It is interesting that for the BiLSTM-NMT models, precision is quite a bit lower than recall for the Question-Question tag, which seems to be the main reason for their poor performance. Tying this result back to Q-DUP, the supervised embedding models suffer from their unnecessarily strong ability to capture semantic similarity in their DocDiffVecs for the DA task. Most linked posts in this dataset are very similar in terms of topic and content, as they belong to the same question thread and discuss the same specific issue. This causes posts within a thread to have high semantic similarity, and the DocDiffVecs to be close to 0 in magnitude, hence the prevalence of Question-Question misclassifications.

More generally, the reason why DocDiffVec models do not perform better can be ascribed to: (1) the SVM overfitting to the majority tags, due to data sparsity; (2) subtle distinctions between less common dialogue acts being difficult to make without structural features or post metadata (e.g. author, position of post), regardless of the document embedding model used; and (3) DocDiffVec being incapable of differentiating multiple dialogue acts in a single linear vector space.

Overall, we cautiously conclude that DocDiffVecs have quantifiable but ultimately limited utility for multi-relational classification, especially in contexts where extra-linguistic factors have high import.

5.3 Discussion and Future Work

Drawing the two tasks together, we can observe that DocDiffVecs have remarkable utility in document similarity modelling, but are weaker at multi-relational classification tasks. Ultimately, however, further experimentation over other tasks is required to determine how well DocDiffVec performs over tasks beyond document similarity, such as entailment, summarization (e.g. body of an article versus its title), and question-answering.

The conclusion that DocDiffVec does not model multi-relational classification tasks well ties in with recent work on knowledge graph (“KG”) embedding, such as the TransR model Lin et al. (2015). Traditional KG embedding models represent all relations and entities in a single semantic space, regarding a relation as a translation from a head entity to a tail entity (similar to DiffVec). Nevertheless, one semantic space is considered insufficient, because each pair of entities is likely to be associated across a number of relations. To overcome the multi-relational weakness of KG embedding DiffVecs, TransR learns an exclusive vector space for each distinct relation, and shows this to result in significant improvements for KG embedding. The relations in TransR are still represented in DiffVec form, but are calculated after transforming entity embeddings into a vector space customized to a given relation. Analogously, when training universal document embedding models, sentences are trained as “entities” in a semantic space. In future work, we propose to assess whether DocDiffVecs can automatically capture relations between sentences during training, despite not being explicitly trained to learn relations, as in KG models. That is, the approach of TransR is potentially also a good fix for multi-relational learning at the document level.
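A minimal sketch of the TransR scoring idea, with toy 2-d entities and an identity matrix standing in for a learned relation-specific projection $M_r$:

```python
import numpy as np

def transr_score(h, t, r, M_r):
    """TransR-style score: project entity vectors h and t into the
    relation-specific space via M_r, then measure how well r acts
    as a translation between them (lower is better)."""
    h_r = h @ M_r
    t_r = t @ M_r
    return np.linalg.norm(h_r + r - t_r)

# toy 2-d entities; identity projection for illustration
h = np.array([1.0, 0.0])
t = np.array([1.0, 1.0])
r = np.array([0.0, 1.0])
score = transr_score(h, t, r, np.eye(2))
```

Here the relation vector exactly translates the projected head onto the projected tail, so the score is 0; a learned $M_r$ per relation lets the same entity pair score well under multiple relations.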

Finally, instead of using a simple linear kernel SVM to classify DocDiffVecs, it would of course be possible to use more sophisticated classifiers, such as deep neural networks, to explore more complex, non-linear composition of the directions and magnitudes encoded in DocDiffVecs.

6 Conclusions

Taking inspiration from work on word-level embedding vector offsets for lexical relation learning, this paper is the first to evaluate document embedding vector differences for modelling document-to-document relations. By using a simple SVM model to classify DocDiffVecs, we found, through experiments on a document duplicate detection task, that BiLSTM-based document embedding models generate DocDiffVecs that are highly useful for textual similarity, competitive with the state of the art. At the same time, we found that for multi-relational classification, in the context of dialogue act tagging, DocDiffVecs obtained from simple averaging of word embeddings outperform both an informed baseline and complex neural sentence encoders. Overall, we conclude that DocDiffVec has reasonable utility for document relation learning.

References