Modern computational methods for Natural Language Processing (NLP) rely on embeddings into metric spaces, such as the Euclidean space and, more recently, non-linear spaces such as the Wasserstein space, to achieve state-of-the-art performance on various tasks. In these embeddings, semantic differences and similarities between words and documents correspond to distances in the representation space. For embeddings into Euclidean spaces, a large body of work is based on Word2vec (mikolov2013distributed), where each word is represented as a vector in the Euclidean space. From these word embeddings one can further compute document and sentence embeddings using various models (ramos2003using; arora2016simple; wang2020sbert; le2014distributed; kiros2015skip; logeswaran2018efficient) for higher-level NLP tasks.
Instead of embedding and comparing documents in the Euclidean space, the Word Mover's Distance (WMD) (kusner2015word) was proposed to measure the similarity between documents in the Wasserstein space (COT_book), representing documents as (empirical) probability distributions. In (huang2016supervised) WMD is used for supervised learning, and more recently in (yurochkin2019hierarchical) for multi-scale representation.
To understand how these models work, considerable effort has been devoted to improving the interpretability of these embeddings. In (AroraLLMR16) the authors proposed a linear algebraic structure to explain the polysemy of words. Recent works have attempted to explain the meaning of each dimension, such as sparse word embeddings (faruqui-etal-2015-sparse; panigrahi-etal-2019-word2sense) and the POLAR framework (Binny2020). To make WMD embeddings interpretable, (Carin2018) proposed an unsupervised topic model in the representation space.
In this work our focus is on enabling interpretable supervised WMD embeddings of the documents. Below we summarize the main contributions in this direction.
Summary of main contributions: A new approach for contrastive representation learning is proposed by enforcing a clustering-promoting mechanism using a set of anchors that are themselves learned from the data. In contrast to previous approaches (huang2016supervised; kusner2015word), this allows for interpretability, i.e., it allows one to determine which words are important for a particular class. Furthermore, compared to k-Nearest Neighbors (KNN), classification using the learned anchors is faster (the number of distance computations scales with the small number of anchors rather than with the size of the training set), and our method can be generalized to other supervised contrastive learning settings. Results on public data sets, as well as on a novel data set evaluating written scientific work by students, show the superiority and utility of our method.
2 Problem formulation and approach
We are given $N$ documents, each belonging to one of $C$ classes. Each document $D$ with label $y \in \{1,\dots,C\}$ is represented by two sets, $\{w_1,\dots,w_n\}$ and $\{c_1,\dots,c_n\}$, where $n$ is the number of unique words in the document, $w_j$ is the $j$-th word, and $c_j$ is the number of times $w_j$ appears in $D$. We suppress the dependency of $n$ on $D$ for the sake of brevity.
Using pre-trained word embeddings from GloVe (pennington2014glove), $D$ is represented as a tuple $(X, d)$, where $X = [x_1,\dots,x_n] \in \mathbb{R}^{m \times n}$ and $x_j \in \mathbb{R}^m$ is the embedding of the word $w_j$. The $j$-th entry of $d$ is $d_j = c_j / \sum_{k=1}^{n} c_k$, the normalized histogram over the words in the document.
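As a concrete illustration, the tuple $(X, d)$ can be assembled as follows. This is a minimal sketch with a toy three-dimensional embedding table standing in for the pre-trained GloVe vectors; the function name `represent` is ours, not from the paper's code.

```python
from collections import Counter
import numpy as np

# Toy 3-d vectors standing in for pre-trained GloVe embeddings.
EMBEDDINGS = {
    "cricket": np.array([0.9, 0.1, 0.0]),
    "bat":     np.array([0.7, 0.2, 0.1]),
    "score":   np.array([0.4, 0.5, 0.1]),
}

def represent(tokens, embeddings):
    """Build (X, d): columns of X are the embeddings of the document's
    unique words; d is the normalized word-count histogram."""
    counts = Counter(t for t in tokens if t in embeddings)
    words = sorted(counts)
    X = np.stack([embeddings[w] for w in words], axis=1)   # shape (m, n)
    d = np.array([counts[w] for w in words], dtype=float)
    d /= d.sum()
    return words, X, d

words, X, d = represent(["cricket", "bat", "cricket", "score"], EMBEDDINGS)
```

Each document thus yields its own pair $(X, d)$, with $n$ varying from document to document.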
Problem statement: Given labeled data $\{(X^i, d^i, y^i)\}_{i=1}^{N}$, we seek to learn a representation map $\Phi$ such that a Nearest Neighbor (NN)-type classifier in the representation space accurately classifies the documents.
Using NN in the representation space requires a notion of similarity or distance between documents. We use the WMD (kusner2015word), defined as follows. Given the representations of two documents, $(X^1, d^1)$ and $(X^2, d^2)$, viewed as empirical measures $\mu_1 = \sum_{j} d^1_j\,\delta_{x^1_j}$ and $\mu_2 = \sum_{k} d^2_k\,\delta_{x^2_k}$, the WMD between $\mu_1$ and $\mu_2$ is defined as (kusner2015word)
$$\mathrm{WMD}(\mu_1,\mu_2) \;=\; \min_{T \geq 0} \sum_{j,k} T_{jk}\,C_{jk}, \qquad \text{such that } T\mathbf{1} = d^1 \text{ and } T^{\top}\mathbf{1} = d^2.$$
Here the matrix $C$, with entries $C_{jk} = c(x^1_j, x^2_k)$, is referred to as the ground cost.
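The optimization above is a linear program over couplings $T$. A minimal sketch using `scipy.optimize.linprog` with the squared Euclidean ground cost could look like the following; this is for illustration only (in practice, and in this paper, the Sinkhorn algorithm is used instead).

```python
import numpy as np
from scipy.optimize import linprog

def wmd(X1, d1, X2, d2):
    """Exact WMD as a linear program: minimize <T, C> over couplings T
    with row sums d1 and column sums d2; C is the squared Euclidean
    ground cost between embedded words (columns of X1, X2)."""
    n, k = X1.shape[1], X2.shape[1]
    diff = X1[:, :, None] - X2[:, None, :]           # (m, n, k)
    C = (diff ** 2).sum(axis=0)                       # ground cost matrix
    A_eq = np.zeros((n + k, n * k))
    for j in range(n):                                # row sums: T 1 = d1
        A_eq[j, j * k:(j + 1) * k] = 1.0
    for l in range(k):                                # col sums: T' 1 = d2
        A_eq[n + l, l::k] = 1.0
    res = linprog(C.ravel(), A_eq=A_eq,
                  b_eq=np.concatenate([d1, d2]), bounds=(0, None))
    return float(res.fun)
```

For identical documents the cost is zero; for two single-word documents it reduces to the squared distance between their embeddings.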
Our key idea is to learn a set of anchors $\{A_c\}_{c=1}^{C}$ in the representation space, one per class, where each anchor $A_c \in \mathbb{R}^{m \times s}$ has $s$ support points. Anchors offer two advantages. First, they enable direct and simple NN classification. Second, using the anchors we can identify words that have discriminatory power for particular classes, thereby enabling interpretability.
2.1 Proposed approach
The representation map $\Phi_A$ is applied to a document $(X, d)$ column-wise,
$$\Phi_A(X) = A \odot X,$$
where $\odot$ denotes the element-wise product (broadcast across the columns of $X$), obtaining a representation $(\Phi_A(X), d)$. Here, $A$ is the transformation matrix.
Given this set-up, our approach is to learn the anchors via contrastive learning (see Figure 1) (chen2020simple; khosla2020supervised) in the representation space. In contrastive learning one defines triplets $(z, A^{+}, A^{-})$, where $z$ is the representation of a document with label $y$, and $A^{+}$ and $A^{-}$ are anchors of class $y$ and of a different class, respectively. Assuming a uniform measure on the support of the anchor points, we can use the WMDs $\mathrm{WMD}(z, A^{+})$ and $\mathrm{WMD}(z, A^{-})$ as the similarity, i.e., contrastive, measure. We contrast each document with all the anchors; thus we have $C-1$ triplets for each document. To allow for end-to-end training, we use entropic regularization and the Sinkhorn algorithm to compute the Wasserstein distance (cuturi2013sinkhorn).
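The entropic-regularized distance can be sketched in a few lines of numpy; each Sinkhorn iteration is differentiable, which is what makes end-to-end training of the anchors possible. The hyperparameters `eps` and `n_iter` below are illustrative, not the paper's settings.

```python
import numpy as np

def sinkhorn_distance(C, d1, d2, eps=0.1, n_iter=200):
    """Entropic-regularized OT cost via Sinkhorn scaling
    (cuturi2013sinkhorn): C is the ground-cost matrix, d1 and d2 the
    marginal histograms, eps the regularization strength."""
    K = np.exp(-C / eps)                  # Gibbs kernel of the ground cost
    u = np.ones_like(d1)
    for _ in range(n_iter):               # alternating marginal projections
        v = d2 / (K.T @ u)
        u = d1 / (K @ v)
    T = u[:, None] * K * v[None, :]       # (approximate) transport plan
    return float((T * C).sum())
```

As `eps` shrinks, the result approaches the unregularized WMD, at the price of slower convergence and potential numerical underflow in the kernel.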
Given $N$ documents, in order to train the model parameters $A$ and $\{A_c\}_{c=1}^{C}$, we minimize the following triplet loss function (hermans2017defense):
$$\mathcal{L}_{\mathrm{triplet}} = \sum_{i=1}^{N} \sum_{c \neq y^i} \max\!\big(0,\; \mathrm{WMD}(z^i, A_{y^i}) - \mathrm{WMD}(z^i, A_c) + \alpha\big),$$
where $z^i$ denotes the representation of document $i$ and the constant $\alpha$ is a margin hyperparameter. We also employ the InfoNCE loss (oord2018representation) to train the model,
$$\mathcal{L}_{\mathrm{InfoNCE}} = -\sum_{i=1}^{N} \log \frac{\exp\!\big(-\mathrm{WMD}(z^i, A_{y^i})/\tau\big)}{\sum_{c=1}^{C} \exp\!\big(-\mathrm{WMD}(z^i, A_c)/\tau\big)},$$
where $\tau$ denotes a temperature parameter.
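Given the WMD distances from a document's representation to the anchors, both losses reduce to a few lines. The following numpy sketch uses the standard forms of the triplet and InfoNCE losses; the function and argument names are ours.

```python
import numpy as np

def triplet_loss(d_pos, d_negs, margin=1.0):
    """Hinge loss: the WMD to the correct-class anchor (d_pos) should be
    smaller than the WMD to every other anchor (d_negs) by `margin`."""
    return float(np.maximum(0.0, d_pos - np.asarray(d_negs) + margin).sum())

def infonce_loss(distances, pos_idx, tau=0.1):
    """InfoNCE over negated distances: a softmax cross-entropy in which
    the correct-class anchor (index pos_idx) should be nearest."""
    logits = -np.asarray(distances) / tau
    logits = logits - logits.max()            # numerical stability
    return float(-(logits[pos_idx] - np.log(np.exp(logits).sum())))
```

Both are zero (or near zero) exactly when the correct-class anchor is decisively the closest one, which is the clustering-promoting behavior the method relies on.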
NN classification: Once the model is trained, in order to classify a test document, we first embed it in the representation space using the learned parameters via equation (1), and then compute the WMD distances between the representation and the anchors $\{A_c\}_{c=1}^{C}$. The class represented by the anchor with the minimum distance is declared as the label.
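The classification rule is a simple argmin over anchor distances. A sketch, with `dist_fn` standing in for the WMD and `anchors` a hypothetical class-to-anchor mapping:

```python
import numpy as np

def classify(test_repr, anchors, dist_fn):
    """Nearest-anchor rule: `anchors` maps class label -> anchor
    representation; `dist_fn` is any document distance (WMD in the paper)."""
    return min(anchors, key=lambda c: dist_fn(test_repr, anchors[c]))

# Toy usage with Euclidean distance standing in for the WMD.
anchors = {"low": np.zeros(2), "high": np.full(2, 5.0)}
label = classify(np.array([1.0, 1.0]), anchors,
                 lambda x, y: float(np.linalg.norm(x - y)))
```

This requires only one distance evaluation per class, rather than one per training document as in KNN.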
Interpretability: To show how the anchors can be used to discover important discriminative words, we refer the reader to Section 3.1, where we illustrate this with a concrete example.
For all datasets, the ground cost in WMD is set to the squared Euclidean distance, and we use the Sinkhorn algorithm to compute it (COT_book). The various hyper-parameters, including the triplet-loss margin $\alpha$ and the InfoNCE temperature $\tau$, are set using cross-validation. We used the Adam optimizer (kingma2014adam) with a learning rate of 0.1. We employed the pre-trained GloVe (pennington2014glove) word vectors as the representation of words. To avoid overfitting, we employed regularization, with its parameter likewise set by cross-validation.
[Table 1 columns: Dataset, #Train, #Test, BoW dim, Avg. words]
Public Datasets: Information about the public datasets is shown in Table 1; they can be found at https://github.com/gaohuang/S-WMD. In Table 2 we show the results of WMD (kusner2015word), supervised WMD (S-WMD) (huang2016supervised), and our method. Our method outperforms WMD and S-WMD on seven of the eight datasets.
[Table 2 rows include: Ours (triplet loss), Ours (InfoNCE loss)]
Figure 2 shows how our model leads to interpretability. Under the contrastive loss, the gap between the WMD to the anchors of other classes and the WMD to the anchor of the correct class is maximized. This forces the important words for a given class to be close to the anchor of their class in the representation space and far from the anchors of the other classes. The words common to all classes, in contrast, end up at a relatively similar distance to every anchor, as they play no role in discrimination. Concretely, for the learned representation of a word $w$ and the anchor $A_c$, we define the distance $d(w, A_c) = \frac{1}{s}\sum_{l=1}^{s} \| \Phi_A(x_w) - a_{c,l} \|$, where $s$ is the number of support points and $a_{c,l}$ are the columns (support points) of the anchor $A_c$. Then we define the importance value of $w$ for class $c$ as $I_c(w) = \min_{c' \neq c} d(w, A_{c'}) - d(w, A_c)$. A larger $I_c(w)$ means that word $w$ is more important for class $c$. The basic idea behind the interpretation is that the learned anchor for each class can be understood as an "abstract document" for that class in the representation space. We believe that the overlap between different anchors captures what the classes have in common. For a given class, the non-overlapping parts (with other classes) can be viewed as the important features in the representation space for that class, and we show that the words close to the non-overlapping part in the representation space are indeed the important words for that class.
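The importance score can be sketched as follows, assuming the gap form just described: distance to the nearest other-class anchor minus distance to the own-class anchor, with the distance to an anchor taken as the mean distance over its support points. Names are illustrative.

```python
import numpy as np

def importance(word_vec, anchors, cls):
    """Importance of a word (given as its learned representation vector)
    for class `cls`. `anchors` maps class -> (m, s) array of s support
    points. Larger value = more discriminative for `cls`."""
    def avg_dist(A):
        return np.linalg.norm(A - word_vec[:, None], axis=0).mean()
    own = avg_dist(anchors[cls])
    nearest_other = min(avg_dist(A) for c, A in anchors.items() if c != cls)
    return nearest_other - own
```

Ranking a class's vocabulary by this score and keeping the top entries yields word lists like those shown for BBCsports below.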
In Figure 2, we show the t-SNE visualization of the learned anchors and the vocabulary. We see that the anchors corresponding to different classes have some overlap. More importantly, as shown in the right panel of Figure 2, the important words for each class generated by our method do not overlap and are relatively far from each other.
We show the top-30 words for each class from the BBCsports dataset in Figure 5. We checked the corpus frequency of these words, which supports their importance; for example, the large majority of the occurrences of the top-30 words generated for "Cricket" fall within documents of the class "Cricket".
Assessing written student work: This work concerns analyses by an instructor to track shifts in the quality of students’ writing due to curricular innovations. An overview of the dataset is shown in Table 3.
Human coders separated lab reports into higher and lower scores using an adapted version of the SOLO taxonomy (Biggs & Collis, 1982), which uses three features of reports to determine quality: claim complexity, scope of evidence, and consistency and closure (Authors, in prep).
For this dataset, since we have rather limited data to work with, we first combine the scores (1,2) as a low score, and scores (3,4) as a high score, yielding a binary classification task.
[Table 3 columns: Essays, Avg. length, Max length, Min length, Scores]
Performance and interpretation: Table 4 shows the accuracy of the three methods. Our method, which improves on WMD, is not quite as accurate as S-WMD; however, neither WMD nor S-WMD provides interpretability, a defining benefit of the work in this paper. The performance gap is due to the fact that our method requires learning more parameters, namely the anchors, compared to S-WMD. Since the training data is rather limited for the new dataset, our error rate is higher; this issue is not present on the public datasets. In Figure 4, we plot the top-30 words for reports with low and high scores. For comparison, we also list the top words generated by TF-IDF for lab reports with low and high scores separately in Figure 5.
[Table 4 columns: Method, WMD, S-WMD, Ours (triplet loss)]
Discussion of lab report results: The discriminatory words identified by our approach suggest a good fit with the qualitative dimensions (claim complexity, scope of evidence, and consistency and closure) used by human coders to make classifications. The words also suggest themes not directly coded for.
For example, differences in adjectives reflect differences in claim structure. The importance of adjectives such as positive, negative, and relative reflects the more complex claim structure in high-scoring reports. While low-scoring reports stated simple claims, high-scoring reports compared the relative influence of competing effects (i.e., positive and negative mutations).
Another hallmark of high-scoring reports is qualified or conditional claims that indicate context-specificity or uncertainty. The importance of adverbs such as predominantly, largely, and disproportionately in high-scoring reports reflects uncertainty, expressed in the form of probabilistic claims, that was common in these reports. These properties were not observed among the top words generated by TF-IDF.
The predominance of nouns and verbs that describe laboratory procedures (e.g., method, procedure, standardize) in low-scoring reports is an interesting difference not directly coded for by human coders. It is nevertheless consistent with the shift in the laboratory curriculum, from an emphasis on reporting procedures to interpreting and arguing about findings, that underlies the shift from low to high scores.
Overall these findings suggest that our method captures meaningful qualitative differences originally identified by qualitative researchers.
4 Codes and Implementation
Our code can be found at https://github.com/rjiang03/Interpretable-contrastive-word-mover-s-embedding.
This research is supported by NSF RAISE 1931978. Shuchin Aeron is also supported in part by NSF CCF:1553075, NSF ERC planning 1937057, and AFOSR FA9550-18-1-0465.