Interpretable contrastive word mover's embedding

by Ruijie Jiang, et al.
Tufts University

This paper shows that a popular approach to the supervised embedding of documents for classification, namely, contrastive Word Mover's Embedding, can be significantly enhanced by adding interpretability. This interpretability is achieved by incorporating a clustering promoting mechanism into the contrastive loss. On several public datasets, we show that our method improves significantly upon existing baselines while providing interpretation to the clusters via identifying a set of keywords that are the most representative of a particular class. Our approach was motivated in part by the need to develop Natural Language Processing (NLP) methods for the novel problem of assessing student work for scientific writing and thinking - a problem that is central to the area of (educational) Learning Sciences (LS). In this context, we show that our approach leads to a meaningful assessment of the student work related to lab reports from a biology class and can help LS researchers gain insights into student understanding and assess evidence of scientific thought processes.






1 Introduction

Modern computational methods for Natural Language Processing (NLP) rely on embeddings into metric spaces, such as the Euclidean space and, more recently, non-linear spaces such as the Wasserstein space, to achieve state-of-the-art performance on various tasks. In these embeddings, semantic differences and similarities between words and documents correspond to distances in the representation space. For embedding into Euclidean spaces, a large body of work is based on Word2vec (mikolov2013distributed), where each word is represented as a vector in the Euclidean space. From these word embeddings one can further compute document and sentence embeddings using various models ramos2003using, arora2016simple, wang2020sbert, le2014distributed, kiros2015skip, logeswaran2018efficient for higher-level NLP tasks.

Instead of embedding and comparing documents in the Euclidean space, the Word Mover's Distance (WMD) kusner2015word was proposed to measure the similarity between documents in the Wasserstein space COT_book, representing the documents as (empirical) probability distributions. In huang2016supervised, WMD is used for supervised learning, and more recently in yurochkin2019hierarchical for multi-scale representation.

To understand how these models work, considerable effort has gone into making such embeddings interpretable. In AroraLLMR16 the authors proposed a linear algebraic structure to explain the polysemy of words. Recent works attempted to explain the meaning of each embedding dimension, such as sparse word embeddings faruqui-etal-2015-sparse; panigrahi-etal-2019-word2sense and the POLAR framework Binny2020. To make WMD embeddings interpretable, Carin2018 proposed an unsupervised topic model in the representation space.

In this work our focus is on enabling interpretable supervised WMD embeddings of the documents. Below we summarize the main contributions in this direction.

Summary of main contributions - A new approach for contrastive representation learning is proposed, enforcing a clustering-promoting mechanism via a set of anchors that are themselves learned from the data. In contrast to previous approaches huang2016supervised; kusner2015word, this allows for interpretability, i.e., it allows one to determine which words are important for a particular class. Furthermore, compared to K-Nearest Neighbour (KNN) classification, classification using the learned anchors is faster, since each test document is compared against one anchor per class rather than against the entire training set, and our method can be generalized to any other supervised contrastive learning setting. Results on public data sets, as well as on a novel data set evaluating written scientific work by students, show the superiority and utility of our method.

2 Problem formulation and approach

We are given $N$ documents, each belonging to one of $C$ classes. Each document $D$ with label $y$ is represented by two sets, $\{w_1, \dots, w_m\}$ and $\{c_1, \dots, c_m\}$, where $m$ is the number of unique words in the document, $w_i$ is the $i$-th word, and $c_i$ is the number of times $w_i$ appears in $D$. We suppress the dependency of $m$ on $D$ for the sake of brevity.

Using pre-trained word embeddings from GloVe pennington2014glove, $D$ is represented as a tuple $(X, a)$, where $X = [x_1, \dots, x_m] \in \mathbb{R}^{d \times m}$ and $x_i$ is the embedding for the word $w_i$. The $i$-th entry of $a$ is $a_i = c_i / \sum_{j=1}^{m} c_j$, the normalized histogram over the words in the document.
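As a concrete illustration, the $(X, a)$ representation can be built in a few lines of Python; the toy 2-d embedding vectors below are hypothetical stand-ins for GloVe vectors, and the function name is ours:

```python
import numpy as np

def document_measure(tokens, embed):
    """Represent a document as (X, a): an embedding matrix over the
    document's unique words and the normalized word-frequency histogram.
    `embed` maps a word to its d-dimensional vector (e.g. GloVe)."""
    words, counts = np.unique(tokens, return_counts=True)
    X = np.stack([embed[w] for w in words], axis=1)  # d x m
    a = counts / counts.sum()                        # sums to 1
    return words, X, a

# Toy 2-d "embeddings" standing in for GloVe vectors (hypothetical values).
embed = {"cell": np.array([1.0, 0.0]),
         "gene": np.array([0.0, 1.0]),
         "the":  np.array([0.5, 0.5])}

words, X, a = document_measure(["the", "cell", "the", "gene"], embed)
# "the" appears twice out of four tokens, so it carries mass 0.5
```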

Problem statement: Given labeled data $\{(D_j, y_j)\}_{j=1}^{N}$, we seek to learn a representation $\Phi$ such that a Nearest Neighbor (NN)-type classifier in the representation space accurately classifies the documents.

Using NN in the representation space requires a notion of similarity or distance between documents. We use the WMD kusner2015word, defined as follows. Given the representations of two documents, $(X, a)$ and $(X', a')$, seen as empirical measures $\mu = \sum_{i=1}^{m} a_i \delta_{x_i}$ and $\mu' = \sum_{j=1}^{m'} a'_j \delta_{x'_j}$, the WMD between $\mu$ and $\mu'$ is defined as kusner2015word,
$$\mathrm{WMD}(\mu, \mu') = \min_{T \geq 0} \sum_{i,j} T_{ij} M_{ij},$$
such that $T \mathbf{1} = a$ and $T^{\top} \mathbf{1} = a'$. Here $M_{ij} = \|x_i - x'_j\|^2$ is referred to as the ground cost.
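This optimization can be solved with entropic regularization via Sinkhorn iterations, which is how the paper later computes WMD for end-to-end training. A minimal numpy sketch; the regularization strength `eps` and iteration count are illustrative choices of ours, not values from the paper:

```python
import numpy as np

def sinkhorn_wmd(X1, a1, X2, a2, eps=0.05, n_iter=200):
    """Entropic-regularized Word Mover's Distance between two documents
    represented as (X, a) pairs, computed with the Sinkhorn algorithm.
    Ground cost: squared Euclidean distance between word embeddings."""
    # M[i, j] = ||x_i - x'_j||^2  (the ground cost)
    M = ((X1[:, :, None] - X2[:, None, :]) ** 2).sum(axis=0)
    K = np.exp(-M / eps)
    u = np.ones_like(a1)
    for _ in range(n_iter):              # Sinkhorn fixed-point updates
        v = a2 / (K.T @ u)
        u = a1 / (K @ v)
    T = u[:, None] * K * v[None, :]      # transport plan: T1 = a1, T'1 = a2
    return (T * M).sum()

# Two toy documents in a 2-d embedding space (columns are word embeddings)
X1 = np.array([[0.0, 1.0],
               [0.0, 0.0]])
a1 = np.array([0.5, 0.5])
same = sinkhorn_wmd(X1, a1, X1, a1)      # ~0 for identical documents
X2 = X1 + np.array([[1.0], [0.0]])      # every word shifted by distance 1
shifted = sinkhorn_wmd(X1, a1, X2, a1)   # ~1: each unit of mass moves cost ~1
```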

Our key idea is to learn a set of anchors $\{A_c\}_{c=1}^{C}$, one per class, each with $k$ support points in the representation space, i.e., $A_c \in \mathbb{R}^{d \times k}$ for some $k$. Anchors offer two advantages. First, they provide for direct and simple NN classification. Second, using anchors we can learn words that have discriminatory power for particular classes, thereby enabling interpretability.

2.1 Proposed approach

The representation class for $\Phi$ is defined by a transformation matrix $A$ and is applied to a document column-wise,
$$\Phi(X) = A \odot X, \qquad (1)$$
where $\odot$ denotes the element-wise product, obtaining a representation $\Phi(X) \in \mathbb{R}^{d \times m}$. Here, $A$ is the transformation matrix.

Figure 1: Schematic illustration of contrastive learning. Different colors denote different classes. Top: supervised contrastive learning. Bottom: interpretable contrastive learning with learnable anchors.

Given this set-up, our approach is to learn the anchors via contrastive learning (see Figure 1) chen2020simple; khosla2020supervised in the representation space. In contrastive learning, one defines triplets $(\Phi(X), A^{+}, A^{-})$, where $\Phi(X)$ is the representation of a document with label $y$, and $A^{+} = A_y$ and $A^{-} = A_c$, $c \neq y$, are anchors of the same class and of a different class, respectively. Assuming a uniform measure on the support of the anchor points, we can use $\mathrm{WMD}(\Phi(X), A^{+})$ and $\mathrm{WMD}(\Phi(X), A^{-})$ as the similarity, i.e., contrastive measure. We contrast each document with all the anchors; thus, we have $C - 1$ triplets for each document. To allow for end-to-end training, we use entropic regularization and the Sinkhorn algorithm to compute the Wasserstein distance cuturi2013sinkhorn.

Given $N$ documents, in order to train the model parameters $A$ and $\{A_c\}_{c=1}^{C}$, we minimize the following triplet loss function hermans2017defense:
$$\mathcal{L}_{\text{triplet}} = \sum_{j=1}^{N} \sum_{c \neq y_j} \left[ \mathrm{WMD}(\Phi(X_j), A_{y_j}) - \mathrm{WMD}(\Phi(X_j), A_c) + \alpha \right]_{+},$$
where the constant $\alpha$ is a margin hyperparameter. We also employ the InfoNCE loss oord2018representation to train the model:
$$\mathcal{L}_{\text{InfoNCE}} = -\sum_{j=1}^{N} \log \frac{\exp\left(-\mathrm{WMD}(\Phi(X_j), A_{y_j}) / \tau\right)}{\sum_{c=1}^{C} \exp\left(-\mathrm{WMD}(\Phi(X_j), A_c) / \tau\right)},$$
where $\tau$ denotes a temperature parameter.
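Given a matrix of document-to-anchor WMDs, both losses reduce to a few lines. The numpy sketch below assumes the distances are precomputed and that there is one anchor per class; function names and toy values are ours:

```python
import numpy as np

def triplet_loss(dists, y, margin=1.0):
    """Triplet loss over anchor distances.
    dists: (N, C) array, dists[j, c] = WMD between document j's
    representation and the anchor of class c; y: labels.
    One triplet per (document, negative class) pair."""
    N, C = dists.shape
    pos = dists[np.arange(N), y][:, None]     # distance to own-class anchor
    hinge = np.maximum(pos - dists + margin, 0.0)
    hinge[np.arange(N), y] = 0.0              # skip the c == y "triplet"
    return hinge.sum() / (N * (C - 1))

def infonce_loss(dists, y, tau=0.1):
    """InfoNCE loss treating exp(-WMD/tau) as the similarity score."""
    logits = -dists / tau
    logZ = np.log(np.exp(logits).sum(axis=1))
    return -(logits[np.arange(len(y)), y] - logZ).mean()

dists = np.array([[0.1, 2.0, 3.0],
                  [2.5, 0.2, 1.9]])           # toy WMDs to 3 anchors
y = np.array([0, 1])
```

With these toy distances every document is already much closer to its own anchor, so the triplet loss is zero at margin 1 and the InfoNCE loss is near zero.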

NN classification: Once the model is trained, in order to classify a test document, we first embed it in the representation space using the learned parameters via equation (1), and then compute the WMD distances between the representation and the anchors $\{A_c\}_{c=1}^{C}$. The class represented by the anchor with the minimum distance is declared as the label.
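The resulting decision rule is simply an argmin over the per-class anchor distances; a minimal sketch with hypothetical distance values:

```python
import numpy as np

def classify(dists):
    """Nearest-anchor classification: dists[j, c] holds the WMD between
    test document j's representation and the anchor of class c; the
    predicted label is the class of the closest anchor."""
    return dists.argmin(axis=1)

dists = np.array([[0.4, 1.2, 0.9],
                  [2.0, 0.3, 1.1]])   # toy test-document-to-anchor WMDs
pred = classify(dists)                # -> [0, 1]
```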

Interpretability: To show how one can use the anchors to discover important discriminative words, we refer the reader to section 3.1 where we illustrate it with a concrete example.

3 Evaluation

For all datasets, the ground cost in WMD is set to the squared Euclidean distance, and we use the Sinkhorn algorithm for computing the regularized WMD COT_book. The various hyper-parameters are set using cross-validation. For the triplet loss, the margin is set to , and for the InfoNCE loss the temperature parameter is set to . We used the Adam kingma2014adam optimizer with a learning rate of 0.1. We employed the pre-trained GloVe pennington2014glove word vectors with dimension  as the representation of words. To avoid overfitting, we employed regularization with parameter .

Dataset    #Train  #Test  BoW dim  Avg words  #Classes
BBCSPORT   517     220    13243    117        5
TWITTER    2176    932    6344     9.9        3
RECIPE     3059    1311   5703     48.5       15
OHSUMED    3999    5153   31789    59.2       10
CLASSIC    4965    2128   24277    38.6       4
REUTERS    5485    2189   22425    37.0       8
AMAZON     5600    2400   42063    45.0       4
20NEWS     11293   7528   29671    72         20
Table 1: Public dataset characteristics.

Public Datasets: Information about the public datasets is shown in Table 1. In Table 2 we show the results from WMD kusner2015word, supervised-WMD (S-WMD) huang2016supervised, and our method. Our method outperforms WMD and S-WMD on seven of the eight datasets.

Table 2: Classification error rates for our methods (triplet loss and InfoNCE loss) and other baselines.

3.1 Interpretation

Figure 2 shows how our model leads to interpretability. Under the contrastive loss, the difference between $\mathrm{WMD}(\Phi(X), A_c)$, $c \neq y$, and $\mathrm{WMD}(\Phi(X), A_y)$ will be maximized. This forces the important words for a given class to be close to the anchor of its corresponding class in the representation space and further from the anchors of the other classes. Also, the words common to all classes will have a relatively similar distance to all of the anchors, as they play no role in discrimination. Concretely, for the learned representation $\Phi(x_i)$ of word $w_i$ (using equation (1)) and the anchor $A_c$, we define the distance $d(w_i, A_c) = \frac{1}{k} \sum_{j=1}^{k} \| \Phi(x_i) - a_{c,j} \|^2$, where $k$ is the number of support points and $a_{c,j}$ are the columns (support points) of the anchor $A_c$. Then we define the importance value $I_c(w_i)$ of $w_i$ for class $c$ by comparing $d(w_i, A_c)$ against the distances to the other anchors; a larger $I_c(w_i)$ means that word $w_i$ is more important for class $c$.

The basic idea behind the interpretation is that the learned anchor for each class can be understood as an "abstract document" for that class in the representation space. We believe that the overlap between different anchors captures what is common to different classes. For a given class, the non-overlapping parts (with other classes) can be viewed as the important features in the representation space for this class, and we show that the words close to the non-overlapping part in the representation space are indeed the important words for that class.
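This scoring step can be sketched in numpy. Note that the specific combination of distances used in `word_importance` below (average distance to the other classes' anchors minus distance to the class's own anchor) is an illustrative assumption, as are all names and toy values:

```python
import numpy as np

def word_importance(word_vecs, anchors, cls):
    """Importance of each word for class `cls`: d(w, A_c) is the mean
    squared distance from the word's learned representation to the
    support points of anchor A_c; a word is scored as important for a
    class when it is much closer to that class's anchor than to the
    others (the exact combination of distances is an assumption).
    word_vecs: (d, V) learned word representations
    anchors:   list of (d, k) support-point matrices, one per class"""
    V = word_vecs.shape[1]
    C = len(anchors)
    d = np.empty((C, V))
    for c, A in enumerate(anchors):  # d(w_i, A_c), mean over support points
        d[c] = ((word_vecs[:, None, :] - A[:, :, None]) ** 2).sum(axis=0).mean(axis=0)
    others = (d.sum(axis=0) - d[cls]) / (C - 1)
    return others - d[cls]           # larger = more discriminative for `cls`

# Toy example: word 0 sits on class 0's anchor, word 1 is between classes.
word_vecs = np.array([[0.0, 0.5],
                      [0.0, 0.5]])
anchors = [np.zeros((2, 3)), np.ones((2, 3))]
scores = word_importance(word_vecs, anchors, cls=0)
```

In this toy setup, word 0 gets a high score for class 0, while word 1, equidistant from both anchors, scores zero, matching the intuition that class-neutral words carry no discriminative weight.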

Figure 2: t-SNE visualization of the vocabulary and the support points of the class representations. The left panel shows the full vocabulary and the class representations; the right panel shows the most important words and the class representations.

In Figure 2, we show the t-SNE visualization of the learned anchors and the vocabulary. We see that the anchors corresponding to different classes overlap to some extent. More importantly, as shown in the right panel of Figure 2, the important words for each class generated by our method do not overlap and are relatively far from each other.

We show the top-30 words for each class from the BBCSports dataset in Figure 3. We checked the frequency of these words, and the counts support their importance; for example, the top-30 words we generated for "Cricket" appear a total of  times in the dataset, and  of these occurrences belong to the class "Cricket".

Figure 3: The Top-30 words of each class on the BBCSports dataset.

Assessing written student work: This work concerns analyses by an instructor to track shifts in the quality of students’ writing due to curricular innovations. An overview of the dataset is shown in Table 3.

Human coders separated lab reports into higher and lower scores using an adapted version of the SOLO taxonomy (Biggs & Collis, 1982), which uses three features of reports to determine quality: claim complexity, scope of evidence, and consistency and closure (Authors, in prep).

For this dataset, since we have rather limited data to work with, we first combine the scores (1,2) as a low score, and scores (3,4) as a high score, yielding a binary classification task.

Essays  Avg length  Max length  Min length  Scores
146     512         1449        95          1-4
Table 3: Overview of the lab report dataset.

Performance and interpretation: Table 4 shows the error rates of the three methods. Our method, which improves on WMD, is not quite as accurate as S-WMD; however, neither of these baselines provides interpretability, a defining benefit of the work in this paper. The performance gap is due to the fact that our method requires learning more parameters, namely the anchors, compared to S-WMD. Since the training data is rather limited for the new dataset, our error rate is higher; this issue is not present in the public datasets. In Figure 4, we plot the top-30 words for reports with low and high scores. For comparison, we also list the top words generated by TF-IDF for lab reports with low and high scores separately in Figure 5.

Table 4: Error rate on the lab report data for WMD, S-WMD, and our method (triplet loss).
Figure 4: The Top-30 words of each class for the lab reports dataset generated by our method.
Figure 5: The Top-10 words of each class for the lab reports dataset generated by TF-IDF.

Discussion of lab report results: The discriminatory words identified by our approach suggest a good fit with the qualitative differences, namely claim complexity, scope of evidence, and consistency and closure, used by human coders to make classifications. The words also suggest themes not directly coded for.

For example, differences in adjectives reflect differences in claim structure. The importance of adjectives such as positive, negative, and relative reflects the more complex claim structure in high-scoring reports. While low-scoring reports stated simple claims, high-scoring reports compared the relative influence of competing effects (i.e., positive and negative mutations).

Another hallmark of high-scoring reports is qualified or conditional claims that indicate context-specificity or uncertainty. The importance of adverbs such as predominantly, largely, and disproportionately in high-scoring reports reflects uncertainty, expressed as probabilistic claims, that was common in these reports. These properties were not observed in the top words generated by TF-IDF.

The predominance of nouns and verbs that describe laboratory procedures (e.g., method, procedure, standardize) in low-scoring reports is an interesting difference not directly coded for by human coders. It is nevertheless consistent with the shift in the laboratory curriculum, from an emphasis on reporting procedures to interpreting and arguing about findings, that underlies the shift from low to high scores.

Overall these findings suggest that our method captures meaningful qualitative differences originally identified by qualitative researchers.

4 Codes and Implementation


This research is supported by NSF RAISE 1931978. Shuchin Aeron is also supported in part by NSF CCF:1553075, NSF ERC planning 1937057, and AFOSR FA9550-18-1-0465.