SynWMD: Syntax-aware Word Mover's Distance for Sentence Similarity Evaluation

by Chengwei Wei, et al.

Word Mover's Distance (WMD) computes the distance between words and models text similarity with the moving cost between words in two text sequences. Yet, it does not offer good performance in sentence similarity evaluation since it does not incorporate word importance and fails to take inherent contextual and structural information in a sentence into account. An improved WMD method using the syntactic parse tree, called Syntax-aware Word Mover's Distance (SynWMD), is proposed to address these two shortcomings in this work. First, a weighted graph is built upon the word co-occurrence statistics extracted from the syntactic parse trees of sentences. The importance of each word is inferred from graph connectivities. Second, the local syntactic parsing structure of words is considered in computing the distance between words. To demonstrate the effectiveness of the proposed SynWMD, we conduct experiments on 6 semantic textual similarity (STS) datasets and 4 sentence classification datasets. Experimental results show that SynWMD achieves state-of-the-art performance on STS tasks. It also outperforms other WMD-based methods on sentence classification tasks.


1 Introduction

Sentence similarity evaluation has a wide range of applications in natural language processing, such as semantic similarity computation, text generation evaluation (zhao2019moverscore; bert-score), and information retrieval (aliguliyev2009new; wang2015faq). Methods for sentence similarity evaluation can be categorized into two main classes: 1) sentence-embedding-based methods and 2) word-alignment-based methods. The former finds vector representations of sentences and calculates the similarity of two sentences by applying a distance measure such as the cosine or l2 distance. The latter operates at the word level and uses the alignment cost of corresponding words in two sentences as the sentence similarity measure.

As one of the word-alignment-based methods, Word Mover’s Distance (WMD) (kusner2015word) formulates text similarity evaluation as a minimum-cost flow problem. It finds the most efficient way to align the information between text sequences through a flow network defined by word-level similarities. By assigning flows to individual words, WMD computes text dissimilarity as the minimum cost of moving words’ flows from one sentence to another based on pre-trained word embeddings. WMD is interpretable because text dissimilarity is calculated from the distances between words in two text sequences.

However, a naive WMD method does not perform well on sentence similarity evaluation for several reasons. First, WMD assigns word flow based on words’ frequency in a sentence. This frequency-based word weighting scheme is weak in capturing word importance when considering the statistics of the whole corpus. Second, the distance between words solely depends on the embedding of isolated words without considering the contextual and structural information of input sentences. Since the meaning of a sentence depends on individual words as well as their interaction, simply considering the alignment between individual words is deficient in evaluating sentence similarity. In this work, we propose an enhanced WMD method called Syntax-aware Word Mover’s Distance (SynWMD). It exploits the structural information of sentences to improve the naive WMD for sentence similarity evaluation.

A syntactic parse tree represents a sentence using a tree structure. It encodes the syntax information of words and the structure information of a sentence. The dependency parse tree (see an example in Fig. 1) is one type of syntactic parse tree. Each node in the tree represents a word, and an edge represents the dependency relation of two connected words. Thus, words’ related contexts can be well captured by the structures of the dependency parse tree. For example, dog in Fig. 1 is one of the most related contexts of found, since it is the object of found. Such a relationship can be easily spotted. In contrast, skinny and fragile are not directly related to found because they are the modifiers of dog. They are far away from found in the dependency parse tree, although they are close to found in the sequential order. The dependency parse tree provides valuable information in semantic modeling and has been proven useful in various NLP applications, such as word embedding (levy2014dependency; WEI2022), semantic role labeling (strubell2018linguistically), machine translation (nguyen2020tree), and text similarity tasks (quan2019efficient; wang2020structural).

SynWMD incorporates the dependency parse tree technique in both word flow assignment and word distance modeling to improve the performance on sentence similarity evaluation. This work has the following three major contributions.

Figure 1: The structure of a dependency parse tree for an example sentence: “He found a skinny and fragile dog in his backyard.”

  • A new syntax-aware word flow calculation method is proposed. Words are first represented as a weighted graph based on the co-occurrence statistics obtained from dependency parse trees. Then, a PageRank-based algorithm is used to infer word importance.

  • The word distance model in WMD is enhanced by the context extracted from dependency parse trees. The contextual information of words and structural information of sentences are explicitly modeled as additional subtree embeddings.

  • We conduct extensive experiments on semantic textual similarity tasks and k-nearest neighbor sentence classification tasks to evaluate the effectiveness of the proposed SynWMD. The code for SynWMD is available at

The rest of the paper is organized as follows. Related work is reviewed in Sec. 2. SynWMD is proposed in Sec. 3. Experimental results are shown in Sec. 4. Finally, concluding remarks are given in Sec. 5.

2 Related Work

Recent studies on sentence similarity evaluation can be classified into two main categories: sentence-embedding-based methods and word-alignment-based methods. They are reviewed below.

2.1 Sentence Embedding

One way to assess sentence similarity is through sentence embedding. That is, a sentence is first encoded into a vector with an encoder. The similarity of two sentences is then inferred from the distance of their embedded vectors, where a simple distance metric such as the cosine or l2 distance can be used. Among sentence embedding methods, a simple and fast one is to pool word embeddings. Several weighting schemes (arora2017simple; wang2020sbert) were adopted by pooling-based methods for simple sentence embeddings. Yet, word-embedding-based pooling methods suffer from an anisotropy problem; namely, the associated embeddings are confined to a narrow cone region of the vector space (ethayarajh-2019-contextual). This limits the embeddings’ capability in sentence similarity evaluation. To address this limitation, post-processing techniques were proposed to alleviate the anisotropy problem. For example, principal components removal (mu2017all), BERT-flow (li2020sentence), and BERT-whitening (su2021whitening) can make sentence embeddings more uniformly distributed so as to enhance the performance on sentence similarity assessment. Recent methods (reimers2019sentence; gao2021simcse) fine-tune pre-trained models on labeled data or use self-supervised contrastive learning to achieve superior performance on sentence similarity tasks.

2.2 Word Alignment

Alignment-based methods measure the word matching degree for sentence similarity evaluation. WMD is a popular alignment-based method. Its extensions are widely used in text similarity tasks. For example, Sentence Mover’s Similarity (clark2019sentence) targets the similarity measure of long and multi-sentence text sequences. It uses both word embedding and sentence embedding to measure text similarity. Word Rotator’s Distance (yokoi2020word) shows that the norm of a word embedding encodes word importance while the angle between two word embeddings captures word similarity. Consequently, it assigns word flow based on the norm of the word embedding and computes the cosine distance for the similarity measure. Recursive Optimal Transport (wang2020structural) is a structure-aware WMD method. It uses a binary or a dependency parse tree to partition a sentence into substructures of multiple levels. Then, text similarity is recursively calculated by applying WMD to substructures at the same level. Yet, since there is no interaction between substructures at different levels, its capability for measuring sentence similarity can be limited.
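The two ingredients attributed to Word Rotator’s Distance above (flow from the embedding norm, distance from the angle) can be sketched as follows. This is a minimal illustration with made-up 2-d vectors; the function names are hypothetical and the sketch is not taken from the authors' or the WRD code.

```python
import math

def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))

def cosine_distance(u, v):
    """1 - cosine similarity; depends only on the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 - dot / (norm(u) * norm(v))

def norm_based_flow(embeddings):
    """Flow of each word, proportional to the norm of its embedding."""
    norms = [norm(v) for v in embeddings]
    total = sum(norms)
    return [n / total for n in norms]

# Toy embeddings (assumed values): a short vector for a function word,
# a long one for a content word.
toy = {"the": [0.1, 0.0], "dog": [3.0, 4.0]}
flow = norm_based_flow(list(toy.values()))
```

With these toy values, the content word receives almost all of the flow, while rescaling a vector leaves its cosine distance to other vectors unchanged.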

MoverScore (zhao2019moverscore) and BERTScore (bert-score) are two newly developed alignment-based methods using contextual word embeddings. Built upon the same concept as WMD, MoverScore uses the Inverse Document Frequency (IDF) to assign word flow so that less frequent words get higher flow weights. Furthermore, instead of adopting static word embedding, it uses contextual word embedding. This incorporates the word’s contextual information in the word embedding implicitly, which enables a more accurate distance measure between words. Unlike WMD, which considers the matching degree between a word in one sentence and all words in the other sentence, BERTScore uses greedy matching between words, where each word is only matched to its most similar word in the other sentence. Both MoverScore and BERTScore offer state-of-the-art performance on text generation evaluation.
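The two ideas in this paragraph, IDF-based flow and greedy matching, can be illustrated with a short sketch. The smoothed IDF formula and the function names below are assumptions made for the example; they are not the exact definitions used by MoverScore or BERTScore, and the vectors are toy values rather than contextual embeddings.

```python
import math

def cos_sim(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def idf_weights(corpus):
    """Smoothed IDF per word (assumed variant: log((N+1)/(df+1)) + 1)."""
    n_docs = len(corpus)
    df = {}
    for sentence in corpus:
        for word in set(sentence):
            df[word] = df.get(word, 0) + 1
    return {w: math.log((n_docs + 1) / (c + 1)) + 1.0 for w, c in df.items()}

def idf_flow(tokens, idf):
    """Normalize the IDF weights of the tokens so that the flow sums to 1."""
    raw = [idf[t] for t in tokens]
    total = sum(raw)
    return [r / total for r in raw]

def greedy_match_f1(emb_a, emb_b):
    """Each word is matched only to its most similar word on the other side."""
    recall = sum(max(cos_sim(a, b) for b in emb_b) for a in emb_a) / len(emb_a)
    precision = sum(max(cos_sim(a, b) for a in emb_a) for b in emb_b) / len(emb_b)
    return 2 * precision * recall / (precision + recall)

corpus = [["a", "dog"], ["a", "cat"], ["a", "bird"]]
idf = idf_weights(corpus)
flow = idf_flow(["a", "dog"], idf)  # the rarer word "dog" gets the larger share
```

The contrast with WMD is visible in `greedy_match_f1`: no flow is split across several target words, so no optimization problem has to be solved.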

3 Proposed SynWMD

We first briefly review WMD in this section. We then introduce two syntax-aware components in SynWMD to improve WMD. They are Syntax-aware Word Flow (SWF) and Syntax-aware Word Distance (SWD).

3.1 Word Mover’s Distance

Inspired by the Wasserstein metric, WMD measures text similarity using the optimal transport distance. It first utilizes pre-trained word embeddings to compute the distance between words in two text sequences. Let $x_i$ be the embedding of word $i$. WMD defines the distance between word $i$ and word $j$ as $c(i,j) = \|x_i - x_j\|_2$, which is also referred to as the transport cost from word $i$ to word $j$. Next, WMD assigns a flow to each word. The amount of flow $f_i$ of word $i$ is defined as the normalized word occurrence rate in a single text sequence:

$$f_i = \frac{\mathrm{count}(i)}{|f|}, \tag{1}$$

where $|f|$ is the total word count of a text sequence. Then, WMD measures the dissimilarity of two texts as the minimum cumulative transport cost of moving the flow of all words from one text sequence to the other. It can be formulated as the following constrained optimization problem:

$$\min_{T_{ij} \geq 0} \sum_{i \in I} \sum_{j \in J} T_{ij} \, c(i,j), \tag{2}$$

subject to:

$$\sum_{j \in J} T_{ij} = f_i, \quad \forall i \in I, \tag{3}$$

$$\sum_{i \in I} T_{ij} = f'_j, \quad \forall j \in J, \tag{4}$$

where $I$ and $J$ are the sets of words in text sequences $f$ and $f'$, respectively, and $T_{ij}$ represents the amount of flow that travels from word $i$ to word $j$, which is a variable to be determined. The above constrained optimization problem can be solved by linear programming.
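To make the optimization concrete, here is a minimal pure-Python sketch of WMD with the l2 transport cost and occurrence-rate flow. Instead of a general linear programming solver, it solves the transport problem with successive shortest paths on the bipartite residual graph, which is a swapped-in technique chosen so that the example has no dependencies. All function names and the toy embeddings are assumptions for illustration, and the solver is only intended for small inputs.

```python
import math
from collections import Counter

def l2_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def word_flow(tokens):
    """Normalized occurrence rate of each distinct word in one sequence."""
    counts = Counter(tokens)
    return {w: c / len(tokens) for w, c in counts.items()}

def transport_cost(supply, demand, cost, eps=1e-12):
    """Minimum cost of moving `supply` onto `demand` (both sum to 1)."""
    n, m = len(supply), len(demand)
    supply, demand = list(supply), list(demand)
    moved = [[0.0] * m for _ in range(n)]  # flow currently sent from i to j
    total, inf = 0.0, float("inf")
    while max(supply) > eps and max(demand) > eps:
        # Bellman-Ford over the residual graph: forward arcs i -> j cost
        # cost[i][j]; backward arcs j -> i exist where flow can be undone.
        dist_l = [0.0 if supply[i] > eps else inf for i in range(n)]
        dist_r = [inf] * m
        pred_l, pred_r = [None] * n, [None] * m
        for _ in range(n + m):
            changed = False
            for i in range(n):
                if dist_l[i] == inf:
                    continue
                for j in range(m):
                    if dist_l[i] + cost[i][j] < dist_r[j] - eps:
                        dist_r[j], pred_r[j], changed = dist_l[i] + cost[i][j], i, True
            for j in range(m):
                for i in range(n):
                    if moved[i][j] > eps and dist_r[j] - cost[i][j] < dist_l[i] - eps:
                        dist_l[i], pred_l[i], changed = dist_r[j] - cost[i][j], j, True
            if not changed:
                break
        # Cheapest reachable target word that still needs flow.
        end = min((j for j in range(m) if demand[j] > eps), key=lambda j: dist_r[j])
        path, amount, j = [], demand[end], end
        while True:  # walk the shortest path back to a word with spare supply
            i = pred_r[j]
            path.append((i, j, 1.0))
            if pred_l[i] is None:
                amount = min(amount, supply[i])
                break
            j = pred_l[i]
            path.append((i, j, -1.0))
            amount = min(amount, moved[i][j])
        for a, b, sign in path:
            moved[a][b] += sign * amount
            total += sign * amount * cost[a][b]
        supply[i] -= amount
        demand[end] -= amount
    return total

def wmd(tokens_a, tokens_b, emb):
    """WMD between two token lists; `emb` maps each word to its vector."""
    fa, fb = word_flow(tokens_a), word_flow(tokens_b)
    wa, wb = list(fa), list(fb)
    cost = [[l2_distance(emb[a], emb[b]) for b in wb] for a in wa]
    return transport_cost([fa[w] for w in wa], [fb[w] for w in wb], cost)

# Toy 2-d embeddings (assumed values).
emb = {"obama": [1.0, 0.0], "president": [0.9, 0.1],
       "speaks": [0.0, 1.0], "talks": [0.1, 0.9]}
d = wmd(["obama", "speaks"], ["president", "talks"], emb)
```

In practice a linear programming or dedicated optimal transport solver would replace `transport_cost`; the rest of the pipeline (flow assignment and the cost matrix) stays the same.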

WMD has two main shortcomings. First, important words in a sentence should be assigned a higher flow in Eq. (2). Yet, WMD assigns word flow according to word occurrence rate in a sentence. This simple scheme cannot capture word importance in a sentence accurately. Second, the transport cost between two words is solely decided by their word embeddings. Nevertheless, the meaning of a word may be affected by its context and the meaning of a sentence can be affected by the structure of word combinations. It is desired to develop more effective schemes. They are elaborated in Secs. 3.2 and 3.3.

3.2 Syntax-aware Word Flow

Important words in two sentences can largely decide their similarity. As given in Eq. (2), a word with a higher flow has a greater impact on WMD results. Thus, a more important word in a text sequence should be assigned a higher flow. We propose an enhanced word flow assignment scheme, called syntax-aware word flow (SWF). Simply speaking, we collect the co-occurrence counts of words in dependency parse trees from the whole dataset, and derive word importance from the co-occurrence statistics for flow assignment.

The computation of SWF is detailed below.

  1. Parse all sentences in a dataset and count a co-occurrence of two words whenever they appear in a parse tree within $n$ hops of each other. The co-occurrence count is further weighted by the distance between the two words in the parse tree; namely, it is divided by the number of hops between them.

  2. Build a weighted graph for the dataset, where each node corresponds to a word and the edge between two connected words has a weight equal to their co-occurrence count as computed in Step 1. Clearly, words with higher edge weights frequently co-occur with other words in the dataset. They carry less novelty and importance for the sentence. Based on this assumption, a word with a higher total edge weight should be assigned a lower word flow.

  3. Use the weighted PageRank algorithm (page1999pagerank) to aggregate all edge weights of a node, which gives a rough estimate of node importance, and assign the inverse of the PageRank value as its word flow.

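The last step can be written mathematically as

$$PR(i) = (1 - d) + d \sum_{j \in N(i)} \frac{w_{ij}}{\sum_{k \in N(j)} w_{jk}} \, PR(j), \qquad f_i \propto \frac{1}{PR(i) + a}, \tag{5}$$

where $w_{ij}$ is the edge weight between word $i$ and word $j$, $N(i)$ is the set of neighbors of word $i$ in the graph, $d$ is the PageRank damping factor, $PR(i)$ represents the PageRank value of word $i$, $f_i$ is the flow of word $i$, and $a$ is a parameter used to control the smoothness of word flow. In this way, SWF assigns a lower flow to a word that co-occurs with other words in the parse trees more frequently.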
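The three SWF steps can be sketched end to end in a few lines. Everything below is an illustrative assumption rather than the authors' implementation: trees are given as lists of head indices (-1 for the root) written by hand instead of being produced by a parser, the PageRank update uses the classic (1 - d) + d * sum form with d = 0.85, unseen words fall back to a PageRank value of 1, and the flow is normalized per sentence.

```python
from collections import defaultdict, deque

def tree_hops(heads):
    """Hop distance between every pair of words in one dependency tree."""
    adj = [[] for _ in heads]
    for child, head in enumerate(heads):
        if head >= 0:
            adj[child].append(head)
            adj[head].append(child)
    dist = []
    for start in range(len(heads)):
        seen, queue = {start: 0}, deque([start])
        while queue:  # breadth-first search over the undirected tree
            u = queue.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen[v] = seen[u] + 1
                    queue.append(v)
        dist.append(seen)
    return dist

def cooccurrence_graph(parsed, max_hops=3):
    """Steps 1-2: edge weight = sum over co-occurrences of 1 / hops."""
    w = defaultdict(lambda: defaultdict(float))
    for tokens, heads in parsed:
        dist = tree_hops(heads)
        for a in range(len(tokens)):
            for b, hops in dist[a].items():
                if a < b and hops <= max_hops:
                    w[tokens[a]][tokens[b]] += 1.0 / hops
                    w[tokens[b]][tokens[a]] += 1.0 / hops
    return w

def weighted_pagerank(w, d=0.85, iters=50):
    """Step 3: power iteration of weighted PageRank on the undirected graph."""
    nodes = list(w)
    strength = {u: sum(w[u].values()) for u in nodes}
    pr = {u: 1.0 for u in nodes}
    for _ in range(iters):
        pr = {u: (1 - d) + d * sum(w[v][u] / strength[v] * pr[v] for v in w[u])
              for u in nodes}
    return pr

def syntax_aware_flow(tokens, pr, a=1.0):
    """Inverse PageRank with smoothing term a, normalized within the sentence."""
    raw = [1.0 / (pr.get(t, 1.0) + a) for t in tokens]
    total = sum(raw)
    return [r / total for r in raw]

# Two toy sentences with hand-written head indices (assumed parses).
parsed = [(["he", "found", "a", "dog"], [1, -1, 3, 1]),
          (["she", "found", "a", "cat"], [1, -1, 3, 1])]
pr = weighted_pagerank(cooccurrence_graph(parsed))
flow = syntax_aware_flow(parsed[0][0], pr)
```

In this toy corpus "found" co-occurs with every other word, so it ends up with the highest PageRank value and the lowest flow, while the rarer "dog" receives more flow.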

Figure 2: Illustration of the shortcoming of distance calculation in WMD and an improved solution, SWD. In SWD, the distance between words is decided by both word embeddings and subtree embeddings
Figure 3: 1&2-hop subtrees with open as the parent node in the dependency parsing tree for sentence: “I am not sure if you can open a bank account in France”. Stopwords are ignored.

3.3 Syntax-aware Word Distance

In WMD, the distance between words is called the transport cost. It is computed using static word embedding without considering any contextual information of words or structural information of the sentence. An example given in Fig. 2 illustrates this shortcoming. The two identical words bank in the two sentences have a distance of zero in WMD. However, they do not have the same meaning because their contexts differ. Contextual word embeddings, such as those from BERT-based models, can alleviate this problem. Here, we propose a syntax-aware word distance (SWD). Simply speaking, SWD uses the dependency parse tree to find the most related context of words and incorporates this information in the word distance calculation. SWD can be applied to both static and contextual word embeddings to improve the performance of WMD.

The procedure of SWD is detailed below.

  1. Generate candidate subtrees from a dependency parse tree. For each word in a tree, we treat it as a parent node, and use it and its connections to its children within $m$ hops to form the subtrees, where $m$ is a hyper-parameter. With children from different hops, the context information from multiple levels can be extracted by the subtrees. Fig. 3 shows 1-hop and 2-hop subtrees where the word “open” is the parent node.

  2. Collect all subtrees that contain the word as its context. Then obtain the subtree embedding as the weighted average of all its word embeddings.

  3. Incorporate the context of the target word in the word distance calculation. As shown in Fig. 2, besides distances between word embeddings, distances between subtree embeddings are also considered.

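For the last step, the syntax-aware word distance between words $i$ and $j$ can be computed by

$$c_{syn}(i,j) = c(i,j) + a \, \frac{\sum_{s_i \in S_i} \sum_{s_j \in S_j} c(s_i, s_j)}{|S_i| \, |S_j|}, \tag{6}$$

where $S_i$ and $S_j$ are the sets of subtrees that contain word $i$ and $j$, respectively, and $a$ is a parameter controlling the amount of contextual and structural information to be incorporated. The cosine distance,

$$c(i,j) = 1 - \frac{x_i^{\top} x_j}{\|x_i\| \, \|x_j\|}, \tag{7}$$

where $x_i$ denotes the embedding of word or subtree $i$, is used to measure the distance between word embeddings and subtree embeddings.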
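The SWD procedure can be sketched as below. The sketch makes several assumptions that the text leaves open: trees are hand-written lists of head indices (-1 for the root), a subtree embedding is the plain (unweighted) mean of its word embeddings, the context term is averaged over all subtree pairs, and the values of `a` and `max_hops` are placeholders. Function names are hypothetical.

```python
import math

def subtrees(heads, max_hops=3):
    """Parent plus all descendants within h hops, for h = 1..max_hops."""
    kids = [[] for _ in heads]
    for child, head in enumerate(heads):
        if head >= 0:
            kids[head].append(child)
    found = set()
    for parent in range(len(heads)):
        members, frontier = {parent}, [parent]
        for _ in range(max_hops):
            frontier = [c for u in frontier for c in kids[u]]
            if not frontier:
                break
            members |= set(frontier)
            found.add(frozenset(members))
    return found

def mean_vec(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def cos_dist(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def context_embeddings(idx, embs, heads, max_hops):
    """Embeddings of all subtrees that contain the word at position idx."""
    return [mean_vec([embs[k] for k in st])
            for st in subtrees(heads, max_hops) if idx in st]

def swd(i, j, embs_a, heads_a, embs_b, heads_b, a=0.2, max_hops=3):
    """Word distance plus a times the mean distance between subtree contexts."""
    base = cos_dist(embs_a[i], embs_b[j])
    ctx_i = context_embeddings(i, embs_a, heads_a, max_hops)
    ctx_j = context_embeddings(j, embs_b, heads_b, max_hops)
    if not ctx_i or not ctx_j:  # a one-word sentence has no subtree
        return base
    ctx = sum(cos_dist(u, v) for u in ctx_i for v in ctx_j)
    return base + a * ctx / (len(ctx_i) * len(ctx_j))

# "open bank account" vs "river bank flooded": same static vector for "bank".
embs_a = [[1.0, 0.0], [1.0, 1.0], [1.0, 0.2]]   # open, bank, account
heads_a = [-1, 2, 0]
embs_b = [[0.0, 1.0], [1.0, 1.0], [0.2, 1.0]]   # river, bank, flooded
heads_b = [1, 2, -1]
plain = cos_dist(embs_a[1], embs_b[1])
syntax_aware = swd(1, 1, embs_a, heads_a, embs_b, heads_b)
```

With these toy vectors the plain distance between the two occurrences of "bank" is zero, while the syntax-aware distance is positive because the surrounding subtrees differ.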

4 Experiments

We evaluate SynWMD on six semantic textual similarity datasets and four sentence classification datasets with the k-nearest neighbor classifier. In all experiments, the sentence parse trees are obtained using the Stanza package (qi2020stanza).

4.1 Semantic Textual Similarity

Embeddings         Methods                  STS12   STS13   STS14   STS15   STS16   STS-B   Avg.
Sent. Emb.         word2vec (avg.)          55.28   70.09   65.53   75.29   68.73   65.17   66.68
Sent. Emb.         BERT (first-last avg.)   39.70   59.38   49.67   66.03   66.19   53.87   55.81
Sent. Emb.         BERT-flow                58.40   67.10   60.85   75.16   71.22   68.66   66.90
Sent. Emb.         BERT-whitening           57.83   66.90   60.90   75.08   71.31   68.24   66.71
Sent. Emb.         CT-BERT                  61.63   76.80   68.47   77.50   76.48   74.31   72.53
Sent. Emb.         SimCSE-BERT              68.40   82.41   74.38   80.91   78.56   76.85   76.92
word2vec           WMD (l2)                 58.12   58.78   60.16   71.52   66.56   63.65   63.13
word2vec           WMD (cos)                54.82   61.42   60.71   72.67   66.90   62.49   63.30
word2vec           WRD                      56.72   64.74   63.44   75.99   69.06   65.26   65.87
word2vec           WMD+IDF (l2)             60.36   67.01   63.06   72.41   68.30   65.91   66.18
word2vec           WMD+IDF (cos)            57.64   69.25   63.81   73.50   68.83   65.51   66.61
word2vec           SynWMD+SWF               60.24   74.71   66.10   75.94   69.54   66.24   68.80
word2vec           SynWMD+SWF+SWD           60.30   75.43   66.22   75.95   70.06   66.65   69.10
BERT (first-last)  WMD (l2)                 53.03   58.96   56.79   72.11   63.56   61.01   60.91
BERT (first-last)  WMD (cos)                55.38   58.51   56.93   72.81   64.47   61.80   61.65
BERT (first-last)  WRD                      49.93   63.48   57.63   72.04   64.11   61.92   61.52
BERT (first-last)  BERTScore                61.32   73.00   66.52   78.47   73.43   71.77   70.75
BERT (first-last)  WMD+IDF (l2)             61.19   68.67   63.72   76.87   70.16   69.56   68.36
BERT (first-last)  WMD+IDF (cos)            63.79   69.25   64.51   77.58   71.70   70.69   69.59
BERT (first-last)  SynWMD+SWF               66.34   77.08   68.96   79.13   74.05   74.06   73.27
BERT (first-last)  SynWMD+SWF+SWD           66.74   79.38   69.76   78.77   75.52   74.81   74.16
SimCSE-BERT        WMD (l2)                 64.66   79.72   73.12   81.25   76.69   77.53   75.50
SimCSE-BERT        WMD (cos)                65.43   80.00   73.35   81.21   76.97   77.18   75.69
SimCSE-BERT        WRD                      64.80   80.97   74.13   80.71   76.68   78.47   75.96
SimCSE-BERT        BERTScore                66.31   82.87   75.66   83.14   79.16   80.03   77.86
SimCSE-BERT        WMD+IDF (l2)             67.35   81.36   74.56   82.29   78.12   79.18   77.14
SimCSE-BERT        WMD+IDF (cos)            68.47   81.76   74.98   82.30   78.29   78.98   77.46
SimCSE-BERT        SynWMD+SWF               70.20   83.36   76.17   83.16   78.81   80.02   78.62
SimCSE-BERT        SynWMD+SWF+SWD           70.27   83.44   76.19   83.21   78.83   79.98   78.66
Table 1: Comparison of Spearman’s correlation (ρ × 100) of unsupervised methods on STS datasets. For WMD and WMD+IDF, the first row of each pair uses the l2 distance and the second uses the cosine distance. Results of [] are taken from (gao2021simcse).


Datasets. Semantic similarity tasks are widely used to evaluate sentence similarity assessment methods. Here, we consider six semantic textual similarity (STS) datasets, including STS2012-16 and STS-Benchmark. Sentence pairs in STS are extracted from a wide range of domains such as news, web forums, and image captions. They are annotated with similarity scores by humans. Each STS dataset contains several subsets on different topics. Since data from different topics are likely to be mixed in real-world scenarios, we apply the “all setting” evaluation for STS2012-16 as mentioned in (gao2021simcse). The similarity scores of the sentence pairs in different subsets are concatenated and the overall Spearman’s correlation is reported.
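The “all setting” evaluation amounts to pooling the sentence pairs of all subsets before computing a single rank correlation. A minimal sketch, with hypothetical function names and made-up scores, is:

```python
import math

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            result[order[k]] = shared
        pos = end + 1
    return result

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman(x, y):
    """Spearman's correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def all_setting(subsets):
    """subsets: list of (predicted, gold) score lists, one pair per subset."""
    predicted = [p for preds, _ in subsets for p in preds]
    gold = [g for _, golds in subsets for g in golds]
    return spearman(predicted, gold)

# Two toy subsets with made-up predicted similarities and gold scores.
score = all_setting([([0.1, 0.9], [1.0, 4.0]), ([0.5, 0.3], [3.0, 2.0])])
```

Pooling before ranking means that a method is also penalized when its scores are not comparable across subsets, which a per-subset average would hide.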

Benchmarking Methods. We choose the following benchmarking methods.

  • Sentence-embedding-based methods: 1) average methods: the average of word2vec embedding (mikolov2013efficient) and the average of the first and last layers of BERT (devlin2018bert), 2) post-processing methods: BERT-flow (li2020sentence) and BERT-whitening (su2021whitening), 3) contrastive learning methods: CT-BERT (carlsson2020semantic) and SimCSE-BERT (gao2021simcse).

  • Word-alignment-based methods: original WMD, Word Rotator’s Distance (yokoi2020word), BERTScore (bert-score), and WMD with IDF weights as the baselines. For an exhaustive comparison, WMD results with both the l2 and cosine distances are reported. Both non-contextual and contextual word embeddings are chosen as backbone models. They are word2vec, pre-trained BERT, and SimCSE.

Experimental Setup. In the implementation of SWF, we count word co-occurrence in dependency parse trees if they are within 3 hops, and set the smooth term . In the implementation of SWD, we create subtrees with child nodes of no more than 3 hops. We set for word2vec and SimCSE word embeddings and for BERT word embedding.

Isotropic Processing. It is observed in (ethayarajh-2019-contextual) that the average cosine similarity between randomly sampled words with pre-trained contextual word embedding is high. This implies that pre-trained contextual word embeddings are confined to a cone space, and they are not isotropic. It is also shown in (gao2021simcse; evalrank_2022) that the anisotropic property of pre-trained contextual word embedding severely hurts its performance in sentence similarity tasks. Post-processing methods (e.g., whitening) make BERT embeddings less anisotropic in the embedding space and improve the performance in semantic similarity tasks. Thus, when BERT embeddings are used in the experiments, we perform the whitening operation on the word level for all word-alignment-based methods.

Results. We compare a wide range of methods on 6 STS datasets and report their Spearman’s correlation results in Table 1. For all word embeddings, WMD and WMD+IDF achieve a higher average correlation with the cosine distance than with the l2 distance. This indicates that the cosine distance is a better metric for STS datasets. Furthermore, the word flow assignment with the IDF weight can enhance the performance of WMD. As to our proposed method, SynWMD+SWF outperforms other alignment-based methods by a substantial margin. SynWMD+SWF+SWD can improve SynWMD’s performance even more. This is especially obvious for word2vec and BERT embeddings. Under the same word embedding, SynWMD always outperforms sentence embedding methods, including the state-of-the-art unsupervised method, SimCSE.

Figure 4: The average Spearman’s correlation curves on STS datasets as a function of hop sizes or window sizes for three word embeddings: (a) word2vec, (b) BERT and (c) SimCSE-BERT.

4.2 Further Analysis on STS

We perform an ablation study on SynWMD to offer a better understanding of its working principle in this subsection.

Effect of the hop size. We study the sensitivity to the hop size, $n$, used in collecting word co-occurrence statistics in SWF. The blue curves in Fig. 4 show the average performance trend with different hop sizes on STS datasets. We see from the figure that SynWMD+SWF with a larger hop size gives better performance. This is because more relationships between words are incorporated for a larger $n$. However, the performance gain saturates as $n$ grows further.

Difference between parse tree and linear context in SWF. SWF collects co-occurrence statistics from dependency parse trees, which are well-organized structures of sentences. One can also use a sliding window to collect co-occurrence statistics from linear contexts and build the weighted graph. The differences between these two schemes are shown in Fig. 4. We see from the figure that the dependency parse tree in SWF outperforms the sliding window. This is because the dependency parse tree provides a powerful syntactic structure in collecting word co-occurrence statistics.
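The difference between the two ways of collecting co-occurrence pairs can be seen on the example sentence of Fig. 1. The head indices below are a hand-written dependency parse assumed for illustration (not parser output), and the function names are hypothetical.

```python
from collections import deque

def window_pairs(tokens, window=3):
    """Index pairs whose distance in the word sequence is at most `window`."""
    return {(a, b) for a in range(len(tokens))
            for b in range(a + 1, min(len(tokens), a + window + 1))}

def tree_pairs(heads, max_hops=3):
    """Index pairs whose distance in the dependency tree is at most `max_hops`."""
    adj = [[] for _ in heads]
    for child, head in enumerate(heads):
        if head >= 0:
            adj[child].append(head)
            adj[head].append(child)
    pairs = set()
    for start in range(len(heads)):
        seen, queue = {start: 0}, deque([start])
        while queue:  # breadth-first search, cut off at max_hops
            u = queue.popleft()
            if seen[u] == max_hops:
                continue
            for v in adj[u]:
                if v not in seen:
                    seen[v] = seen[u] + 1
                    queue.append(v)
        pairs |= {(start, v) for v in seen if v > start}
    return pairs

tokens = ["He", "found", "a", "skinny", "and", "fragile", "dog",
          "in", "his", "backyard"]
# Assumed parse: "found" is the root; "dog" and "backyard" attach to "found".
heads = [1, -1, 6, 6, 5, 3, 1, 9, 9, 1]
found, dog = tokens.index("found"), tokens.index("dog")
in_window = (found, dog) in window_pairs(tokens, window=3)
in_tree = (found, dog) in tree_pairs(heads, max_hops=3)
```

Under this parse, found and dog are one hop apart in the tree but five positions apart in the sequence, so a window of size 3 misses the pair while the tree-based scheme captures it.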

Difference between subtree and n-grams in SWD. When collecting contextual information from words’ neighbors in SWD, one can replace subtrees with n-grams in Eq. (6). We study the difference between subtrees and n-grams with BERT embeddings. We generate 2-grams and 3-grams so that the number of n-grams has the same order of magnitude as that of subtrees in our experiments. All other experimental settings remain the same. The performance difference between subtrees and n-grams is shown in Table 2. We can see from the table that subtrees perform better than n-grams on five of the six datasets (all except STS15) and on average.
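For reference, the n-gram contexts used in this comparison can be enumerated as below. This is a small sketch with a hypothetical function name; it only lists the index spans that contain a given word, which would then take the place of subtrees when averaging embeddings.

```python
def ngrams_containing(idx, n_tokens, sizes=(2, 3)):
    """All n-grams (as tuples of token indices) that contain position idx."""
    spans = []
    for n in sizes:
        first = max(0, idx - n + 1)       # earliest start that still covers idx
        last = min(idx, n_tokens - n)     # latest start that fits in the sentence
        for start in range(first, last + 1):
            spans.append(tuple(range(start, start + n)))
    return spans

# Contexts of the second word in a four-word sentence.
contexts = ngrams_containing(1, 4)
```

Unlike subtrees, these spans depend only on word order, so a word's context can include neighbors that are syntactically unrelated to it.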

Datasets   n-gram   subtree
STS12      66.37    66.64
STS13      78.08    79.40
STS14      69.36    69.75
STS15      79.29    78.82
STS16      74.41    75.51
STS-B      74.67    74.93
Avg.       73.70    74.18
Table 2: Comparison of Spearman’s correlation (ρ × 100) of using the subtree and the n-gram in SWD.

Effect of using different backbone word embedding models. As shown in Table 1, there is more performance improvement by applying SWD to word2vec and BERT word embeddings but less to SimCSE. One possible explanation for this phenomenon is that SimCSE word embeddings in a sentence tend to be similar. When words from a sentence have close embeddings, words and their subtrees are expected to have close embeddings. As a result, word distances keep a similar ratio even with the subtree distance, and results of the constrained optimization problem, i.e., Eq. (2), do not change much. To verify this point, we calculate the averaged pairwise cosine distance of words in a sentence with three word embeddings and show the results in Fig. 5. We see that BERT has the largest average distance while SimCSE has the smallest. This is consistent with their performance improvement.
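The statistic reported in Fig. 5 is straightforward to compute. A minimal sketch for one sentence, with a hypothetical function name and made-up vectors, is:

```python
import math
from itertools import combinations

def cos_dist(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def avg_pairwise_cos_dist(word_embeddings):
    """Mean cosine distance over all unordered pairs of words in a sentence."""
    pairs = list(combinations(word_embeddings, 2))
    return sum(cos_dist(u, v) for u, v in pairs) / len(pairs)

spread_out = avg_pairwise_cos_dist([[1.0, 0.0], [0.0, 1.0]])       # orthogonal words
collapsed = avg_pairwise_cos_dist([[1.0, 0.1], [1.0, 0.2], [1.0, 0.15]])
```

A value close to zero, as in the second toy sentence, indicates that the words of a sentence have nearly parallel embeddings, in which case adding subtree distances changes the relative transport costs very little.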

Figure 5: The averaged pairwise cosine distance of words in a sentence of STS datasets with three embeddings.

4.3 Sentence Classification

To further validate the effectiveness of SynWMD, we perform experiments on 4 sentence classification datasets.

Datasets. We choose MR, CR, SST2, and SST5 sentence classification datasets from SentEval (conneau2018senteval). They are elaborated on below.

  • MR: a movie review dataset where sentences are labeled as positive or negative sentiment polarities.

  • CR: a product review dataset with positive and negative sentence reviews.

  • SST2 & SST5: Both are movie review datasets. SST2 has two labels (positive and negative), while SST5 has five labels (very positive, positive, neutral, negative, and very negative).

Note that WMD-based methods are not suitable for the k-nearest neighbor sentence classification with a large number of samples. For SST2 and SST5 datasets, only test samples are used and cross-validation is performed. They are denoted by SST2-test and SST5-test, respectively.

Benchmarking Methods. We compare SynWMD with 3 other WMD-based methods. They are: 1) original WMD, 2) Word Rotator’s Distance and 3) WMD with IDF weight. Results of WMD using the l2 and cosine distances are reported. Word2vec is used as the backbone word embedding model in this experiment.

Experimental Setup. We set and . All other settings remain the same as those in the STS tasks. The value for the nearest neighbor classifier is chosen from 1 to 30 to achieve the best performance.
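The classification protocol can be sketched with a precomputed distance matrix, so that any of the compared distances can be plugged in. The sketch below uses leave-one-out evaluation as a simple stand-in for cross-validation, which is an assumption, and toy 1-d distances in place of WMD-based ones; function names are hypothetical.

```python
from collections import Counter

def knn_predict(dist_row, labels, k, exclude=None):
    """Majority label among the k nearest samples (ties go to the nearest)."""
    order = sorted((d, idx) for idx, d in enumerate(dist_row) if idx != exclude)
    votes = Counter(labels[idx] for _, idx in order[:k])
    return votes.most_common(1)[0][0]

def loo_accuracy(dist, labels, k):
    """Leave-one-out accuracy: classify each sample from all the others."""
    hits = sum(knn_predict(dist[i], labels, k, exclude=i) == labels[i]
               for i in range(len(labels)))
    return hits / len(labels)

def best_k(dist, labels, k_values=range(1, 31)):
    """Smallest k in 1..30 with the highest leave-one-out accuracy."""
    return max(k_values, key=lambda k: loo_accuracy(dist, labels, k))

# Toy data: points on a line stand in for sentences, |x - y| for the distance.
points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
labels = ["neg", "neg", "neg", "pos", "pos", "pos"]
dist = [[abs(p - q) for q in points] for p in points]
k = best_k(dist, labels)
```

Because every test sentence must be compared with every stored sentence, the cost of filling `dist` grows quadratically with the number of samples, which is why large datasets are impractical for WMD-based k-nearest neighbor classification.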

Results. Experimental results are shown in Table 3. For the original WMD, the cosine distance is better than the l2 distance on all four sentence classification datasets. SynWMD outperforms other WMD-based methods by a large margin in the k-nearest neighbor sentence classification.

Methods         MR      CR      SST2-test   SST5-test
WMD (l2)        67.68   73.69   66.12       31.81
WMD (cos)       70.89   75.18   69.36       34.76
WRD             73.17   75.74   72.99       35.25
WMD+IDF (l2)    70.17   75.44   74.41       31.49
WMD+IDF (cos)   74.18   76.88   74.41       37.96
SynWMD          76.44   77.08   77.43       38.28
Table 3: Comparison of test accuracy for the k-nearest neighbor sentence classification. SynWMD achieves the best result on each dataset.

5 Conclusion and Future Work

An improved Word Mover’s Distance (WMD) using the dependency parse tree, called SynWMD, was proposed in this work. SynWMD consists of two novel modules: syntax-aware word flow (SWF) and syntax-aware word distance (SWD). SWF examines the co-occurrence relationship between words in parse trees and assigns lower flow to words that co-occur with other words frequently. SWF is used to capture word importance in a sentence using the statistics of the whole corpus. SWD computes both the distance between individual words and the distance between their contexts collected by parse trees so that a word’s contextual information and a sentence’s structural information are incorporated. SynWMD achieves state-of-the-art performance in STS tasks and outperforms other WMD-based methods in sentence classification tasks. As future extensions, we may extend the idea beyond the sentence level by leveraging sentence embeddings and incorporate it in sentence-embedding-based methods to lower the computational cost.