
What Makes Sentences Semantically Related: A Textual Relatedness Dataset and Empirical Study

The degree of semantic relatedness (or, closeness in meaning) of two units of language has long been considered fundamental to understanding meaning. Automatically determining relatedness has many applications such as question answering and summarization. However, prior NLP work has largely focused on semantic similarity (a subset of relatedness), because of a lack of relatedness datasets. Here for the first time, we introduce a dataset of semantic relatedness for sentence pairs. This dataset, STR-2021, has 5,500 English sentence pairs manually annotated for semantic relatedness using a comparative annotation framework. We show that the resulting scores have high reliability (repeat annotation correlation of 0.84). We use the dataset to explore a number of questions on what makes two sentences more semantically related. We also evaluate a suite of sentence representation methods on their ability to place pairs that are more related closer to each other in vector space.


1 Introduction

The semantic relatedness of two linguistic units—words, phrases, sentences, etc.—is a measure of how close they are in terms of their meaning Mohammad (2008); Mohammad and Hirst (2012). Though our intrinsic measure of semantic relatedness is dependent on many factors such as the context of assessment, age, and socio-economic status (Harispe et al., 2015), it is argued that a consensus can usually be reached for many pairs (Harispe et al., 2015). Consider the two sentence pairs in Table 1. Most speakers of English will agree that the sentences in the first pair are closer in meaning to one another than those in the second. When judging the semantic relatedness between two sentences, humans generally look for commonalities in meaning: whether they are on the same topic, expressing the same view, originate from the same time period, have similar style, etc.

Pair 1: a. There was a lemon tree next to the house.
b. The boy enjoyed reading under the lemon tree.
Pair 2: a. There was a lemon tree next to the house.
b. The boy was an excellent football player.
Table 1: Most people will agree that the sentences in Pair 1 are more related than the sentences in Pair 2.

The semantic relatedness of two units of language has long been considered fundamental to understanding meaning Harispe et al. (2015); Miller and Charles (1991); given how difficult it has been to define meaning, a natural approach to get at the meaning of a unit is to determine how close it is to other units. Semantic relatedness is also central to textual coherence and narrative structure. A large number of sentences in a document are semantically related to each other, and this is a crucial component of meaningful communication. Automatically determining semantic relatedness has many applications such as question answering, plagiarism detection, text generation (say, in personal assistants and chat bots), and summarization.

However, prior NLP work has focused on semantic similarity (a small subset of semantic relatedness), largely because of a dearth of datasets. The few relatedness datasets that exist are for word pairs Rubenstein and Goodenough (1965); Radinsky et al. (2011) or phrase pairs Asaadi et al. (2019). Further, most existing datasets were annotated, one item at a time, using coarse rating labels such as integer values between 1 and 5 representing coarse degrees of closeness. It is well documented that such approaches suffer from inter- and intra-annotator inconsistency, scale region bias, and issues arising due to the fixed granularity Presser and Schuman (1996); Baumgartner and Steenkamp (2001). Further, the notions of related and unrelated have fuzzy boundaries. Different people may have different intuitions of where such a boundary exists, and different applications may have different requirements of where such a boundary is optimal.

In this paper, we present the first manually annotated dataset of sentence–sentence semantic relatedness. It includes fine-grained scores of relatedness from 0 (least related) to 1 (most related) for 5,500 English sentence pairs. The sentences are taken from diverse sources and thus also have diverse sentence structures, varying amounts of lexical overlap, and varying formality.

The relatedness scores were obtained using a comparative annotation schema. In comparative annotations, two (or more) items are presented together and the annotator has to determine which is greater with respect to the metric of interest. Since annotators are making relative judgments, the limitations discussed earlier for rating scales are greatly mitigated. Importantly, such annotations do not rely on arbitrary boundaries between arbitrary categories such as “strongly related” and “somewhat related”. For this work specifically, we make use of Best–Worst Scaling (BWS) Louviere and Woodworth (1991), a comparative annotation method, which has been shown to produce reliable scores with fewer annotations in other NLP tasks Kiritchenko and Mohammad (2017).

We use the relatedness dataset to explore several research questions, including:

  1. To what extent do speakers of English agree on the relatedness of pairs of sentences? (§4)

  2. What makes two sentences more related? (§5)

  3. How well do existing approaches of sentence representation capture semantic relatedness (by placing related sentence pairs closer to each other in vector space)? (§6)

We refer to our dataset as STR-2021, and the task of predicting relatedness between sentences as the Semantic Textual Relatedness (STR) task. Data, data statement, ethics statement, annotation questionnaires, and code are made freely available at https://github.com/Priya22/semantic-textual-relatedness. We hope that this work will spur further research on understanding sentence–sentence relatedness, methods of sentence representation, measures of semantic relatedness, and their applications.

2 Background and Related Work

2.1 Relatedness and Similarity

Closeness of meaning can be of two kinds: semantic relatedness and semantic similarity. Two terms are considered semantically similar if there is a synonymy, hyponymy (hypernymy), or troponymy relation between them (examples include doctor–physician and mammal–elephant). Two terms are considered to be semantically related if there is any lexical semantic relation at all between them. Thus all similar pairs are also related, but not all related pairs are similar. For example, surgeon–scalpel and tree–shade are related, but not similar.

There are several word-pair datasets capturing similarity and relatedness Rubenstein and Goodenough (1965); Finkelstein et al. (2001); Miller and Charles (1991); Radinsky et al. (2011). Asaadi et al. (2019) created a fine-grained relatedness dataset that included unigrams and bigrams. They also used comparative annotations. Our work closely follows theirs in how we use Best–Worst Scaling (BWS) to obtain relatedness scores, but for sentence pairs instead of term pairs.

Analogous to term pairs, two sentences are considered semantically similar when they have a paraphrasal or entailment relation, whereas relatedness accounts for all of the commonalities that can exist between two sentences. For example, the sentences in Table 1 Pair 1 are highly related, but they are not paraphrases or entailing.

Several datasets for sentence pair similarity have been created, including: STS Agirre et al. (2012, 2013, 2014, 2015, 2016), MRPC Dolan and Brockett (2005), and LiSent Li et al. (2006). All of them ask annotators to provide coarse labels for sentence pairs corresponding to the degree of semantic similarity (usually an integer between 1 and 5). However, even with such coarse labeling, the categorical definitions often make annotation difficult: e.g., the STS 2012–2016 questionnaires ask annotators to distinguish between 2: not equivalent but share some details and 1: not equivalent, but are on the same topic, which is often not straightforward. In contrast, it is often easier to compare pairs of sentences and determine which is more related. (Note also that even though STS was meant to determine semantic similarity, the descriptions of categories 1 and 2 incorporate aspects of semantic relatedness, thereby muddying the waters w.r.t. the phenomenon being annotated.) SICK Marelli et al. (2014) had a coarse ordinal labeling scheme with paraphrases on one end, entailment in the middle, and contradictory sentence pairs on the other end. While this dataset was called a semantic relatedness dataset, as evident from the labeling schema, it is more in line with a similarity or entailment annotation. Conneau and Kiela (2018) and Chandrasekaran and Mago (2021) categorize it as a ‘semantic similarity’ dataset.

For our annotations, we avoid fuzzy ill-defined levels of relatedness. We rely instead on the intuitions of fluent English speakers to judge relative rankings of sentence pairs by relatedness.

2.2 Comparative Annotations

The simplest form of comparative annotations is paired comparisons Thurstone (1927); David (1963). Here, annotators are presented with pairs of examples and asked to choose which item is greater with respect to the metric of interest (semantic relatedness, sentiment, etc.). These choices can then be used to generate an ordinal ranking of items and a real-valued score for each item. While paired comparisons, as a methodology, do not suffer from the drawbacks mentioned previously, they require a significant number of annotations for reliable labels (on the order of N², where N is the number of items).

Best–Worst Scaling (BWS) is a comparative annotation schema that builds on pair-wise comparisons and does not require as many labels. Annotators are presented with n items at a time (for our work: n = 4, and an item is a pair of sentences). They are instructed to choose the best (i.e., most related) and worst (i.e., least related) item. The annotation of each 4-tuple provides five pair-wise inequalities. For example, if A is marked as most related and D as least related, then we know that A > B, A > C, A > D, B > D, and C > D. From all the annotations (and corresponding inequalities) we can calculate real-valued scores (and thus an ordinal ranking of items) using a simple counting mechanism Orme (2009); Flynn and Marley (2014): the fraction of times an item was chosen as the best (i.e., most related) minus the fraction of times the item was chosen as the worst (i.e., least related). Given N items, reliable scores are obtainable from about 2N 4-tuples Louviere and Woodworth (1991); Kiritchenko and Mohammad (2017).
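To make the counting mechanism concrete, here is a minimal Python sketch (the response format and function names are illustrative assumptions, not the authors' code); it also includes the linear 0–1 rescaling described later in §3.2.2:

```python
from collections import Counter

def bws_scores(annotations):
    """Best-Worst Scaling scores via simple counting.

    `annotations` is an iterable of (items, best, worst) records, where
    `items` is the 4-tuple of sentence pairs shown together, and `best` /
    `worst` are the members the annotator marked as most / least related.
    """
    best, worst, shown = Counter(), Counter(), Counter()
    for items, b, w in annotations:
        shown.update(items)
        best[b] += 1
        worst[w] += 1
    # %best minus %worst; each score lies in [-1, 1].
    return {item: (best[item] - worst[item]) / shown[item] for item in shown}

def to_unit_interval(scores):
    """Linearly rescale the scores to a 0-1 range of increasing relatedness."""
    lo, hi = min(scores.values()), max(scores.values())
    return {item: (s - lo) / (hi - lo) for item, s in scores.items()}
```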

3 Creating STR-2021

Dataset creation included several steps: determining the sentence pairs to be annotated, designing the annotation questionnaire, crowdsourcing the annotation, and finally aggregating the information to obtain relatedness scores.

3.1 Data Sources

Like some previous work on semantic similarity, we chose to construct our dataset by sampling sentences from many sources to capture a wide variety of text in terms of sentence structure, formality, and grammaticality. Pairs of sentences were created from the source sentences in a number of ways, as described below. The sources are:

  1. Formality Rao and Tetreault (2018): Pairs of sentences having the same meaning but differing in formality (one formal, one informal).

  2. Goodreads Wan and McAuley (2018): Book reviews from the Goodreads website. (We only accessed sentences from the ‘Fantasy and Paranormal’ genre, since it contained the most reviews per book, and is thus potentially rich in related sentence pairs.)

  3. ParaNMT Wieting and Gimpel (2018): Paraphrases from a machine translation system.

  4. SNLI Bowman et al. (2015): Pairs of premises and hypotheses, created from image captions, for natural language inference.

  5. STS Cer et al. (2017): Pairs of sentences with semantic similarity scores. (Integer label responses, 0 to 5, from multiple annotators were averaged to obtain the similarity scores.)

  6. Stance Mohammad et al. (2016): Tweets labelled for both sentiment (positive, negative, neutral) and stance (for, against, neither) towards targets (e.g., Donald Trump, Feminism).

  7. Wikipedia Text Simplification Dataset Horn et al. (2014): Pairs of Wikipedia sentences and their simplified forms.

From each source, we sampled sentences that were between 5 and 25 words long.

Since randomly sampling sentence pairings would result in mostly unrelated sentences, we selected sentence pairs with varying amounts of lexical overlap. (Further implementation and data source details are presented in Appendix Section A.) This also allowed us to systematically study the impact of lexical overlap on semantic relatedness. For the paraphrase datasets (Formality, ParaNMT, and Wikipedia), we obtained sentence pairs in two ways: by directly taking the paraphrase pairs (indicated by the suffix _pp), and by randomly pairing sentences from two different paraphrase pairs (indicated by the suffix _r). The paraphrase pairs were selected at random from the source dataset, whereas the lexical overlap strategy was applied in the creation of the random pairs. From STS, we randomly sampled 50 sentence pairs having similarity scores between (0, 1], 50 pairs having scores between (1, 2], and so on. Table 2 summarizes key details of the sentence pairs in STR-2021. Further details about the source datasets and sampling are presented in the Appendix.

Types of Pairs     Key Attributes                      # pairs
1. Formality       paraphrases, style                  1000
   Formality_pp    paraphrases, differ in style        300
   Formality_r     random pairs                        700
2. Goodreads       reviews, informal                   1000
3. ParaNMT         automatic paraphrases               750
   ParaNMT_pp      automatic paraphrases               450
   ParaNMT_r       random pairs                        300
4. SNLI            captions of images                  750
5. STS             have similarity scores              250
6. Stance          tweet pairs with same hashtag,      750
                   less grammatical
7. Wikipedia       formal                              1000
   Wiki_pp         paraphrases, formal                 500
   Wiki_r          random pairs, formal                500
ALL                                                    5500
Table 2: Summary of sentence pair types in STR-2021.

3.2 Annotating For Semantic Relatedness

From the list of 5,500 sentence pairs, we generated 11,000 unique 4-tuples (each 4-tuple consists of 4 distinct sentence pairs) such that each sentence pair occurs in around eight 4-tuples. (The tuples were generated using the BWS scripts provided by Kiritchenko and Mohammad (2017): http://saifmohammad.com/WebPages/BestWorst.html.)

The task, as presented to annotators, borrows heavily from the bigram relatedness task of Asaadi et al. (2019). Annotators were asked to judge the closeness in meaning of sentence pairs. In our framing of the task, we did not use detailed or technical definitions; rather, we provided brief and easy-to-follow instructions, gave examples, and encouraged annotators to rely on their intuitions of the English language to judge the relative closeness in meaning of sentence pairs. The full questionnaire is included in the supplementary material.

3.2.1 Crowdsourcing Annotations

We used Amazon Mechanical Turk for obtaining annotations. (This project was approved by the first author’s Institutional Research Ethics Board; Protocol #40736.) Each 4-tuple (also referred to as a question) in our AMT task consists of four sentence pairs. Annotators are asked to choose the (a) most related and (b) least related sentence pairs from among these four options. Each question is annotated by two MTurk workers. (We ran initial pilot studies with six annotators per question, and found that the annotation reliability scores were comparable for 6, 4, and 2 annotators per 4-tuple.)

As part of quality control, the task was open only to fluent speakers of English and to AMT workers with an approval rate higher than 98%. Further, we inserted “gold standard” questions at regular intervals in the task. These questions were manually annotated by all the authors and had high agreement scores. If an annotator gets a gold question wrong, they are immediately notified and shown the correct answer. This has several benefits, including keeping the annotator alert and clearing up any misunderstandings about the task. Those who scored less than 70% on the gold questions were stopped from answering further questions and were paid for their work; all their responses were discarded.

3.2.2 Annotation Aggregation

Figure 1: Histogram of STR-2021 relatedness scores.

We aggregate information from the various responses using the counting procedure discussed in §2.2. Since relatedness is a unipolar scale, the resulting relatedness score was linearly transformed to fit within a 0–1 scale of increasing relatedness.

Figure 1 presents a histogram of relatedness scores for STR-2021. Observe that each of the subsets covers a wide range of relatedness scores. Note that the lexical overlap sampling strategy has resulted in a wide spread of relatedness scores. Note also that supposed paraphrases are spread across much of the right half of the relatedness scale.

# Sentence Pairs # Tuples # Annotations Per Tuple # Annotations # Annotators SHR
5,500 11,000 8 21,936 389 0.84
Table 3: Annotation statistics. SHR = split-half reliability (as measured by Spearman correlation).

4 Reliability of Annotations

For annotations producing real-valued scores, a commonly used measure of quality and reliability is split-half reliability (SHR) Cronbach (1951); Kuder and Richardson (1937). SHR is a measure of the degree to which repeating the annotations would result in similar relative rankings of the items.

To measure SHR, annotations for each 4-tuple are split into two bins. The annotations for each bin are used to produce two different independent relatedness scores. Next, the Spearman correlation between the two sets of scores is calculated—a measure of the closeness of the two rankings. If the annotations are reliable then there should be a high correlation. This process is repeated 1000 times and the correlation scores are averaged. Table 3 shows the result. SHR of 0.84 indicates high annotation reliability. This score is also similar to SHR scores obtained in prior work on ranking tweets by emotions (between 0.82 and 0.89) Mohammad and Kiritchenko (2018).
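A small sketch of this procedure is shown below (the data structures and the `score_fn` hook are assumptions; `score_fn` can be the counting function sketched in §2.2):

```python
import random
from statistics import mean
from scipy.stats import spearmanr

def split_half_reliability(tuple_annotations, score_fn, n_repeats=1000, seed=0):
    """Estimate split-half reliability (SHR) of comparative annotations.

    `tuple_annotations` maps each 4-tuple id to the list of annotator
    responses it received; `score_fn` turns a flat list of responses into an
    {item: score} dict.  Each tuple's responses are split into two halves,
    each half is scored independently, and the Spearman correlation between
    the two sets of scores is averaged over repeated random splits.
    """
    rng = random.Random(seed)
    correlations = []
    for _ in range(n_repeats):
        half_a, half_b = [], []
        for responses in tuple_annotations.values():
            shuffled = list(responses)
            rng.shuffle(shuffled)
            mid = len(shuffled) // 2
            half_a.extend(shuffled[:mid])
            half_b.extend(shuffled[mid:])
        scores_a, scores_b = score_fn(half_a), score_fn(half_b)
        common = sorted(set(scores_a) & set(scores_b))
        rho, _ = spearmanr([scores_a[i] for i in common],
                           [scores_b[i] for i in common])
        correlations.append(rho)
    return mean(correlations)
```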

We also conducted experiments to assess fine-grained rankings of common sentence pairs as per our relatedness scores and as per STS’s similarity scores. For each of the sets of 50 sentence pairs taken from STS (with scores between (0–1], (1–2], etc.), we calculated the Spearman correlation between the rankings by similarity and the rankings by relatedness. We found that the correlations are only 0.25 (weak) and 0.19 (very weak) for the bins of (1,2] and (3,4], respectively, and only about 0.49 (moderate) for the bins of (2,3] and (4,5]. The (0,1] bin produces a correlation of 0.67 (moderate). Overall, this shows that the fine-grained rankings of items in the STS dataset by similarity differ considerably from the fine-grained rankings of items in the STR dataset by relatedness.

5 What Makes Sentences More Semantically Related?

The availability of a dataset with human notions of semantic relatedness allows one to explore fundamental aspects of meaning: for example, what makes two sentences more related? In this section, we examine some basic questions. On average, to what extent is the semantic relatedness of a sentence pair impacted by the presence of:

  • identical words (lexical overlap)? (Q1)

  • related words? (Q2)

  • related words of the same part of speech? (Q3)

  • related subjects, related objects? (Q4)

To explore these questions, we first computed relevant measures for Q1 through Q4 (lexical overlap, term relatedness, etc.) for each sentence pair in our dataset. We then calculated the correlations of these scores with the gold relatedness scores.

Lexical Overlap. A simple measure of lexical overlap between two sentences X and Y is the Dice coefficient (the number of unique unigrams occurring in both sentences, adjusted by the sentence lengths):

Dice(X, Y) = 2 |U(X) ∩ U(Y)| / (|U(X)| + |U(Y)|)     (1)

where U(S) denotes the set of unique unigrams in sentence S.
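A minimal sketch of this measure, assuming a plain lowercased whitespace tokenization (the exact tokenizer is not specified above):

```python
def dice_overlap(sentence_x, sentence_y):
    """Dice coefficient over the unique unigrams of two sentences.

    Tokenization here is a lowercased whitespace split; the exact
    tokenization used for the paper's measurements is an assumption.
    """
    x = set(sentence_x.lower().split())
    y = set(sentence_y.lower().split())
    return 2 * len(x & y) / (len(x) + len(y))

# Pair 1 from Table 1: moderate unigram overlap.
print(dice_overlap("There was a lemon tree next to the house",
                   "The boy enjoyed reading under the lemon tree"))
```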

Related Words: We averaged the embeddings for all the tokens in a sentence and computed the cosine similarity between the averaged embeddings for the two sentences in a pair. This roughly captures the relatedness between the terms across the two sentences. (Other ways to estimate relatedness between sets of words across two sentences may also be used.) Token embeddings were taken from Google’s publicly released Word2Vec embeddings trained on the Google News corpus Mikolov et al. (2013a) (https://code.google.com/archive/p/word2vec/).

Related Words with same POS: The same procedure was followed as for Q2, except that only the tokens of one part of speech (POS) at a time were considered. We determined the part of speech of the tokens using spaCy Honnibal et al. (2020). (We used the simple, coarse-grained UPOS part-of-speech tags: https://universaldependencies.org/docs/u/pos/.)

Related Subjects and Related Objects: For Q4, which explores the importance of different parts of sentence structure in determining semantic relatedness, we employ the same process as for Q2, except that for a given sentence we average only the tokens marked as subject (for the subject measure) or only the tokens marked as object (for the object measure). We use the packages spaCy Honnibal et al. (2020) and Subject Verb Object Extractor de Vocht (2020) to determine the tokens that serve as the subject and object of a sentence.
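The sketch below illustrates the shared recipe behind Q2–Q4: average pretrained word2vec vectors over (optionally filtered) tokens and take the cosine between the two sentence vectors. The gensim model name, the spaCy model, and the use of spaCy dependency labels (e.g., nsubj, dobj) in place of the Subject Verb Object Extractor are illustrative assumptions:

```python
import numpy as np
import gensim.downloader as api
import spacy

# Assumed setup: gensim's downloadable Google News word2vec vectors and
# spaCy's small English pipeline.
w2v = api.load("word2vec-google-news-300")
nlp = spacy.load("en_core_web_sm")

def sentence_vector(sentence, keep_pos=None, keep_dep=None):
    """Average word2vec vectors over a sentence's tokens.

    keep_pos / keep_dep optionally restrict the average to tokens with a
    given coarse POS tag (e.g., "NOUN") or dependency label (e.g., "nsubj"
    for subjects, "dobj" for objects).
    """
    vectors = []
    for token in nlp(sentence):
        if keep_pos and token.pos_ != keep_pos:
            continue
        if keep_dep and token.dep_ != keep_dep:
            continue
        if token.text in w2v:
            vectors.append(w2v[token.text])
    return np.mean(vectors, axis=0) if vectors else None

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = sentence_vector("There was a lemon tree next to the house.")
b = sentence_vector("The boy enjoyed reading under the lemon tree.")
print(cosine(a, b))  # Q2-style relatedness estimate for Pair 1 of Table 1
```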

Question                              Spearman   # pairs
Q1. Lexical overlap                   0.57       5500
Q2. Related words - All               0.61       5500
Q3a. Related words - per POS
     PROPN                            0.50       1907
     NOUN                             0.45       4746
     ADJ                              0.36       2236
     VERB                             0.31       3946
     PRON                             0.30       1800
     ADV                              0.28       1147
     AUX                              0.25       2069
     ADP                              0.23       2476
     DET                              0.20       3265
Q3b. Related words - per POS group
     Noun Group                       0.60       5478
     Verb Group                       0.32       4999
     ADJ Group                        0.29       4584
Q4. Related Subjects and Objects
     Subject                          0.29       1611
     Object                           0.43       1618
Table 4: Correlation between various features and the relatedness of sentence pairs. A rule of thumb for interpreting the numbers: 0–.19: very weak; .2–.39: weak; .4–.59: moderate; .6–.79: strong; .8–1: very strong.

RESULTS: Table 4 shows the results. Row Q1 shows that simple word overlap obtains a correlation of 0.57 (at the high end of moderate correlation, per the rule of thumb in the table caption). Figure 2 is a scatter plot where the x-axis is the word overlap score, the y-axis is the relatedness score, and each dot is a sentence pair. Observe that a number of pairs fall along the diagonal; however, there are also a large number of pairs on the top-left side of this diagonal. This suggests that even though STR-2021 has pairs where relatedness increases linearly with the amount of word overlap, there are also a number of pairs where a small amount of word overlap results in a substantial amount of relatedness. The relatively sparse bottom-right side of the plot indicates that it is rare for a pair to have substantial word overlap and yet very low relatedness.

On average, the occurrence of related words across a sentence pair correlates slightly more strongly with sentence-pair relatedness than lexical overlap does (row Q2: 0.61 vs. 0.57). Note that identical words are also considered related.

Figure 2: Scatter plot showing the relationship between lexical overlap and semantic relatedness of sentence pairs. Each dot in the plot is a sentence pair.

The Q3a rows in Table 4 show correlations for related tokens of a given part of speech. (Only those POS tags that occur in both sentences of a pair in more than 10% of the pairs, i.e., 550 pairs, are considered.) The rows are ordered from highest to lowest correlation. Observe that proper nouns (PROPN) and nouns have the highest correlations. It is somewhat surprising that related verbs do not contribute greatly to semantic relatedness; they have correlations similar to pronouns and adverbs, and markedly lower than adjectives and nouns. Not surprisingly, determiners (DET) are at the lower end of weak correlation.

The Q3b rows show correlations when we consider coarse categories comprising similar parts of speech. The Noun Group is composed of NOUN, PRON, and PROPN; the Verb Group of VERB and AUX; and the ADJ Group of ADJ, ADP, and ADV. The results show that the presence of related nouns in a sentence pair impacts semantic relatedness much more than any other POS group.

Since related nouns were found to be especially important, we also wanted to determine what impacts overall relatedness more: the presence of related nouns in the subject position or in the object position. Q4 rows show that, on average, related objects lead to markedly higher sentence-pair relatedness than related subjects.

Discussion: The above experiments show, for the first time, the impact of word overlap and the presence of related words on the semantic relatedness of sentence pairs. Figure 2 shows that even small amounts of word overlap can result in high semantic relatedness. The notable importance of related noun terms is likely because they indicate a common topic, person, or object being talked about in both sentences, making the two sentences related. The higher correlation for objects is perhaps an indication that the sentence pairs in our dataset are drawn from various documents and webpages, and so the relatedness is often driven by what is happening or being talked about. It is possible that, for sentence pairs drawn from the same narrative/post, the subject plays a greater role in the semantic relatedness of a sentence pair. Further studies with various other forms of data collection can shed more light on what makes two sentences more related, and under what conditions.

6 Evaluating Sentence Representation Models using STR-2021

Since STR-2021 captures a wide range of fine-grained relations that exist between sentences, it is a valuable asset in evaluating sentence representation/embedding models. Essentially, predicting semantic relatedness is treated as a regression task, where first, using various unsupervised and supervised approaches described in the two sub-sections below, we represent each sentence as a vector. We use the cosine similarity between the vectors as a prediction of their semantic relatedness. We use the Spearman correlation between the prediction and gold relatedness scores to measure the goodness of the relatedness predictions (and in turn of the sentence representation).

The experiments below (unless otherwise specified) all involve 5-fold cross-validation (CV) on STR-2021. We report the average of the Spearman correlations across the folds. Note that even for models that do not require training (e.g., Dice score), to enable direct comparisons with trained methods, we evaluate their performance on each test fold independently and report the average of the correlations across folds.
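A small sketch of this protocol (names are illustrative; `predict` stands in for any of the scoring methods evaluated below):

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.model_selection import KFold

def cv_spearman(sentence_pairs, gold_scores, predict, n_splits=5, seed=0):
    """Average Spearman correlation over the test folds of a k-fold split.

    `predict` maps a (sentence_a, sentence_b) pair to a predicted relatedness
    score (e.g., the Dice baseline or the cosine of two sentence embeddings).
    Untrained predictors are simply scored on each test fold in turn.
    """
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    correlations = []
    for _, test_idx in folds.split(sentence_pairs):
        predictions = [predict(sentence_pairs[i]) for i in test_idx]
        gold = [gold_scores[i] for i in test_idx]
        rho, _ = spearmanr(predictions, gold)
        correlations.append(rho)
    return float(np.mean(correlations))
```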

6.1 Do Unsupervised Embeddings Capture Semantic Relatedness?

We first explore unsupervised approaches to sentence representation where the embedding of a sentence is derived from that of its constituent tokens. The token embedding can be of two types:

  • Noncontextual Word Embeddings: We tested three popular models: Word2Vec Mikolov et al. (2013b), GloVe Pennington et al. (2014), and Fasttext Grave et al. (2018).

  • Contextual Word Embeddings: We tested pretrained contextual embeddings from BERT Devlin et al. (2019) and RoBERTa Liu et al. (2019). We use the bert-base-uncased and roberta-base models from the HuggingFace library (https://huggingface.co).

We obtain sentence embeddings from the token embeddings by mean-pooling and by max-pooling the token embeddings of the final layer. For the contextual embeddings, we also explore using the embedding of the classification token ([CLS]). (The [CLS] token is the first token of every sequence encoded by the model, and is generally used as an aggregate sequence representation for downstream tasks.)
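A minimal sketch of these pooling strategies with the HuggingFace transformers library (shown for bert-base-uncased; roberta-base is analogous; details beyond the description above are assumptions):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def embed(sentence, pooling="mean"):
    """Pool the final-layer token embeddings of a pretrained encoder."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, dim)
    if pooling == "mean":
        return hidden.mean(dim=0)
    if pooling == "max":
        return hidden.max(dim=0).values
    if pooling == "cls":
        return hidden[0]  # embedding of the [CLS] token
    raise ValueError(f"unknown pooling: {pooling}")

def predicted_relatedness(s1, s2, pooling="mean"):
    """Cosine similarity of the pooled embeddings, used as the prediction."""
    return torch.cosine_similarity(embed(s1, pooling),
                                   embed(s2, pooling), dim=0).item()
```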

Table 5 shows the results. As baseline, we include how well simple lexical overlap (Dice score) predicts relatedness (row 1). Observe that mean-pooling with word2vec (row 2) obtains slightly higher correlation than the baseline, but the majority of the non-contextual embedding models fail to obtain better correlations (rows 3–9). The contextual embeddings from BERT and RoBERTa do not perform better than the word2vec embeddings (rows 10–15). Overall, unsupervised sentence representation methods leave room for improvement.

Model                                                Spearman
Baseline
 1. Lexical overlap (Dice)                           0.57
Unsupervised, Non-Contextual Embeddings
 2. Word2Vec (mean, Googlenews)                      0.60
 3. Word2Vec (max, Googlenews)                       0.54
 4. GloVe (mean, Common Crawl)                       0.49
 5. GloVe (max, Common Crawl)                        0.56
 6. GloVe (mean, 200_Twitter)                        0.44
 7. GloVe (max, 200_Twitter)                         0.48
 8. Fasttext (mean, Common crawl)                    0.29
 9. Fasttext (max, Common crawl)                     0.24
Unsupervised, Contextual Embeddings
10. BERT-base (mean)                                 0.58
11. BERT-base (max)                                  0.55
12. BERT-base (cls)                                  0.41
13. RoBERTa-base (mean)                              0.48
14. RoBERTa-base (max)                               0.47
15. RoBERTa-base (cls)                               0.41
Supervised (Fine-tuning on portions of STR-2021)
16. BERT-base (mean)                                 0.82
17. RoBERTa-base (mean)                              0.83
Table 5: Average correlation between the human-annotated relatedness of sentence pairs and the cosine similarity between their embeddings across the CV runs.

6.2 Do Supervised Embeddings Capture Semantic Relatedness?

We now evaluate the performance of BERT-based models on STR-2021 when the task is formulated as supervised regression. We use the S-BERT framework of Reimers and Gurevych (2019), and apply mean-pooling on top of the token embeddings of the final layer to obtain sentence embeddings. The model is trained using a cosine-similarity loss: the cosine between the embeddings of a sentence pair is compared to the gold semantic relatedness score to obtain the mean squared error (MSE) loss for each datapoint.
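A minimal sketch of this setup with the sentence-transformers library: the bi-encoder construction (mean pooling plus a cosine-similarity loss) follows the description above, while the training triples, batch size, and number of epochs are illustrative stand-ins rather than the paper's settings:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models

# roberta-base encoder with mean pooling over the final-layer token embeddings.
word_embedding = models.Transformer("roberta-base")
pooling = models.Pooling(word_embedding.get_word_embedding_dimension())  # mean pooling by default
model = SentenceTransformer(modules=[word_embedding, pooling])

# Hypothetical (sentence_a, sentence_b, gold relatedness) triples standing in
# for the STR-2021 training folds; the scores shown are made up.
train_pairs = [
    ("There was a lemon tree next to the house.",
     "The boy enjoyed reading under the lemon tree.", 0.7),
    ("There was a lemon tree next to the house.",
     "The boy was an excellent football player.", 0.1),
]
train_examples = [InputExample(texts=[a, b], label=score)
                  for a, b, score in train_pairs]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=16)

# CosineSimilarityLoss compares cosine(emb_a, emb_b) to the gold score with a
# squared-error objective, matching the loss described above.
train_loss = losses.CosineSimilarityLoss(model)
model.fit(train_objectives=[(train_loader, train_loss)], epochs=4)
```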

Table 5 rows 16 and 17 show the results: fine-tuning on STR-2021 has led to considerably better relatedness scores.

6.2.1 Impact of Domain on Fine-Tuning

The results above show that fine-tuning is critical for better sentence representation. However, it is well documented that the domain of the data can have a substantial impact on results, especially when the test data is quite different from the training data. With the inclusion of data from various domains in STR-2021 (Table 2), one can systematically explore performance on individual domains, as well as the extent to which performance drops if no data from the target domain is included for training.

Table 6 shows these results. The SBERT(RoBERTa) CV column shows a breakdown of results on sentence pairs from each source (domain). Essentially, these are results for the scenario where some portion of in-domain data is included in the training folds (along with data from other domains), and the system correlations are determined only on the test fold’s target-domain pairs. Observe that performance on most domains is comparable, except for the stance domain, where correlations are much lower. (The stance subset has a smaller range of relatedness scores than other subsets, and a lower range is known to lead to lower correlations; see https://www.statisticshowto.com/restricted-range/. Thus its correlations are not directly comparable to those of the other subsets.)

The LOOut column shows correlations with a leave-one-out setup: no in-domain training data is used, and system correlations are determined only for the target-domain pairs. Observe that this leads to drops in scores for all domains except STS. (The STS set is small and itself includes data from many domains, and thus is not ideal for examining domain impact.) However, the drop is only a few percentage points, and the scores are still much higher than the lexical overlap (Dice) baseline. This suggests that the diversity of data in the remaining subsets is useful in overcoming a lack of in-domain training data.

             Dice    SBERT(RoBERTa)
             CV      CV       LOOut
STS          0.60    0.79     0.82
SNLI         0.53    0.80     0.77
Stance       0.20    0.49     0.39
Goodreads    0.44    0.73     0.70
Wiki         0.48    0.79     0.75
Formality    0.69    0.86     0.83
ParaNMT      0.44    0.80     0.79
Table 6: Breakdown of average test-fold correlations for each source: (a) using lexical overlap (Dice), (b) using SBERT and some in-domain data for fine-tuning (in addition to data from other domains), and (c) using SBERT and only out-of-domain data for fine-tuning (LOOut). CV: cross-validation. LOOut: leave-one-out.

7 Conclusions

We created STR-2021, the first dataset of English sentence pairs annotated with fine-grained relatedness scores. We used a comparative annotation method that produced a split-half reliability of 0.84. Thus, we showed that speakers of a language can reliably judge semantic relatedness. We used the dataset to explore several research questions pertaining to what makes two sentences more related.

We showed, for the first time, the impact of word overlap and presence of related words on the semantic relatedness of sentence pairs. We showed that, on average, occurrence of related proper nouns and nouns across a sentence pair increases their relatedness the most, compared to other parts of speech. We also showed that the presence of related objects often leads to greater sentence-pair relatedness than related subjects.

Finally, we used STR-2021 to evaluate the ability of various sentence representation methods to embed sentences in vector spaces such that sentences closer to each other in meaning are also closer in the vector space. We found that most unsupervised sentence representation methods (using either contextual or non-contextual word embeddings) are unable to surpass the lexical overlap baseline. However, when trained on portions of STR-2021, supervised approaches to sentence representation perform substantially better, reaching an average correlation of 0.83 with the gold relatedness scores on the held-out test folds. We also presented a breakdown of the correlations on subsets of STR-2021 corresponding to different domains (sources), and the impact of not having in-domain data for training. We make STR-2021 freely available to foster further research in semantic relatedness and sentence representation.

8 Ethics Statement

This paper respects existing intellectual property by only making use of publicly and freely available datasets. The crowd-sourced task was approved by our Institutional Research Ethics Board. The annotators were based in the United States of America and were paid the federal minimum wage of $7.25. Our annotation process stored no information about annotator identity and as such there is no privacy risk to them. The individual sentences selected did not have any risks to privacy either (as evaluated by manual annotation of the sentences). Models trained on this dataset may not generalize to external datasets gathered from different populations. Knowledge about language features may not generalize to other languages.

Any dataset of semantic relatedness entails several ethical considerations. We list some notable ones below:

  • Coverage: We sampled English sentences from a diverse array of internet sources, with a focus on social media. Yet, it is likely that several types of sentences (and several demographic groups) are not well-represented in STR-2021. The dataset likely includes more sentences by people from the United States and Europe, and by people with socio-economic and educational backgrounds that allow for social media access.

  • Not Immutable: The relatedness scores do not indicate an inherent, unchangeable attribute. Relatedness can change with time, but the dataset entries are largely fixed; they pertain to the time they were created.

  • Socio-Cultural Biases: The annotations of relatedness capture various human biases. These biases may be systematically different for different socio-cultural groups. Our data was annotated by US annotators, but even within the US there are different socio-cultural groups.

  • Inappropriate Biases: Our biases impact how we view the world, and some of the biases of an individual may be inappropriate. For example, one may have race or gender-related biases that may percolate subtly into one’s notions of how related two units of text are. Our dataset curation was careful to avoid sentences from problematic sources, and we have not seen any inappropriate relatedness judgments, but it is possible that some subtle inappropriate biases still remain. Thus, as with any approach for sentence representation or semantic relatedness, we caution users to explicitly check for such biases in their system regardless of whether they use STR-2021.

  • Perceptions (not “right” or “correct” labels): Our goal here was to identify common perceptions of semantic relatedness. These are not meant to be “correct” or “right” answers, but rather what the majority of the annotators believe based on their intuitions of the English language.

  • Relative (not Absolute): The absolute values of the relatedness scores themselves have no meaning. The scores help order sentence pairs relative to each other. For example, a pair with a higher relatedness score should be considered more related than a pair with a lower score. No claim is made that the mid-point (relatedness score of 0.5) separates related pairs from unrelated pairs. One may determine categories such as related or unrelated by finding thresholds of relatedness scores optimal for one's use/task.

We welcome feedback on further considerations to highlight in this section.

Acknowledgments

Mohamed is supported by a Vanier Graduate Scholarship. Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute (www.vectorinstitute.ai/partners).

References

  • E. Agirre, C. Banea, C. Cardie, D. Cer, M. Diab, A. Gonzalez-Agirre, W. Guo, I. Lopez-Gazpio, M. Maritxalar, R. Mihalcea, et al. (2015) SemEval-2015 Task 2: Semantic Textual Similarity, English, Spanish and Pilot on Interpretability. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pp. 252–263. Cited by: §2.1.
  • E. Agirre, C. Banea, C. Cardie, D. Cer, M. Diab, A. Gonzalez-Agirre, W. Guo, R. Mihalcea, G. Rigau, and J. Wiebe (2014) SemEval-2014 Task 10: Multilingual Semantic Textual Similarity. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pp. 81–91. Cited by: §2.1.
  • E. Agirre, C. Banea, D. Cer, M. Diab, A. Gonzalez-Agirre, R. Mihalcea, G. Rigau Claramunt, and J. Wiebe (2016) SemEval-2016 Task 1: Semantic Textual Similarity, Monolingual and Cross-lingual Evaluation. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pp. 497–511. Cited by: §2.1.
  • E. Agirre, D. Cer, M. Diab, A. Gonzalez-Agirre, and W. Guo (2013) * SEM 2013 shared task: Semantic Textual Similarity. In Second Joint Conference on Lexical and Computational Semantics (* SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity, pp. 32–43. Cited by: §2.1.
  • E. Agirre, D. Cer, M. Diab, and A. Gonzalez-Agirre (2012) SemEval-2012 Task 6: A Pilot on Semantic Textual Similarity. In * SEM 2012: The First Joint Conference on Lexical and Computational Semantics–Volume 1: Proceedings of the Main Conference and the Shared Task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012), pp. 385–393. Cited by: §2.1.
  • S. Asaadi, S. Mohammad, and S. Kiritchenko (2019) Big BiRD: A large, fine-grained, bigram relatedness dataset for examining semantic composition. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 505–516. Cited by: §1, §2.1, §3.2.
  • H. Baumgartner and J. E.M. Steenkamp (2001) Response styles in marketing research: a cross-national investigation. Journal of Marketing Research 38 (2), pp. 143–156. Cited by: §1.
  • S. R. Bowman, G. Angeli, C. Potts, and C. D. Manning (2015) A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). Cited by: §A.1.1, item 4.
  • D. Cer, M. Diab, E. Agirre, I. Lopez-Gazpio, and L. Specia (2017) SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pp. 1–14. Cited by: item 5.
  • D. Chandrasekaran and V. Mago (2021) Evolution of Semantic Similarity—A Survey. ACM Computing Surveys (CSUR) 54 (2), pp. 1–37. Cited by: §2.1.
  • A. Conneau and D. Kiela (2018) SentEval: An Evaluation Toolkit for Universal Sentence Representations. arXiv preprint arXiv:1803.05449. Cited by: §2.1.
  • L. J. Cronbach (1951) Coefficient alpha and the internal structure of tests. Psychometrika 16 (3), pp. 297–334. Cited by: §4.
  • H. A. David (1963) The method of paired comparisons. Vol. 12, London. Cited by: §2.2.
  • P. de Vocht (2020) Subject Verb Object extractor External Links: Link Cited by: §5.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Cited by: 2nd item.
  • W. B. Dolan and C. Brockett (2005) Automatically Constructing a Corpus of Sentential Paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005), Cited by: §2.1.
  • L. Finkelstein, E. Gabrilovich, Y. Matias, E. Rivlin, Z. Solan, G. Wolfman, and E. Ruppin (2001) Placing Search in Context: The Concept Revisited. In Proceedings of the 10th International Conference on World Wide Web, pp. 406–414. Cited by: §2.1.
  • T. N. Flynn and A. A. Marley (2014) Best-worst scaling: theory and methods. In Handbook of choice modelling, Cited by: §2.2.
  • É. Grave, P. Bojanowski, P. Gupta, A. Joulin, and T. Mikolov (2018) Learning Word Vectors for 157 Languages. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Cited by: 1st item.
  • S. Harispe, S. Ranwez, S. Janaqi, and J. Montmain (2015) Semantic Similarity from Natural Language and Ontology Analysis. Morgan & Claypool Publishers. Cited by: §1, §1.
  • M. Honnibal, I. Montani, S. Van Landeghem, and A. Boyd (2020) spaCy: Industrial-strength Natural Language Processing in Python External Links: Document, Link Cited by: §5, §5.
  • C. Horn, C. Manduca, and D. Kauchak (2014) Learning a Lexical Simplifier Using Wikipedia. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Cited by: item 7.
  • S. Kiritchenko and S. Mohammad (2017) Best-Worst Scaling More Reliable than Rating Scales: A Case Study on Sentiment Intensity Annotation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 465–470. Cited by: §1, §2.2, footnote 4.
  • G. F. Kuder and M. W. Richardson (1937) The theory of the estimation of test reliability. Psychometrika 2 (3), pp. 151–160. Cited by: §4.
  • Y. Li, D. McLean, Z. A. Bandar, J. D. O’shea, and K. Crockett (2006) Sentence Similarity Based on Semantic Nets and Corpus Statistics. IEEE Transactions on Knowledge and Data Engineering 18 (8), pp. 1138–1150. Cited by: §2.1.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692. Cited by: 2nd item.
  • J. J. Louviere and G. G. Woodworth (1991) Best-worst scaling: A model for the largest difference judgments. Technical report Working paper. Cited by: §1, §2.2.
  • M. Marelli, S. Menini, M. Baroni, L. Bentivogli, R. Bernardi, R. Zamparelli, et al. (2014) A SICK cure for the evaluation of compositional distributional semantic models.. In LREC, pp. 216–223. Cited by: §2.1.
  • T. Mikolov, K. Chen, G. Corrado, and J. Dean (2013a) Efficient Estimation of Word Representations in Vector Space. arXiv preprint arXiv:1301.3781. Cited by: §5.
  • T. Mikolov, W. Yih, and G. Zweig (2013b) Linguistic Regularities in Continuous Space Word Representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 746–751. Cited by: 1st item.
  • G. A. Miller and W. G. Charles (1991) Contextual correlates of semantic similarity. Language and Cognitive Processes 6 (1), pp. 1–28. Cited by: §1, §2.1.
  • S. Mohammad, S. Kiritchenko, P. Sobhani, X. Zhu, and C. Cherry (2016) A Dataset for Detecting Stance in Tweets. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pp. 3945–3952. Cited by: §A.1, item 6.
  • S. Mohammad and S. Kiritchenko (2018) Understanding emotions: a dataset of tweets to study interactions between affect categories. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. External Links: Link Cited by: §4.
  • S. M. Mohammad and G. Hirst (2012) Distributional Measures of Semantic Distance: A Survey. arXiv preprint arXiv:1203.1858. Cited by: §1.
  • S. Mohammad (2008) Measuring semantic distance using distributional profiles of concepts. Ph.D. Thesis. Cited by: §1.
  • B. Orme (2009) MaxDiff Analysis: Simple Counting, Individual-Level Logit, and HB. Sawtooth Software. Cited by: §2.2.
  • J. Pennington, R. Socher, and C. D. Manning (2014) GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543. Cited by: 1st item.
  • S. Presser and H. Schuman (1996) Questions and answers in attitude surveys: experiments on question form, wording, and context. SAGE Publications, Inc. Cited by: §1.
  • K. Radinsky, E. Agichtein, E. Gabrilovich, and S. Markovitch (2011) A Word at a Time: Computing Word Relatedness Using Temporal Semantic Analysis. In Proceedings of the 20th International Conference on World Wide Web, pp. 337–346. Cited by: §1, §2.1.
  • S. Rao and J. Tetreault (2018) Dear Sir or Madam, May I Introduce the GYAFC Dataset: Corpus, Benchmarks and Metrics for Formality Style Transfer. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 129–140. Cited by: §A.2, item 1.
  • N. Reimers and I. Gurevych (2019) Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, External Links: Link Cited by: §6.2.
  • P. M. Roget (1911) Roget’s thesaurus of english words and phrases…. TY Crowell Company. Cited by: §A.1.1.
  • H. Rubenstein and J. B. Goodenough (1965) Contextual correlates of synonymy. Communications of the ACM 8 (10), pp. 627–633. Cited by: §1, §2.1.
  • L. L. Thurstone (1927) A Law of Comparative Judgment.. Psychological Review 34 (4), pp. 273. Cited by: §2.2.
  • M. Wan and J. J. McAuley (2018) Item Recommendation on Monotonic Behavior Chains. In Proceedings of the 12th ACM Conference on Recommender Systems, RecSys 2018, Vancouver, BC, Canada, October 2-7, 2018, S. Pera, M. D. Ekstrand, X. Amatriain, and J. O’Donovan (Eds.), pp. 86–94. External Links: Link, Document Cited by: §A.1.3, item 2.
  • M. Wan, R. Misra, N. Nakashole, and J. J. McAuley (2019) Fine-Grained Spoiler Detection from Large-Scale Review Corpora. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, A. Korhonen, D. R. Traum, and L. Màrquez (Eds.), pp. 2605–2610. External Links: Link, Document Cited by: §A.1.3.
  • J. Wieting and K. Gimpel (2018) ParaNMT-50M: pushing the limits of paraphrastic sentence embeddings with millions of machine translations. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 451–462. External Links: Link, Document Cited by: §A.1.4, item 3.

Appendix A Further Details on Sampling Sentence Pairs from Source Datasets

This Appendix provides further information about the sources of data and how sentence pairs were sampled from them to be included in STR-2021.

A.0.1 STS Data

We selected 250 sentence pairs from existing STS corpora. This selection was done to enable a small investigation into the interplay between relatedness and similarity, which could serve as motivation for further investigation in future work. For this dataset, we randomly sampled 50 sentence pairs from each bucket of annotation scores (i.e., 50 sentence pairs having STS similarity scores falling in the range (0, 1], 50 sentence pairs having scores in (1, 2], and so on).

A.1 Stance Data

We created 750 sentence pairs from Mohammad et al. (2016)’s dataset of tweets labeled for stance. The original dataset is composed of individual tweets labelled for both stance (‘For’, ‘Against’, ‘Neither Inference Likely’) and sentiment (‘Positive’, ‘Negative’, ‘Neutral’). The dataset was built from tweets focused on six targets: ‘Atheism’, ‘Climate Change’, ‘Donald Trump’, ‘Feminism’, ‘Hillary Clinton’, ‘Abortion’.

When curating our sentence pairs, we limited the possible targets to ‘Hillary Clinton’, ‘Donald Trump’, and ‘Abortion’. Sentence pairs were chosen such that both sentences shared the same target. 500 sentence pairs shared their stance towards their target (i.e., 250 for–for pairs and 250 against–against pairs), and 250 sentence pairs differed in their stance (i.e., 250 for–against pairs). We did not use any lexical overlap heuristic to specify which tweets should be paired with each other, because we were interested in studying whether overlap in topic was a strong enough signal to impact relatedness. That is, by choosing pairs with the same target, we were already pre-selecting for various degrees of relatedness.

A.1.1 SNLI Data

We created 750 sentence pairs from the Stanford Natural Language Inference (SNLI) Dataset Bowman et al. (2015). SNLI is composed of image description captions; for each caption, multiple premise sentences are generated, along with multiple possible hypothesis sentences that could belong to each premise. To build our sentence pairs, we paired different premise sentences together. We did not wish to pair premise sentences with hypothesis sentences, as their sentence structures are significantly different (and simpler for the hypothesis sentences), as noted by the creators of the dataset. Even so, the majority of premise sentences are very short (with a mean token count of 14), often following very simple (and similar) grammatical structures.

To generate the sentence pairs, we first removed all sentences with fewer than 5 or more than 25 tokens. Then, for each token in the remaining sentences, we replaced the token with its most frequent synonym, using Roget’s thesaurus Roget (1911) to define synonymous relationships. Words without synonyms were left unchanged. The intention behind replacing each word with its most frequent synonym was to ensure that synonymous phrasings would count as overlaps when we measured overlap. We then randomly selected 750 sentences to serve as the first sentence of our final pairings. To find the second sentence of each pairing, we looped through all premise sentences and returned the first sentence that satisfied two conditions: 1) the unigram overlap was greater than or equal to 25% and less than 75% of the first sentence, and 2) the difference in length between the two sentences did not exceed 25%.
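A small sketch of these two pairing conditions (how the 25% length-difference cap is normalized is not stated explicitly, so the use of the longer sentence as the denominator is an assumption):

```python
def valid_pairing(first_tokens, second_tokens):
    """Check the two pairing conditions: unigram overlap between 25% (inclusive)
    and 75% (exclusive) of the first sentence's unique tokens, and a length
    difference of at most 25% (here, relative to the longer sentence)."""
    first, second = set(first_tokens), set(second_tokens)
    overlap = len(first & second) / len(first)
    length_diff = (abs(len(first_tokens) - len(second_tokens))
                   / max(len(first_tokens), len(second_tokens)))
    return 0.25 <= overlap < 0.75 and length_diff <= 0.25
```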

A.1.2 Wikipedia Data

We sampled 1000 sentence pairs from a dataset that pairs sentences from English Wikipedia with sentences from Simple English Wikipedia. Created to enable the task of sentence simplification, the paired sentences (matched using rule-based classification) are often very closely related. We used this dataset in two ways: 1. extracting sentence pairs that serve as paraphrases or near paraphrases (we refer to these as Wiki_pp), and 2. pairing sentences with other random sentences in the dataset (we refer to these as Wiki_r).

Wiki_PP: First, we removed any pairings for which either sentence was shorter than 5 words or longer than 25 words. We then narrowed the list further by removing any pairings that did not share between 25% and 75% of their unique unigrams. From the remaining sentence pairs, we randomly selected 500.

Wiki_R: Here, we only make use of the full sentences from the original Wikipedia, discarding the sentences from Simple Wikipedia. We remove all sentences that have fewer than 5 or more than 25 tokens. To create the sentence pairs, we loop in a random order through all possible pairings of sentences. We pair two sentences if they share at least 25% and less than 75% of their tokens, and the difference in length between the two sentences does not exceed 25%. We stop once we have generated 500 sentence pairs.

A.1.3 Goodreads Data

We created 1000 sentence pairs from the UCSD Goodreads Dataset (Wan and McAuley, 2018; Wan et al., 2019), which contains book reviews from the Goodreads website. We limit the sampling to the ‘Fantasy and Paranormal’ genre, since it contains a relatively higher number of reviews per book, allowing for a higher possibility of sampling related sentence pairs. Each review was first split into sentences using the default NLTK sentence tokenizer; we keep only those sentences with between 5 and 25 tokens. We then randomly examine pairs of sentences and quantify the lexical overlap between them with an IDF-weighted Dice overlap score. The pairs are then assigned to buckets based on this overlap score; the range of each bucket is obtained by finding 50 equally-spaced percentiles of the entire score distribution. We then sample exponentially increasing numbers of sentence pairs from low to high weighted-Dice-overlap bins such that a total of 1000 sentence pairs are included.
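One plausible reading of the IDF-weighted Dice overlap used above, sketched in Python (the authors do not spell out the exact weighting, so this formulation is an assumption):

```python
import math
from collections import Counter

def idf_table(tokenized_sentences):
    """Inverse document frequency of each token, with sentences as documents."""
    document_freq = Counter()
    for tokens in tokenized_sentences:
        document_freq.update(set(tokens))
    n = len(tokenized_sentences)
    return {tok: math.log(n / df) for tok, df in document_freq.items()}

def idf_weighted_dice(tokens_x, tokens_y, idf):
    """Shared tokens contribute their IDF weight, normalized by the total IDF
    mass of both sentences (an IDF-weighted analogue of Equation 1)."""
    x, y = set(tokens_x), set(tokens_y)
    shared = sum(idf.get(t, 0.0) for t in x & y)
    total = sum(idf.get(t, 0.0) for t in x) + sum(idf.get(t, 0.0) for t in y)
    return 2 * shared / total if total else 0.0
```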

A.1.4 ParaNMT Data

ParaNMT (Wieting and Gimpel, 2018) is a dataset of 51 million sentential paraphrases that were automatically generated using a neural machine translation system. We generated two sets of pairs from these sentences, corresponding to paraphrases and random pairs:


ParaNMT_PP: We assign paraphrases to buckets based on the Dice score between the two sentences. We divided the range of scores into 100 equally-sized percentiles. We then sample pairs uniformly from each bucket, for a total of 450 sentence pairs.

ParaNMT_R: For the random, non-paraphrase sentence pairings, we use the Dice score to extract 300 pairs, analogous to the creation of the Wiki_R pairs.

A.2 Formality Data

Our third paraphrase corpus is the Formality dataset of Rao and Tetreault (2018) (they refer to it as GYAFC). It consists of human-written formal and informal paraphrases of sentences sourced from the Yahoo! Answers platform. Our sampling procedure for this dataset follows that of the ParaNMT dataset.

Formality_PP: We assign sentences to one of 50 buckets based on their lexical overlap score as before. We then uniformly sample from each bucket to extract 300 sentence pairs.

Formality_R: We sample random pairings of sentences that satisfy the token overlap and length difference conditions defined for Wiki_R and ParaNMT_R. We extract 700 such sentence pairs.