A Massive Scale Semantic Similarity Dataset of Historical English

06/30/2023
by   Emily Silcock, et al.
0

A diversity of tasks use language models trained on semantic similarity data. While there are a variety of datasets that capture semantic similarity, they are either constructed from modern web data or are relatively small datasets created in the past decade by human annotators. This study utilizes a novel source, newly digitized articles from off-copyright, local U.S. newspapers, to assemble a massive-scale semantic similarity dataset spanning 70 years from 1920 to 1989 and containing nearly 400M positive semantic similarity pairs. Historically, around half of articles in U.S. local newspapers came from newswires like the Associated Press. While local papers reproduced articles from the newswire, they wrote their own headlines, which form abstractive summaries of the associated articles. We associate articles and their headlines by exploiting document layouts and language understanding. We then use deep neural methods to detect which articles are from the same underlying source, in the presence of substantial noise and abridgement. The headlines of reproduced articles form positive semantic similarity pairs. The resulting publicly available HEADLINES dataset is significantly larger than most existing semantic similarity datasets and covers a much longer span of time. It will facilitate the application of contrastively trained semantic similarity models to a variety of tasks, including the study of semantic change across space and time.

READ FULL TEXT

page 2

page 3

page 5

page 6

page 7

research
04/15/2018

Introducing two Vietnamese Datasets for Evaluating Semantic Models of (Dis-)Similarity and Relatedness

We present two novel datasets for the low-resource language Vietnamese t...
research
11/30/2017

Calculating Semantic Similarity between Academic Articles using Topic Event and Ontology

Determining semantic similarity between academic documents is crucial to...
research
02/16/2017

Clustering articles based on semantic similarity

Document clustering is generally the first step for topic identification...
research
04/30/2015

Texts in, meaning out: neural language models in semantic similarity task for Russian

Distributed vector representations for natural language vocabulary get a...
research
09/24/2021

Rethinking Crowd Sourcing for Semantic Similarity

Estimation of semantic similarity is crucial for a variety of natural la...
research
03/16/2023

SemDeDup: Data-efficient learning at web-scale through semantic deduplication

Progress in machine learning has been driven in large part by massive in...
research
04/17/2019

Headline Generation: Learning from Decomposed Document Titles

We propose a novel method for generating titles for unstructured text do...

Please sign up or login with your details

Forgot password? Click here to reset