The Harvard USPTO Patent Dataset: A Large-Scale, Well-Structured, and Multi-Purpose Corpus of Patent Applications

07/08/2022
by Mirac Suzgun, et al.

Innovation is a major driver of economic and social development, and information about many kinds of innovation is embedded in semi-structured data from patents and patent applications. Although the impact and novelty of innovations expressed in patent data are difficult to measure through traditional means, machine learning (ML) offers a promising set of techniques for evaluating novelty, summarizing contributions, and embedding semantics. In this paper, we introduce the Harvard USPTO Patent Dataset (HUPD), a large-scale, well-structured, and multi-purpose corpus of English-language patent applications filed with the United States Patent and Trademark Office (USPTO) between 2004 and 2018. With more than 4.5 million patent documents, HUPD is two to three times larger than comparable corpora. Unlike previously proposed patent datasets in NLP, HUPD contains the inventor-submitted versions of patent applications rather than the final versions of granted patents, which allows us to study patentability at the time of filing using NLP methods for the first time. It is also novel in its inclusion of rich structured metadata alongside the text of patent filings: by providing each application's metadata along with all of its text fields, the dataset enables researchers to perform new sets of NLP tasks that leverage variation in structured covariates. As a case study on the types of research HUPD makes possible, we introduce a new task to the NLP community: binary classification of patent decisions. We additionally show that the structured metadata provided in the dataset enables us to conduct explicit studies of concept shifts for this task. Finally, we demonstrate how HUPD can be used for three additional tasks: multi-class classification of patent subject areas, language modeling, and summarization.
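To make the binary decision task and the concept-shift idea concrete, here is a minimal sketch on toy data. It is not the paper's method or HUPD's actual schema: the records, the field names (`filing_year`, `abstract`, `decision`), the cutoff year, and the bag-of-words Naive Bayes classifier are all assumptions chosen for illustration. The sketch trains on applications filed before a cutoff year and evaluates on later ones, which is one simple way structured metadata could be used to probe shifts over time.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy records; field names and values are illustrative only
# and are NOT taken from the HUPD schema.
RECORDS = [
    {"filing_year": 2005, "abstract": "wireless signal encoding method", "decision": "ACCEPTED"},
    {"filing_year": 2007, "abstract": "novel battery electrode composition", "decision": "ACCEPTED"},
    {"filing_year": 2008, "abstract": "generic business method for advertising", "decision": "REJECTED"},
    {"filing_year": 2010, "abstract": "abstract idea for managing accounts", "decision": "REJECTED"},
    {"filing_year": 2014, "abstract": "battery electrode coating", "decision": "ACCEPTED"},
    {"filing_year": 2016, "abstract": "business method for managing advertising", "decision": "REJECTED"},
]


def split_by_year(records, cutoff_year):
    """Split on a metadata field: train on earlier filings, test on later ones."""
    train = [r for r in records if r["filing_year"] < cutoff_year]
    test = [r for r in records if r["filing_year"] >= cutoff_year]
    return train, test


def train_naive_bayes(records):
    """Collect per-label word counts for a multinomial Naive Bayes model."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for r in records:
        label_counts[r["decision"]] += 1
        word_counts[r["decision"]].update(r["abstract"].split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab


def predict(model, text):
    """Return the label with the highest log-posterior (add-one smoothing)."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.split():
            if word in vocab:  # words unseen in training are ignored
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, records):
    """Fraction of records whose predicted decision matches the label."""
    correct = sum(predict(model, r["abstract"]) == r["decision"] for r in records)
    return correct / len(records)


train_set, test_set = split_by_year(RECORDS, cutoff_year=2012)
model = train_naive_bayes(train_set)
print("accuracy on later filings:", accuracy(model, test_set))
```

Comparing accuracy on a later-year test set against accuracy on a held-out set from the same years as the training data would give a rough signal of how much the relationship between text and decision drifts over time. With real data, a stronger model and many more examples would be needed.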


