Accessing United States Bulk Patent Data with patentpy and patentr

by James Yu, et al.

The United States Patent and Trademark Office (USPTO) provides publicly accessible bulk data files containing information for all patents from 1976 onward. However, the format of these files changes over time and is memory-inefficient, which can pose issues for individual researchers. Here, we introduce the patentpy and patentr packages for the Python and R programming languages. They allow users to programmatically fetch bulk data from the USPTO website and access it locally in a cleaned, rectangular format. Research depending on United States patent data would benefit from the use of patentpy and patentr. We describe the package implementation and quality control mechanisms, and present use cases highlighting simple yet effective applications of this software.







The United States Patent and Trademark Office (USPTO) hosts bulk data files for all patents published since 1976. These data hold insights relevant to a large number of fields but are difficult to access at scale. Files for each week can be hundreds of megabytes in size, and files from different years are sometimes formatted differently. As a result, potential users face a high barrier to entry before they can take advantage of the available data. The patentpy and patentr packages simplify access to USPTO data by providing programmatic interfaces that return it in a single rectangular, tidy [1] format and significantly reduce its storage size.

The patentpy package provides a Python interface with the functionality implemented using a combination of Python and C++. The patentr package does the same with an R interface. Since the two packages share functionality, they also share a C++ code base, with each depending on existing XML libraries in the respective language. Of note, Python and R both boast portability across multiple operating systems, making the corresponding packages easily available to a large number of users.

Currently, the most authoritative reference for patent analytics is the World Intellectual Property Organization (WIPO) Manual. [2] According to the WIPO Manual, multiple databases host patent data for exploration. These include The Lens, Patentscope, espacenet, LATIPAT, EPO Open Patent Services, and Google Patents, among others. Most of these web services provide user-friendly, graphical point-and-click interfaces. While these are initially easier to use, the time and effort required scale with the number and complexity of searches conducted. Thus, these services become cumbersome when attempting to access large amounts of patent data spanning many years. In contrast, a programmatic interface requires a higher initial investment, but each additional search then costs far less effort. This is the specific niche that patentpy and patentr address.

To demonstrate the utility of these packages, we will collect an arbitrarily selected set of patent data and conduct a preliminary analysis to answer four research questions: (1) how many patents are issued weekly; (2) which IPC classes grow fastest; (3) which IPC classes take longest to issue after application submission; and (4) whether the time between application and issuance changes over time. We hope these examples highlight that these packages can be effectively combined with available software packages to rapidly answer questions of interest. For the arbitrary dataset, we will pretend that the first 8 weeks of a year have special significance, and that we would like to focus our attention on these weeks for the first 5 years of available patent data. Collecting this data takes a single line of code for both packages using get_bulk_patent_data.
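As a rough sketch of how such a request might be prepared in Python, the snippet below builds parallel year and week lists covering the first 8 weeks of 1976 through 1980. The helper name `first_weeks` is hypothetical and not part of either package. How these lists map onto the parameters of get_bulk_patent_data is an assumption that should be checked against the package documentation.

```python
from itertools import product


def first_weeks(years, n_weeks):
    """Return parallel (years, weeks) lists with one entry per requested week.

    Hypothetical helper for illustration; not part of patentpy or patentr.
    """
    pairs = list(product(years, range(1, n_weeks + 1)))
    return [y for y, _ in pairs], [w for _, w in pairs]


# First 8 weeks of each year from 1976 to 1980, inclusive: 5 * 8 = 40 weekly files.
years, weeks = first_weeks(range(1976, 1981), 8)
print(len(years), (years[0], weeks[0]), (years[-1], weeks[-1]))
```

Keeping the year and week lists explicit makes it easy to see exactly which weekly bulk files a request covers before any download starts.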

First, let us ask how many patents were issued weekly in the first 8 weeks of each year. We process the data by extracting the year from each patent’s issue date, aggregating and counting the number of patents issued each week, and feeding the resulting data to a visualization library. A boxplot of the weekly aggregated patent issue counts split by year results in Figure 1. We see that roughly 1100-1400 patents were granted per week during the first 8 weeks of each year between 1976 and 1980, inclusive.
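The aggregation step described above can be sketched with the Python standard library alone. The issue dates below are made up for illustration; in practice they would come from the issue date column of the returned data. Grouping by ISO week number is one possible choice here, not necessarily the one used to produce Figure 1.

```python
from collections import Counter, defaultdict
from datetime import date

# Illustrative issue dates only (not real patent data).
issue_dates = [
    date(1976, 1, 6), date(1976, 1, 6), date(1976, 1, 13),
    date(1977, 1, 4), date(1977, 1, 4), date(1977, 1, 4),
]

# Count patents per (calendar year, ISO week number).
weekly_counts = Counter((d.year, d.isocalendar()[1]) for d in issue_dates)

# Collect the weekly counts for each year, ready for one boxplot per year.
counts_by_year = defaultdict(list)
for (year, _week), n in sorted(weekly_counts.items()):
    counts_by_year[year].append(n)

print(dict(counts_by_year))
```

The resulting per-year lists of weekly counts are the values a boxplot routine would consume.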

Figure 1: Distributions of number of patents issued weekly for the first 8 weeks between 1976 and 1980, inclusive.

Second, let us ask which classes of patents are growing the fastest. The USPTO reports classifications based on the International Patent Classification (IPC); this WIPO system hierarchically classifies patents according to their relevant scientific discipline and is updated annually. [37] From our dataset of patents from the first 8 weeks of each of 5 years, we count the most commonly appearing classes and visualize the 10 most frequent results in Figure 2. The C class corresponds to “Chemistry; Metallurgy”, within which C07 corresponds to “Organic Chemistry”, within which C07D corresponds to “Heterocyclic Compounds” and C07C corresponds to “Acyclic or Carbocyclic Compounds”. Thus, the two most frequent IPC classes in our dataset were both related to the field of organic chemistry.
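A minimal sketch of this counting step is shown below. It assumes that the IPC column holds semicolon-delimited codes and that the first four characters of a code identify its subclass (for example, C07D). The sample strings are invented for illustration and may not match the exact formatting of codes in the USPTO files.

```python
from collections import Counter

# Illustrative IPC column values; multiple classes per patent are separated by semicolons.
ipc_column = [
    "C07D 23900;C07C 10300",
    "C07D 40104",
    "A61K 3100;C07D 23900",
    "",  # a patent with no recorded IPC class
]


def subclass(code):
    """Keep the first four characters of a code, e.g. 'C07D' (assumed layout)."""
    return code.strip()[:4]


counts = Counter()
for cell in ipc_column:
    # A set avoids counting the same subclass twice for a single patent.
    counts.update({subclass(code) for code in cell.split(";") if code.strip()})

top10 = counts.most_common(10)
print(top10)
```

Counting each subclass at most once per patent is a design choice; counting every listed code instead would weight patents with many classifications more heavily.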

Third, we can continue our exploration of the IPC classes by asking whether patents in different classes undergo reviews of different durations prior to issuance. After a patent application is submitted, it must first be manually approved and classified before being granted. We denote this time between submission and issuance as the lag time. For the sake of simplicity, we filter out patents that do not belong to one of the 10 classes shown in Figure 2. We calculate the lag time between application and issuance, then visualize the results as boxplots analogous to Figure 1. These results are shown in Figure 3. The lag time distribution for each of the 10 IPC classes is skewed right with a median duration under 2 years.
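The lag time calculation can be sketched as follows. The records are invented for illustration, and representing each patent as an (IPC subclass, application date, issue date) tuple is an assumption made for brevity rather than the structure returned by the packages.

```python
from collections import defaultdict
from datetime import date
from statistics import median

# Illustrative records: (IPC subclass, application date, issue date).
records = [
    ("C07D", date(1974, 3, 1), date(1976, 1, 6)),
    ("C07D", date(1975, 1, 6), date(1976, 1, 6)),
    ("C07C", date(1974, 1, 8), date(1976, 1, 6)),
]

# Lag time in days between application and issuance, grouped by class.
lag_days = defaultdict(list)
for ipc, applied, issued in records:
    lag_days[ipc].append((issued - applied).days)

# Median lag per class, converted to years for comparison with the 2-year mark.
median_lag_years = {ipc: median(days) / 365.25 for ipc, days in lag_days.items()}
print(median_lag_years)
```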

Fourth, we can ask whether the lag time distribution changes across years. To answer this question, we calculate the lag time between application and issuance, this time without filtering out any patents from our collected dataset, and visualize the results as boxplots (Figure 4). Here again, we note that the distributions are skewed right with medians under 2 years. However, we also note that the boxplot for 1980 is shifted slightly upward. Extending the analysis to later years may reveal a trend of increasing lag time. This research question remains open for interested readers.
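One way to check for such a shift numerically, rather than by eye, is to compare the quartiles that underlie each boxplot. The sketch below uses invented lag times in days, and the quartile method is the default of `statistics.quantiles`, which may differ slightly from the convention used by a given plotting library.

```python
from statistics import quantiles

# Illustrative lag times in days, grouped by issue year (not real data).
lag_days_by_year = {
    1976: [300, 400, 500, 600, 1200],
    1980: [350, 450, 560, 700, 1300],
}

summary = {}
for year, lags in lag_days_by_year.items():
    q1, med, q3 = quantiles(lags, n=4)  # the three quartile cut points
    summary[year] = {
        "q1": q1,
        "median": med,
        "q3": q3,
        # A longer upper half of the box suggests a right-skewed distribution.
        "right_skewed": (q3 - med) > (med - q1),
    }

# Difference in median lag between the last and first year, in days.
median_shift = summary[1980]["median"] - summary[1976]["median"]
print(summary, median_shift)
```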

Figure 2: Number of patents issued in each International Patent Classification (IPC) class. Only counts for the 10 most frequent classes are visualized. Only patents issued in the first 8 weeks of the years between 1976 and 1980, inclusive, are included.

Implementation and architecture

The patentpy package was implemented in Python 3 and uses pybind11 to incorporate C++. [3] The C++ code parses USPTO files with the TXT extension and the lxml library parses USPTO XML files. [4] The data frame structure provided by the pandas library encapsulates rectangular data returned to the user. [5] The patentr package was implemented in R 4 and uses Rcpp to incorporate C++. [6] The C++ code parses USPTO files with the TXT extension and the xml2 library parses USPTO XML files. [7] Tibbles form the core data structure for patentr to store and return rectangular data to the user. Tibble functionality depends on the tidyverse set of R packages. [8]

Both packages extract the same data for each patent and format it the same way. Specifically, both return a rectangular data object with columns representing a unique patent identifier (WKU), title, application date, issue date, inventor(s), assignee(s), IPC class(es), reference(s), and claim(s). Additionally, multiple values in the same column (e.g. multiple references) are delimited by semicolons in both programs, except for claims about the patent’s novelty made by the preparer, for which text formatting is preserved. Thus, when returned data from each package is saved as a comma-separated values (CSV) file, outputs from both packages should be roughly equivalent.
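To illustrate how a saved CSV file in this layout might be read back, the sketch below splits semicolon-delimited columns into lists while leaving the claims text untouched. The column names and the sample row are assumptions made for illustration; the actual headers should be taken from the packages' output.

```python
import csv
import io

# Hypothetical column names and values, standing in for a saved output file.
sample_csv = io.StringIO(
    "WKU,title,inventors,references,claims\n"
    '"039305848","Example title","Jane Doe;John Roe","3000000;3111111",'
    '"1. A device; comprising a part."\n'
)

# Columns holding several values per patent; claims are deliberately excluded
# because their original text formatting is preserved.
MULTI_VALUE_COLUMNS = ("inventors", "references")

rows = []
for row in csv.DictReader(sample_csv):
    for col in MULTI_VALUE_COLUMNS:
        row[col] = [value.strip() for value in row[col].split(";") if value.strip()]
    rows.append(row)

print(rows[0]["references"], rows[0]["claims"])
```

Reading the identifier as a string rather than a number keeps any leading zeros intact.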

The main entry point of the programmatic interface for each package is the get_bulk_patent_data function, which downloads, parses, formats, and returns USPTO bulk patent data based on the USPTO Green, Yellow, and Red books. [9] Due to the portability of Python and R, both packages are available across multiple operating systems and architectures. [10, 11]

Figure 3: Distributions for time between patent application date and issue date, grouped by IPC class. Only patents issued in the first 8 weeks of the years between 1976 and 1980, inclusive, were included.

Quality control

Automated unit testing is implemented for patentpy, supported by Travis CI continuous integration. Package functionality has been checked on the Windows, Mac, Ubuntu, and Debian operating systems. Runtime errors are caught and returned with appropriate error messages based on error type. Code coverage is tracked with Codecov. Users can manually confirm package functionality by locally running sample code provided in the README file or by downloading and locally running test scripts, both of which are provided in patentpy’s code repository.

Automated unit testing is implemented for patentr, supported by Travis CI and AppVeyor CI continuous integration. Package functionality has been checked on the Windows, Mac, Ubuntu, and Debian operating systems, with regular checking also completed by the Comprehensive R Archive Network (CRAN). Code coverage is tracked with Codecov. The package is automatically checked with unit tests upon installation from CRAN. Users can manually confirm package functionality by locally running sample code provided in package documentation and the repository README file or by locally running unit tests using the testthat package. [18]

Availability and dependencies

patentpy is available in the Python programming language (version ) via the Python Package Index (PyPI) and patentr is available in the R programming language (version ) via the Comprehensive R Archive Network (CRAN); the source code of each package is publicly available at and, respectively. patentpy depends on pandas (version ) and lxml (version ). [4, 5] patentr depends on covr (version ), dplyr (version ), knitr (version ), lubridate (version ), magrittr (version ), progress (version ), Rcpp (version ), readr (version ), rlang (version ), rmarkdown (version ), testthat (version ), utils (version ), and xml2 (version ). [12, 8, 13, 14, 15, 6, 16, 17, 7]

Reuse potential

Given the large amount of rich data available for patents and the significant financial value that patents can hold, multiple research avenues have already been pursued with tools analogous to patentpy and patentr. However, with their introduction to the research literature, each individual project should no longer require writing an entire code base, saving time and resources. The explored avenues of research range from studying international patent families [19] and the network structure of patent citations [20] to patent protection for new forms of technology [21, 22, 23, 24] and management of patent portfolios. [25, 26] Although USPTO bulk data files are identified by patent publication date, the patentpy and patentr packages could be used for data collection in any of the aforementioned research avenues.

Within the field of scientometrics, further analysis of co-citation networks would be useful in the sub-field of bibliometrics. Studying innovation within specific scientific fields over time and identifying patents crucial to the development of new technologies would also be useful. In particular, identifying prominent patent suites and oligopolies of corporate patent assignees within scientific fields could result in illuminating research at the intersection of scientometrics and economics. Outside of academia, scientists in industry could use patent data to study innovation and guide research and development efforts with evidence-based predictions. The aforementioned potential pathways would all benefit from the use of either patentpy or patentr for data collection purposes.

The presented packages are well complemented by available software for network analysis, topological data analysis, and machine learning. As previously mentioned, patent data can be structured as a citation network or a co-citation network. These structures can then be analyzed with the existing foundation of network and graph theory to gain insight into the evolution of the patent network over time and the change in dynamic metrics on the patent network. The igraph package implements this functionality and would work well with patentpy and patentr in a single pipeline. [27] Topological data analysis is a growing field, with an increasing number of applications of the Mapper algorithm [33] and persistent homology [34] appearing in the scientific literature. Analyzing the topology of patent networks could be completed well with packages like scikit-tda in Python and TDAstats in R. [28, 29]
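As a small sketch of how reference lists could be turned into citation and co-citation counts before handing them to a dedicated network library, consider the following. It uses only the Python standard library rather than igraph, and the patent identifiers are placeholders, not real WKU values.

```python
from collections import Counter
from itertools import combinations

# Illustrative mapping: citing patent -> patents it references (placeholder identifiers).
references = {
    "P3": ["P1", "P2"],
    "P4": ["P1", "P2"],
    "P5": ["P1"],
}

# Citation network in-degree: how often each patent is cited.
citation_counts = Counter(cited for refs in references.values() for cited in refs)

# Co-citation network: two patents are linked each time one patent cites both.
cocitation = Counter()
for refs in references.values():
    for a, b in combinations(sorted(set(refs)), 2):
        cocitation[(a, b)] += 1

print(citation_counts, cocitation)
```

The keys of `cocitation` form a weighted edge list, which is a common input format for graph libraries.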

Machine learning pipelines have the ability to incorporate network analytics and topological features; even without these, machine learning has high potential use with patent data. Supervised learning and natural language processing can be combined to automate classification of patents based on title, references, and claim text. Reinforcement learning can be combined with ranking algorithms to measure the importance of a patent over time. [35, 36] Even unsupervised learning can be applied to find patent clusters across IPC classes and identify potential relationships between otherwise unrelated patents. The scikit-learn package in Python and the caret and tidymodels packages in R would be candidate packages to perform such analyses on data acquired via the packages introduced in this report. [30, 31, 32] Useful extensions and improvements to patentpy and patentr would include the addition of a shared interface to coordinate pipelines across the aforementioned packages, depending on the technique used to answer a research question.

Contributions to patentpy and patentr are welcome and can be made via the code repository for each package. Contributors can open pull requests to incorporate additions or changes. Issues can also be opened to make feature requests and identify bugs. The authors of this report bear responsibility for maintaining patentpy and patentr, and we welcome community involvement. Support mechanisms include continuous integration with the Travis and AppVeyor tools to validate package function over time, particularly as the programming languages and software dependencies evolve.

Figure 4: Distributions for time between patent application date and issue date, grouped by issue year. Only patents issued in the first 8 weeks of the years between 1976 and 1980, inclusive, were included.


This project was supported by the NIH R37 grant CA244613. The authors declare that they have no competing interests.