The United States Patent and Trademark Office (USPTO) hosts bulk data files for all patents published since 1976. This data holds interesting insights into a large number of fields but is unfortunately difficult to access at scale. Files for each week can be hundreds of megabytes in size; additionally, files from different years are sometimes formatted differently. As a result of these challenges, potential users face a high barrier to entry before they can take advantage of the available data. The patentpy and patentr packages simplify accessing USPTO data by providing programmatic interfaces that return it in a single rectangular, tidy format and significantly reduce its storage size.
The patentpy package provides a Python interface with the functionality implemented using a combination of Python and C++. The patentr package does the same with an R interface. Since the two packages share functionality, they also share a C++ code base, with each depending on existing XML libraries in the respective language. Of note, Python and R both boast portability across multiple operating systems, making the corresponding packages easily available to a large number of users.
Currently, the most authoritative reference for patent analytics is the World Intellectual Property Organization (WIPO) Manual. According to the WIPO Manual, multiple databases host patent data for exploration. These include The Lens, Patentscope, espacenet, LATIPAT, EPO Open Patent Services, and Google Patents, among others. Most of these web services provide user-friendly, graphical point-and-click interfaces. While these interfaces are initially easier to use, the time and effort they require scale proportionately with the number and complexity of searches conducted. Thus, these services become cumbersome when attempting to access large amounts of patent data spanning many years. In contrast, a programmatic interface requires a higher initial investment, but each additional search then costs little extra effort. This is the specific niche that patentpy and patentr address.
To demonstrate the utility of these packages, we will collect an arbitrarily selected set of patent data and conduct a preliminary analysis to answer four research questions: (1) how many patents are issued weekly; (2) which IPC classes grow fastest; (3) which IPC classes take longest to issue after application submission; and (4) whether the time between application and issuance changes over time. We hope these examples highlight that these packages can be effectively combined with available software packages to rapidly answer questions of interest. For the arbitrary dataset, we will pretend that the first 8 weeks of a year have special significance, and that we would like to focus our attention on these weeks for the first 5 years of available patent data. Collecting this data takes a single line of code for both packages using get_bulk_patent_data.
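The sample dataset described above (the first 8 weeks of each of the first 5 years, 1976 through 1980) amounts to 40 year/week pairs. The following minimal sketch only enumerates those pairs. The variable names are illustrative, and the way the pairs are passed to get_bulk_patent_data (for example, as parallel lists) is an assumption that should be checked against each package's documentation.

```python
from itertools import product

# First 5 years of available data and first 8 weeks of each year,
# as described in the text (1976 through 1980, weeks 1 through 8).
YEARS = range(1976, 1981)
WEEKS = range(1, 9)

# Every (year, week) combination, in chronological order.
year_week_pairs = list(product(YEARS, WEEKS))

# Parallel lists with one entry per weekly file. This calling convention
# is an assumption; the real argument format is defined by each package.
years = [year for year, _ in year_week_pairs]
weeks = [week for _, week in year_week_pairs]

print(len(year_week_pairs))  # number of weekly files to request
print(year_week_pairs[:3])   # first few pairs
```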
First, let us ask how many patents were issued weekly in the first 8 weeks of each year. We process the data by extracting the year from each patent’s issue date, aggregating and counting the number of patents issued each week, and feeding the resulting data to a visualization library. A boxplot of the weekly aggregated patent issue counts split by year results in Figure 1. We see that roughly 1100-1400 patents were consistently granted in the first 8 weeks of the year between 1976 and 1980, inclusive.
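The aggregation step behind Figure 1 can be sketched as follows. This is an illustration on toy records using only the Python standard library, not the code used to produce the figure; the field name issue_date is a hypothetical stand-in for the issue date column, and weeks are identified by ISO calendar week, which can differ from the calendar year for dates near a year boundary.

```python
from collections import Counter, defaultdict
from datetime import date

# Toy records; "issue_date" is a hypothetical field name standing in
# for the issue date column returned by the packages.
patents = [
    {"wku": "A1", "issue_date": date(1976, 1, 6)},
    {"wku": "A2", "issue_date": date(1976, 1, 6)},
    {"wku": "A3", "issue_date": date(1976, 1, 13)},
    {"wku": "B1", "issue_date": date(1977, 1, 4)},
]


def weekly_counts(rows):
    """Count patents per (ISO year, ISO week) of the issue date."""
    counts = Counter()
    for row in rows:
        iso_year, iso_week, _ = row["issue_date"].isocalendar()
        counts[(iso_year, iso_week)] += 1
    return counts


def counts_by_year(weekly):
    """Group weekly counts by year: one list of counts per box in a boxplot."""
    grouped = defaultdict(list)
    for (year, _week), n in sorted(weekly.items()):
        grouped[year].append(n)
    return dict(grouped)


weekly = weekly_counts(patents)
per_year = counts_by_year(weekly)
print(per_year)
```

Each list in per_year holds the weekly counts for one year and could be handed to any plotting library to draw one box per year.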
Second, let us ask which classes of patents are growing the fastest. The USPTO reports the International Patent Classification (IPC) of each patent; this WIPO system hierarchically classifies patents according to their relevant scientific discipline and is updated annually. From our dataset of patents from the first 8 weeks of each of 5 years, we count the most commonly appearing classes and visualize the 10 most frequent results in Figure 2. The C class corresponds to “Chemistry; Metallurgy”, within which C07 corresponds to “Organic Chemistry”, within which C07D corresponds to “Heterocyclic Compounds” and C07C corresponds to “Acyclic or Carbocyclic Compounds”. Thus, the two fastest growing IPC classes in our dataset were both related to the field of organic chemistry.
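The class-counting step can be sketched as below, relying on the packages' convention of delimiting multiple values with semicolons. The column name ipc_class, the toy codes, and the assumption that the first four characters of a code identify the subclass (such as C07D) are illustrative choices rather than details confirmed by the packages.

```python
from collections import Counter

# Toy rows; "ipc_class" is a hypothetical column name, and treating the
# first four characters of each code as the subclass is an assumption.
patents = [
    {"wku": "A1", "ipc_class": "C07D 20930;C07C 10300"},
    {"wku": "A2", "ipc_class": "C07D 21360"},
    {"wku": "A3", "ipc_class": "A61K 3100;C07D 40112"},
    {"wku": "A4", "ipc_class": ""},
]


def top_ipc_subclasses(rows, n=10):
    """Return the n most frequent IPC subclasses as (subclass, count) pairs."""
    counts = Counter()
    for row in rows:
        codes = [c.strip() for c in row["ipc_class"].split(";") if c.strip()]
        # A set ensures each subclass is counted at most once per patent.
        counts.update({code[:4] for code in codes})
    return counts.most_common(n)


top = top_ipc_subclasses(patents)
print(top)
```

Counting each subclass once per patent is a design choice; counting every listed code instead would weight patents with many classifications more heavily.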
Third, we can continue our exploration of the IPC classes by asking whether patents in different classes undergo reviews of different durations prior to issuance. After a patent application is submitted, it must first be manually approved and classified before being granted. We denote this time between submission and issuance as the lag time. For the sake of simplicity, we filter out patents that do not belong to one of the 10 classes shown in Figure 2. We calculate the lag time between application and issuance, then visualize the results as boxplots analogous to Figure 1. These results are shown in Figure 3. The lag time distribution for each of the 10 IPC classes is skewed right with a median duration under 2 years.
Fourth, we can ask whether the lag time distribution changes across years. To answer this question, we calculate the lag time between application and issuance (this time without filtering out any patents from our collected database) and visualize the results as boxplots (Figure 4). Here again, we note that the distributions are skewed right with medians under 2 years. However, we also note that the boxplot for 1980 is slightly shifted upward. Following this forward may reveal a trend of increasing lag time as years progress. We leave this question open for interested readers.
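The lag time calculation used for both the per-class and per-year comparisons can be sketched with one small grouping helper. The field names application_date and issue_date and the toy dates are assumptions made for illustration; the median is reported here in place of a full boxplot.

```python
from collections import defaultdict
from datetime import date
from statistics import median

# Toy records; the field names are hypothetical stand-ins for the
# application date and issue date columns.
patents = [
    {"wku": "A1", "application_date": date(1975, 1, 6), "issue_date": date(1976, 1, 6)},
    {"wku": "A2", "application_date": date(1974, 1, 6), "issue_date": date(1976, 1, 6)},
    {"wku": "B1", "application_date": date(1976, 1, 4), "issue_date": date(1977, 1, 4)},
]


def lag_days(row):
    """Lag time: days elapsed between application and issuance."""
    return (row["issue_date"] - row["application_date"]).days


def median_lag_by(rows, key):
    """Median lag time in days for each group defined by key(row)."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(lag_days(row))
    return {group: median(lags) for group, lags in sorted(groups.items())}


# Group by issue year; passing a key that returns an IPC class instead
# would give the per-class comparison.
by_year = median_lag_by(patents, key=lambda row: row["issue_date"].year)
print(by_year)
```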
Implementation and architecture
The patentpy package was implemented in Python 3 and uses pybind11 to incorporate C++. The C++ code parses USPTO files with the TXT extension and the lxml library parses USPTO XML files. The data frame structure provided by the pandas library encapsulates rectangular data returned to the user. The patentr package was implemented in R 4 and uses Rcpp to incorporate C++. The C++ code parses USPTO files with the TXT extension and the xml2 library parses USPTO XML files. Tibbles form the core data structure for patentr to store and return rectangular data to the user. Tibble functionality depends on the tidyverse set of R packages.
Both packages extract the same data for each patent and format it the same way. Specifically, both return a rectangular data object with columns representing a unique patent identifier (WKU), title, application date, issue date, inventor(s), assignee(s), IPC class(es), reference(s), and claim(s). Additionally, multiple values in the same column (e.g. multiple references) are delimited by semicolons in both programs, except for claims about the patent’s novelty made by the preparer, for which text formatting is preserved. Thus, when returned data from each package is saved as a comma-separated values (CSV) file, outputs from both packages should be roughly equivalent.
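Because multi-valued columns are semicolon-delimited, a saved CSV file can be read back into per-patent lists with a few lines of code. The sketch below uses an in-memory CSV with hypothetical column names (wku, title, ipc_class, reference) and made-up values; the actual column names should be taken from the packages' output.

```python
import csv
import io

# In-memory stand-in for a saved CSV file. Column names and values are
# hypothetical; only the semicolon delimiter comes from the text.
CSV_TEXT = '''wku,title,ipc_class,reference
A1,Widget,C07D 20930;C07C 10300,US1;US2;US3
A2,"Gadget, improved",A61K 3100,
'''

# Columns assumed to hold multiple semicolon-delimited values.
MULTI_VALUED = ("ipc_class", "reference")


def read_patents(handle):
    """Read patent rows, splitting multi-valued columns into lists."""
    rows = []
    for row in csv.DictReader(handle):
        for column in MULTI_VALUED:
            values = row[column].split(";")
            row[column] = [v.strip() for v in values if v.strip()]
        rows.append(row)
    return rows


rows = read_patents(io.StringIO(CSV_TEXT))
print(rows[0]["reference"])
print(rows[1]["title"])
```

Claims are deliberately left out of MULTI_VALUED, since the text notes that their original formatting is preserved rather than delimited.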
The main entry point of the programmatic interface for each package is the get_bulk_patent_data function, which downloads, parses, formats, and returns USPTO bulk patent data based on the USPTO Green, Yellow, and Red books.  Due to the portability of Python and R, both packages are available across multiple operating systems and architectures. [10, 11]
Automated unit testing is implemented for patentpy, supported by Travis CI continuous integration. Package functionality has been checked on the Windows, Mac, Ubuntu, and Debian operating systems. Runtime errors are caught and returned with appropriate error messages based on error type. Code coverage is tracked with Codecov. Users can manually confirm package functionality by locally running sample code provided in the README file or by downloading and locally running test scripts, both of which are provided in patentpy’s code repository.
Automated unit testing is implemented for patentr, supported by Travis CI and AppVeyor CI continuous integration. Package functionality has been checked on the Windows, Mac, Ubuntu, and Debian operating systems, with regular checking also completed by the Comprehensive R Archive Network (CRAN). Code coverage is tracked with Codecov. The package is automatically checked with unit tests upon installation from CRAN. Users can manually confirm package functionality by locally running sample code provided in package documentation and the repository README file or by locally running unit tests using the testthat package.
Availability and dependencies
patentpy is available in the Python programming language (version ) via the Python Package Index (PyPI) and patentr is available in the R programming language (version ) via the Comprehensive R Archive Network (CRAN); the source code of each package is publicly available at https://github.com/JYProjs/patentpy and https://github.com/JYProjs/patentr, respectively. patentpy depends on pandas (version ) and lxml (version ). [4, 5] patentr depends on covr (version ), dplyr (version ), knitr (version ), lubridate (version ), magrittr (version ), progress (version ), Rcpp (version ), readr (version ), rlang (version ), rmarkdown (version ), testthat (version ), utils (version ), and xml2 (version ). [12, 8, 13, 14, 15, 6, 16, 17, 7]
Given the large amount of rich data available for patents and the significant financial value that patents can hold, multiple research avenues have already been pursued with tools analogous to patentpy and patentr. However, with their introduction to the research literature, each individual project should no longer require writing an entire base of code, saving time and resources. The explored avenues of research range from studying international patent families and the network structure of patent citations to patent protection for new forms of technology [21, 22, 23, 24] and management of patent portfolios. [25, 26] Although USPTO bulk data files are identified by patent publication date, the patentpy and patentr packages could be used for data collection in any of the aforementioned research avenues.
Within the field of scientometrics, further analysis of co-citation networks would be useful in the sub-field of bibliometrics. Studying innovation within specific scientific fields over time and identifying patents crucial to the development of new technologies would also be useful. In particular, identifying prominent patent suites and oligopolies of corporate patent assignees within scientific fields could result in illuminating research at the intersection of scientometrics and economics. Outside of academia, scientists in industry could use patent data to study innovation and guide research development efforts based on evidence-based predictions. The aforementioned potential pathways would all benefit from the use of either patentpy or patentr for data collection purposes.
The presented packages are well complemented by available software for network analysis, topological data analysis, and machine learning. As previously mentioned, patent data can be structured as a citation network or a co-citation network. These structures can then be analyzed with the existing foundation of network and graph theory to gain insight into the evolution of the patent network over time and the change in dynamic metrics on the patent network. The igraph package implements this functionality and would work well with patentpy and patentr in a single pipeline. Topological data analysis is a growing field, with an increasing number of applications of the Mapper algorithm and persistent homology appearing in the scientific literature. Analyzing the topology of patent networks could be completed well with packages like scikit-tda in Python and TDAstats in R. [28, 29]
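To make the two network structures concrete, the following sketch builds a citation edge list and co-citation counts from a toy mapping of citing patents to the patents they reference. Two patents are co-cited whenever a third patent cites both. The patent identifiers are made up, and in practice the mapping would be derived from the semicolon-delimited reference column.

```python
from collections import Counter
from itertools import combinations

# Toy mapping: citing patent -> patents it references (identifiers made up).
citations = {
    "P4": ["P1", "P2"],
    "P5": ["P1", "P2", "P3"],
    "P6": ["P3"],
}


def citation_edges(cites):
    """Directed edges (citing, cited) of the citation network."""
    return [(src, dst) for src, refs in cites.items() for dst in refs]


def cocitation_counts(cites):
    """Count how often each pair of patents is cited together."""
    counts = Counter()
    for refs in cites.values():
        # Sorting gives each unordered pair a single canonical key.
        for a, b in combinations(sorted(set(refs)), 2):
            counts[(a, b)] += 1
    return counts


edges = citation_edges(citations)
cocited = cocitation_counts(citations)
print(edges)
print(cocited.most_common())
```

An edge list or weighted pair count of this form is the kind of input that general-purpose network analysis software works with.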
Machine learning pipelines have the ability to incorporate network analytics and topological features; even without these, machine learning has high potential use with patent data. Supervised learning and natural language processing can be combined to automate classification of patents based on title, references, and claim text. Reinforcement learning can be combined with ranking algorithms to measure the importance of a patent over time.[35, 36]
Even unsupervised learning can be applied to find patent clusters across IPC classes and identify potential relationships between otherwise unrelated patents. The scikit-learn package in Python and the caret and tidymodels packages in R would be candidate packages to perform such analyses on data acquired via the packages introduced in this report.[30, 31, 32] Useful extensions and improvements to patentpy and patentr would include the addition of a shared interface to coordinate pipelines across the aforementioned packages, depending on the technique used to answer a research question.
Contributions to patentpy and patentr are welcome and can be made via the code repository for each package. Contributors can open pull requests to incorporate additions or changes. Issues can also be opened to make feature requests and identify bugs. The authors of this report bear responsibility for maintaining patentpy and patentr, and we welcome community involvement. Support mechanisms include continuous integration with the Travis and AppVeyor tools to validate package function over time, particularly as the programming languages and software dependencies evolve.
This project was supported by the NIH R37 grant CA244613. The authors declare that they have no competing interests.
-  Wickham, H 2014 Tidy Data. Journal of Statistical Software 59(10): 1-23. DOI: https://doi.org/10.18637/jss.v059.i10.
-  Oldham, P, Kitsara, I 2016 The WIPO Manual on Open Source Patent Analytics. https://wipo-analytics.github.io.
-  Jakob, W, Rhinelander, J, Moldovan, D 2017 pybind11 - Seamless Operability Between C++ and Python. https://github.com/pybind/pybind11.
-  Behnel, S, Faassen, M, Bicking, I 2005 lxml: XML and HTML with Python. https://github.com/lxml/lxml.
-  McKinney, W 2010 Data structures for statistical computing in python. Proceedings of the 9th Python in Science Conference 445: 51. DOI:
-  Eddelbuettel, D, François, R 2011 Rcpp: Seamless R and C++ Integration. Journal of Statistical Software 40(8): 1. DOI: https://doi.org/10.18637/jss.v040.i08.
-  Wickham, H, Hester, J, Ooms, J 2020 xml2: Parse XML. Comprehensive R Archive Network. https://CRAN.R-project.org/package=xml2.
-  Wickham, H, Averick, M, Bryan, J, et al 2019 Welcome to the tidyverse. Journal of Open Source Software 4(43): 1686. DOI: https://doi.org/10.21105/joss.01686.
-  XML Resources | USPTO 2021 https://www.uspto.gov/learning-and-resources/xml-resources.
-  Python Core Team 2015 Python: A Dynamic, Open Source Programming Language. Python Software Foundation https://www.python.org.
-  R Core Team 2014 R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing http://www.R-project.org.
-  Hester, J 2020 covr: Test Coverage for Packages. Comprehensive R Archive Network. https://CRAN.R-project.org/package=covr.
-  Xie, Y 2015 Dynamic Documents with R and knitr, 2nd ed. Chapman and Hall/CRC.
-  Bache, S M, Wickham, H 2020 magrittr: A Forward-Pipe Operator for R. Comprehensive R Archive Network. https://CRAN.R-project.org/package=magrittr.
-  Csardi, G, FitzJohn, R 2020 progress: Terminal Progress Bars. Comprehensive R Archive Network. https://CRAN.R-project.org/package=progress.
-  Henry, L, Wickham, H 2021 rlang: Functions for Base Types and Core R and ‘Tidyverse’ Features. Comprehensive R Archive Network. https://CRAN.R-project.org/package=rlang.
-  Xie, Y, Allaire, J J, Grolemund, G 2018 R Markdown: The Definitive Guide. Chapman and Hall/CRC. URL: https://bookdown.org/yihui/rmarkdown.
-  Wickham, H 2011 testthat: Get Started with Testing. R Journal 3: 5. DOI: https://doi.org/10.32614/RJ-2011-002.
-  Dechezleprêtre, A, Ménière, Y, Mohnen, M 2017 International Patent Families: From Application Strategies to Statistical Indicators. Scientometrics 111(1): 793. DOI: https://doi.org/10.1007/s11192-017-2311-4.
-  Bruck, P, Réthy, I, Szente, J, Tobochnik, J, Érdi, P 2016 Recognition of Emerging Technology Trends: Class-selective Study of Citations in the U.S. Patent Citation Network. Scientometrics 107: 1465. DOI: https://doi.org/10.1007/s11192-016-1899-0.
-  Sherkow, J S 2017 Patent Protection for Microbial Technologies. FEMS Microbiol Lett 364(20). DOI: https://doi.org/10.1093/femsle/fnx205.
-  Singh, R, Brumlik, C, Vaidya, M, Choudhury, A 2020 A Patent Review on Nanotechnology-based Nose-to-Brain Drug Delivery. Recent Pat Nanotechnol 14(3): 174. DOI: https://doi.org/10.2174/1872210514666200508121050.
-  Lahrtz, F 2015 How to Successfully Patent Therapeutic Antibodies. J Biomol Screen 20(4): 484. DOI: https://doi.org/10.1177/1087057114567457.
-  Krauß, J, Kuttenkeuler, D 2021 When to File for a Patent? The Scientist’s Perspective. N Biotechnol 60: 124. DOI: https://doi.org/10.1016/j.nbt.2020.10.006.
-  Conegundes de Jesus, C K, Salerno, M S 2018 Patent Portfolio Management: Literature Review and a Proposed Model. Expert Opin Ther Pat 28(6): 505. DOI: https://doi.org/10.1080/13543776.2018.1472238.
-  Weingarten, M D, Cyr, S K 2019 Securing and Maintaining a Strong Patent Portfolio for Pharmaceuticals. ACS Med Chem Lett 10(6): 838. DOI:
-  Csardi, G, Nepusz, T 2006 The igraph Software Package for Complex Network Research. InterJournal Complex Systems: 1695.
-  Saul, N, Tralie, C 2019 Scikit-TDA: Topological Data Analysis for Python. DOI: https://doi.org/10.5281/zenodo.2533369.
-  Wadhwa, R R, Williamson, D F K, Dhawan, A, Scott, J G 2018 TDAstats: R Pipeline for Computing Persistent Homology in Topological Data Analysis. Journal of Open Source Software 3(28): 860. DOI: https://doi.org/10.21105/joss.00860.
-  Pedregosa, F, Varoquaux, G, Gramfort, A, et al 2011 Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12: 2825.
-  Kuhn, M 2021 caret: Classification and Regression Training. Comprehensive R Archive Network. https://CRAN.R-project.org/package=caret.
-  Kuhn, M, Wickham, H 2020 Tidymodels: A Collection of Packages for Modeling and Machine Learning Using tidyverse Principles. Comprehensive R Archive Network. https://CRAN.R-project.org/package=tidymodels.
-  Singh, G, Memoli, F, Carlsson, G 2007 Topological Methods for the Analysis of High Dimensional Data Sets and 3D Object Recognition. Eurographics Symposium on Point-Based Graphics. DOI: https://doi.org/10.2312/SPBG/SPBG07/091-100.
-  Carlsson, G 2009 Topology and data. Bulletin of the American Mathematical Society 46: 255-308. DOI: https://doi.org/10.1090/S0273-0979-09-01249-X.
-  Beltz, H, Fulop, A, Wadhwa, R R, Erdi, P 2017 From Ranking and Clustering of Evolving Networks to Patent Citation Analysis. 2017 International Joint Conference on Neural Networks: 1388-1394. DOI: https://doi.org/10.1109/IJCNN.2017.7966015.
-  Beltz, H, Rutledge, T, Wadhwa, R R, Bruck, P, Tobochnik, J, Fulop, A, Fenyvesi, G, Erdi, P 2019 Ranking Algorithms: Application for Patent Citation Network. In: Information Quality in Information Fusion and Decision Making: 519-538.
-  International Patent Classification (IPC). World Intellectual Property Organization. https://www.wipo.int/classifications/ipc/en.