Grappling with the Scale of Born-Digital Government Publications: Toward Pipelines for Processing and Searching Millions of PDFs

Official government publications are key sources for understanding the history of societies. Web publishing has fundamentally changed the scale and processes by which governments produce and disseminate information. Significantly, a range of web archiving programs have captured massive troves of government publications. For example, libraries have to date archived hundreds of millions of unique U.S. Government documents posted to the web in PDF form. Yet these PDFs remain largely unused and understudied, in part because of the challenges of developing scalable pipelines for searching and analyzing them. This paper uses a Library of Congress dataset of 1,000 government PDFs to offer initial approaches for searching and analyzing such PDFs at scale. In addition to demonstrating the utility of PDF metadata, it offers computationally efficient machine learning approaches to search and discovery that also draw on the PDFs' textual and visual features. We conclude by detailing how these methods can be operationalized at scale to support systems for navigating millions of PDFs.
