Story Beyond the Eye: Glyph Positions Break PDF Text Redaction

06/05/2022
by Maxwell Bland, et al.

In the past, redaction involved the use of black or white markers or paper cut-outs to obscure content on physical paper. Today, many redactions take place on digital PDF documents, and redaction is often performed by software tools. Typical redaction tools remove text from PDF documents and draw a black or white rectangle in its place, mimicking a physical redaction. This practice is thought to be secure when the redacted text is removed and cannot be "copy-pasted" from the PDF document. We find this common conception is false: existing PDF redactions can be broken by precise measurements of non-redacted character positioning information. We develop a deredaction tool for automatically finding and breaking these vulnerable redactions. We report on 11 different redaction tools, finding that the majority do not remove redaction-breaking information, including some Adobe Acrobat workflows. We empirically measure the information leaks, finding that some redactions leak upwards of 15 bits of information, creating a 32,768-fold reduction in the space of potential redacted texts. We demonstrate a lower bound on the impact of these leaks via a 22,120-document study, including 18,975 Office of the Inspector General (OIG) investigation reports, in which we find 769 vulnerable named-entity redactions. We find that leaked information reduces the possible contents of 164 of these redacted names to fewer than 494 candidates from a 7 million name dictionary. We show the impact of these findings by breaking redactions from the Epstein/Maxwell case, the Manafort case, and a released Snowden document. Moreover, we develop an efficient algorithm for locating copy-pastable redactions and find over 100,000 poorly redacted words in US court documents. Current PDF text redaction methods are insufficient for named-entity protection.
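To make the leak concrete, here is a minimal Python sketch of the underlying idea: if the positions of the surviving glyphs reveal how wide the removed text was, any candidate whose rendered width does not match can be discarded. This is an illustration only, not the authors' deredaction tool. The advance-width table, the candidate names, and the helper functions are all hypothetical, and a real PDF involves further positioning adjustments that are ignored here.

```python
import math

# Hypothetical glyph advance widths in 1/1000 em units (illustrative values,
# not read from any real font file).
ADVANCE = {
    "A": 667, "B": 667, "E": 667, "I": 278, "L": 556,
    "a": 556, "b": 556, "e": 556, "i": 222, "l": 222,
    "n": 556, "o": 556, "v": 500,
}

# Hypothetical candidate dictionary of names.
NAMES = ["Ann", "Bob", "Eve", "Ian", "Lil"]


def text_width(text, advances=ADVANCE):
    """Sum the advance widths of each glyph in the text."""
    return sum(advances[ch] for ch in text)


def filter_candidates(candidates, observed_width, tolerance=0):
    """Keep only candidates whose width matches the width of the gap."""
    return [
        name for name in candidates
        if abs(text_width(name) - observed_width) <= tolerance
    ]


def bits_leaked(n_before, n_after):
    """Bits of information implied by shrinking the candidate set."""
    return math.log2(n_before / n_after)


def fold_reduction(bits):
    """Reduction factor in the candidate space for a given number of bits."""
    return 2 ** bits


if __name__ == "__main__":
    # Width of the gap left by the redaction, inferred from where the
    # non-redacted glyphs before and after it were positioned.
    print(filter_candidates(NAMES, 1390))   # ['Ian']
    print(filter_candidates(NAMES, 1779))   # ['Ann', 'Bob']
    print(fold_reduction(15))               # 32768
```

In this toy setup a width of 1390 units singles out one name of five, while a width of 1779 units leaves two candidates, so the leak narrows the search without always resolving it. The last line reproduces the arithmetic in the abstract: 15 leaked bits correspond to a 2^15 = 32,768-fold reduction.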
