Unsupervised Text Deidentification

10/20/2022
by John X. Morris, et al.

Deidentification seeks to anonymize textual data prior to distribution. Automatic deidentification primarily relies on supervised named entity recognition trained on human-labeled data. We propose an unsupervised deidentification method that masks words that leak personally identifying information. The approach uses a specially trained reidentification model that attempts to identify individuals from redacted personal documents. Motivated by k-anonymity-based privacy, we generate redactions that ensure a minimum reidentification rank for the correct profile of the document. To evaluate this approach, we consider the task of deidentifying Wikipedia biographies and evaluate using an adversarial reidentification metric. Compared to a set of unsupervised baselines, our approach deidentifies documents more completely while removing fewer words. Qualitatively, we see that the approach eliminates many identifying aspects that would fall outside the scope of common named-entity-based approaches.
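
The rank-based masking criterion described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the paper uses a trained reidentification model, whereas the code below substitutes a toy word-overlap score, and the greedy search, the tie handling, and every name in it (`overlap_score`, `true_rank`, `deidentify`, `MASK`, the example profiles) are assumptions made purely for illustration. Ties are counted against the reidentifier, so a document counts as protected at level k once at least k profiles score as high as the correct one.

```python
MASK = "<mask>"  # hypothetical mask token


def overlap_score(tokens, profile_words):
    """Toy stand-in for a reidentification model: count unmasked tokens found in a profile."""
    return sum(1 for t in tokens if t != MASK and t.lower() in profile_words)


def true_rank(tokens, profiles, true_id):
    """Rank of the correct profile, placing it last among profiles with equal score."""
    scores = {pid: overlap_score(tokens, words) for pid, words in profiles.items()}
    return sum(1 for s in scores.values() if s >= scores[true_id])


def margin(tokens, profiles, true_id, k):
    """Correct profile's score minus the (k-1)-th best other score; <= 0 means rank >= k."""
    scores = {pid: overlap_score(tokens, words) for pid, words in profiles.items()}
    others = sorted((s for pid, s in scores.items() if pid != true_id), reverse=True)
    return scores[true_id] - others[k - 2]


def deidentify(tokens, profiles, true_id, k):
    """Greedily mask one token at a time until the correct profile's rank is at least k."""
    if not 2 <= k <= len(profiles):
        raise ValueError("k must satisfy 2 <= k <= number of profiles")
    tokens = list(tokens)
    while margin(tokens, profiles, true_id, k) > 0:
        candidates = [i for i, t in enumerate(tokens) if t != MASK]
        # Mask the token whose removal leaves the smallest margin (first one on ties).
        best = min(
            candidates,
            key=lambda i: margin(tokens[:i] + [MASK] + tokens[i + 1:], profiles, true_id, k),
        )
        tokens[best] = MASK
    return tokens


# Hypothetical profiles, each reduced to a bag of lowercase words.
profiles = {
    "ada": {"mathematician", "english", "analytical", "engine", "london"},
    "alan": {"mathematician", "english", "enigma", "london"},
    "grace": {"mathematician", "american", "compiler"},
}
document = "English mathematician who wrote about the analytical engine".split()
redacted = deidentify(document, profiles, true_id="ada", k=2)
print(" ".join(redacted))
```

In this toy setting only the two words that separate the correct profile from its nearest competitor are masked, while generic words such as "mathematician" are kept. A learned reidentification model would replace `overlap_score`, and a more careful search could replace the greedy loop.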


