The Text Anonymization Benchmark (TAB): A Dedicated Corpus and Evaluation Framework for Text Anonymization

01/25/2022
by Ildikó Pilán, et al.

We present a novel benchmark and associated evaluation metrics for assessing the performance of text anonymization methods. Text anonymization, defined as the task of editing a text document to prevent the disclosure of personal information, currently suffers from a shortage of privacy-oriented annotated text resources, which makes it difficult to properly evaluate the level of privacy protection offered by various anonymization methods. This paper presents TAB (Text Anonymization Benchmark), a new, open-source annotated corpus developed to address this shortage. The corpus comprises 1,268 English-language court cases from the European Court of Human Rights (ECHR), enriched with comprehensive annotations of the personal information appearing in each document, including its semantic category, identifier type, confidential attributes, and co-reference relations. Compared to previous work, the TAB corpus is designed to go beyond traditional de-identification (which is limited to the detection of predefined semantic categories) and explicitly marks which text spans ought to be masked in order to conceal the identity of the person to be protected. Along with presenting the corpus and its annotation layers, we also propose a set of evaluation metrics specifically tailored to measuring the performance of text anonymization, in terms of both privacy protection and utility preservation. We illustrate the use of the benchmark and the proposed metrics by assessing the empirical performance of several baseline text anonymization models. The full corpus, along with its privacy-oriented annotation guidelines, evaluation scripts, and baseline models, is available at: https://github.com/NorskRegnesentral/text-anonymisation-benchmark
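
To make the two evaluation dimensions concrete, the following is a minimal sketch of how privacy protection and utility preservation might be approximated from span annotations. It is not the metric set proposed in the paper or the evaluation script shipped in the repository. It assumes, hypothetically, that gold annotations and system output are both given as character-offset spans, and it uses character-level recall as a rough proxy for privacy protection and character-level precision as a rough proxy for utility preservation. The names `Span`, `covered_chars`, and `mask_recall_precision` are illustrative only.

```python
from typing import Iterable, Set, Tuple

# Hypothetical span representation: (start, end) character offsets, end exclusive.
Span = Tuple[int, int]


def covered_chars(spans: Iterable[Span]) -> Set[int]:
    """Return the set of character positions covered by the given spans."""
    chars: Set[int] = set()
    for start, end in spans:
        chars.update(range(start, end))
    return chars


def mask_recall_precision(gold: Iterable[Span],
                          predicted: Iterable[Span]) -> Tuple[float, float]:
    """Character-level recall and precision of predicted masks against gold masks.

    Recall (share of gold characters that were masked) is a rough proxy for
    privacy protection; precision (share of masked characters that needed
    masking) is a rough proxy for utility preservation.
    """
    gold_chars = covered_chars(gold)
    pred_chars = covered_chars(predicted)
    overlap = len(gold_chars & pred_chars)
    # Convention assumed here: an empty denominator counts as a perfect score.
    recall = overlap / len(gold_chars) if gold_chars else 1.0
    precision = overlap / len(pred_chars) if pred_chars else 1.0
    return recall, precision


# Invented example sentence (not taken from the corpus).
text = "Mr John Smith, born in Oslo, applied in 1999."
gold_spans = [(3, 13), (23, 27)]       # "John Smith", "Oslo"
predicted_spans = [(3, 13), (40, 44)]  # "John Smith", "1999"

recall, precision = mask_recall_precision(gold_spans, predicted_spans)
print(f"recall={recall:.3f} precision={precision:.3f}")
```

In this toy case the system masks the name but misses the place and masks a year that the gold annotation leaves visible, so both scores fall below 1. A fuller evaluation would presumably also draw on the identifier types and co-reference relations that the corpus annotates, which this sketch ignores.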

