A Privacy-Preserving Approach to Extraction of Personal Information through Automatic Annotation and Federated Learning

05/19/2021
by Rajitha Hathurusinghe, et al.

We curated WikiPII, an automatically labeled dataset of Wikipedia biography pages annotated for personal information extraction. Although automatic annotation can introduce a high degree of label noise, it is inexpensive and can generate large volumes of annotated documents. We trained a BERT-based NER model on WikiPII and showed that, with an adequately large training dataset, the model can significantly decrease the cost of manual information extraction despite the high level of label noise. Following a similar approach, organizations can leverage text mining techniques to create customized annotated datasets from their historical data without sharing the raw data for human annotation. We also explore collaborative training of NER models through federated learning when the annotation is noisy. Our results suggest that, depending on the level of trust in the ML operator and the volume of available data, distributed training can be an effective way of training a personal information identifier in a privacy-preserving manner. Research material is available at https://github.com/ratmcu/wikipiifed.
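
To make the collaborative training idea concrete, the following is a minimal sketch of one aggregation round in which an ML operator combines model weights trained locally by several organizations, so that raw documents never leave their owners. The abstract does not say which aggregation rule was used; weighted federated averaging (FedAvg) is assumed here purely for illustration, and the function name `federated_average` and the flat dictionary-of-lists weight layout are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical weight layout: parameter name -> flat list of floats.
Weights = Dict[str, List[float]]


def federated_average(updates: List[Tuple[Weights, int]]) -> Weights:
    """Combine client weights, weighting each client by its example count.

    Each update is (locally trained weights, number of local training examples).
    Only weights are exchanged; the raw text stays with each client.
    """
    total = sum(n for _, n in updates)
    if total <= 0:
        raise ValueError("at least one client must contribute examples")

    averaged: Weights = {}
    for weights, n in updates:
        share = n / total  # this client's fraction of all training examples
        for name, values in weights.items():
            acc = averaged.setdefault(name, [0.0] * len(values))
            for i, v in enumerate(values):
                acc[i] += share * v
    return averaged


# Usage: two organizations with different data volumes (values are made up).
clients = [
    ({"w": [1.0, 2.0]}, 1),
    ({"w": [3.0, 6.0]}, 3),
]
global_weights = federated_average(clients)
```

In this sketch a client holding more annotated documents pulls the shared model further toward its own weights, which is one simple way the volume of available data can affect the outcome of distributed training.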


