Privacy at Scale: Introducing the PrivaSeer Corpus of Web Privacy Policies

04/23/2020
by Mukund Srinath, et al.

Organisations disclose their privacy practices by posting privacy policies on their websites. Even though users care about their digital privacy, they often do not read privacy policies, since reading them requires a significant investment of time and effort. Although natural language processing can help in understanding privacy policies, there has been a lack of large-scale privacy policy corpora that could be used to analyse, understand, and simplify them. We therefore create PrivaSeer, a corpus of over one million English-language website privacy policies, which is significantly larger than any previously available corpus. We design a corpus creation pipeline that consists of crawling the web followed by filtering documents using language detection, document classification, duplicate and near-duplicate removal, and content extraction. We investigate the composition of the corpus and show results from readability tests, document similarity analysis, and keyphrase extraction, and we explore the corpus through topic modeling.
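The duplicate and near-duplicate removal step in the pipeline can be illustrated with a small sketch. The abstract does not say which technique the authors use, so the following is an assumption for illustration only: exact duplicates are dropped by hashing whitespace-normalized text, and near-duplicates are dropped when the Jaccard similarity of their word 3-gram (shingle) sets reaches a threshold. The function names, the shingle size `k`, and the `threshold` value are all hypothetical.

```python
import hashlib
import re


def shingles(text, k=3):
    """Return the set of k-word shingles of a lowercased text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        # Short texts become a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(docs, threshold=0.8):
    """Keep the first copy of each document, dropping exact and near-duplicates.

    Illustrative assumption, not the authors' method: pairwise comparison
    is quadratic in the number of kept documents.
    """
    seen_hashes = set()
    kept = []
    kept_shingles = []
    for doc in docs:
        # Exact-duplicate check on whitespace-normalized, lowercased text.
        normalized = " ".join(doc.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue
        seen_hashes.add(digest)
        # Near-duplicate check against every document kept so far.
        sh = shingles(doc)
        if any(jaccard(sh, other) >= threshold for other in kept_shingles):
            continue
        kept.append(doc)
        kept_shingles.append(sh)
    return kept


policies = [
    "We collect your email address and name to provide our services to you.",
    "We collect your email address and name to provide our services to you.",
    "We collect your email address and name to provide our services to you!",
    "Cookies help us remember your preferences across visits.",
]
unique_policies = deduplicate(policies)
print(len(unique_policies))
```

Pairwise comparison like this would not be practical for a corpus of over a million documents; at that scale an approximate scheme such as MinHash with locality-sensitive hashing is the usual substitute, though again the abstract does not state what was actually used.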


