
PoliGraph: Automated Privacy Policy Analysis using Knowledge Graphs

10/13/2022
by   Hao Cui, et al.

Privacy policies disclose how an organization collects and handles personal information. Recent work has made progress in leveraging natural language processing (NLP) to automate privacy policy analysis and extract collection statements from different sentences, considered in isolation from each other. In this paper, we view and analyze, for the first time, the entire text of a privacy policy in an integrated way. In terms of methodology: (1) we define PoliGraph, a type of knowledge graph that captures different relations between different parts of the text in a privacy policy; and (2) we develop an NLP-based tool, PoliGraph-er, to automatically extract PoliGraph from the text. In addition, (3) we revisit the notion of ontologies, previously defined in heuristic ways, to capture subsumption relations between terms. We make a clear distinction between local and global ontologies to capture the context of individual privacy policies, application domains, and privacy laws. Using a public dataset for evaluation, we show that PoliGraph-er identifies 61% more collection statements than prior state-of-the-art, with over 90% precision. In terms of applications, PoliGraph enables automated analysis of a corpus of privacy policies and allows us to: (1) reveal common patterns in the texts across different privacy policies, and (2) assess the correctness of the terms as defined within a privacy policy. We also apply PoliGraph to: (3) detect contradictions in a privacy policy, where we show that most candidate contradictions flagged by prior work are false positives, and (4) analyze the consistency of privacy policies and network traffic, where we identify significantly more clear disclosures than prior work.


1 Introduction

Privacy Policies

Privacy laws, such as the General Data Protection Regulation (GDPR) [1], the California Consumer Privacy Act (CCPA) [2], and other state or sector-specific data protection laws, require organizations to disclose the personal information they collect, as well as how and why they use and share it. Privacy policies are the primary legally-binding way for organizations to disclose their data collection practices to the users of their products. As such, these privacy policies receive a lot of attention from many stakeholders, such as users who want to exercise their rights, developers who want to write privacy policies for their systems and be compliant with privacy laws, and law enforcement agencies who want to audit an organization’s actual data collection practices so that they can hold the organization accountable. Unfortunately, privacy policies are typically lengthy and complicated, making it hard not only for the average user to understand, but also for experts to analyze in depth and at scale [3].

Figure 1: Example of a privacy policy and analysis approaches. (a) The excerpt is from the policy of KAYAK [4], an online travel agency. It contains sections and lists, regarding: what is collected (data type), how it is used (purpose), who receives the information (entity), and references across sentences (e.g., “personal information” relates to other data types; “this information” refers to “location”). (b) Prior state-of-the-art work extracts elements found in each sentence, mainly data types and entities, as disconnected tuples. Purposes can also be extracted to extend the tuple [5, 6]. (c) PoliGraph is a knowledge graph that encodes data types, entities, and purposes; and two types of relations between them (collection and subsumption), possibly specified across different sentences. COLLECT edges represent that a data type is collected by an entity, while edge attributes represent the purposes of that collection. SUBSUME edges represent the relation between general and specific terms, for entities or data types.
NLP Analysis and Limitations

To address this challenge, as well as facilitate expert analysis [7] and crowdsourced annotation [8], the research community has recently applied natural language processing (NLP) to automate the analysis of privacy policies. State-of-the-art examples include the following: PolicyLint [9] extracts data types and entities that collect them, and analyzes potential contradictions within a privacy policy; PoliCheck [10] builds on PolicyLint and further compares the statements made in the privacy policy with the data collection practices observed in the network traffic; Polisis [11] and PurPliance [5] extract data collection purposes; and OVRseen [6] leverages PoliCheck and Polisis to associate data types, entities, and purposes. A detailed review of prior work can be found in Section 5. Although this body of work has shown promising results, these approaches come with certain limitations, which we discuss next.

First, they typically partition a privacy policy into sentences and process each sentence to extract information from that sentence, namely what is collected (data type), who collects it (entity), and for what purposes. Figure 1(a) depicts an example of a typical privacy policy. Existing approaches extract a tuple from each sentence independently, as shown in Figure 1(b), but are unable to link the extracted information from different sentences together to understand the full context. Other approaches, such as Polisis [11], process a privacy policy per text segment instead of per sentence, but still analyze each segment individually and thus also miss the full context.

Second, because of this incomplete context, prior work needs to map and relate the semantics of the terms across different sentences/tuples by introducing ontologies that encode subsumption relations between data types or entities. So far, these ontologies have been built in a manual or semi-automated fashion by domain experts, who define lists of terms commonly found in privacy policy text and other sources (e.g., network traffic), and subsumption relations between them (e.g., the term “device information” subsumes “IP address”). The resulting ontologies are not universal: they do not necessarily agree with all privacy policies and need to be adapted to different application domains, e.g., mobile  [9, 10, 5], smart speakers [12, 13], and VR [6]. As a result, they often generate ambiguous or wrong results that require further validation by experts. Manandhar et al. [14] recently reported that state-of-the-art analyzers [9, 10, 11] incorrectly reason about more than half of the privacy policies they analyzed. We further discuss the limitations of ontologies in Section 2.2.

The PoliGraph Framework

Our key observation is that a policy (for the rest of the paper, we refer to a privacy policy simply as a "policy") should be treated in its entirety, leveraging terms in different sentences that are related. To that end, we make the following methodological contributions.

First, we propose to extract and encode the information disclosed in a policy (i.e., what data types are collected, with what entities they are shared, and for what purposes) into a knowledge graph [15, 16], which we refer to as PoliGraph; Figure 1(c) shows an example. Nodes are terms representing data types or entities. The edges represent relations between nodes, e.g., an entity may collect a particular data type, and a more generic data type may subsume a more specific data type. An edge representing data collection may have an attribute indicating the purposes. In Figure 1(c), for example, "we" (the first party, i.e., KAYAK) collects "personal information" for the purposes "provide services" and "authenticate your account", while "travel partners" collect (or, more precisely, are disclosed with) "personal information". SUBSUME edges represent subsumption relations, e.g., "personal information" subsumes "location" and "device information", which in turn subsumes "IP address". We discuss these in Section 2.1.

Second, for policies that are not well written, the extracted PoliGraph may be missing subsumption relations between terms that are not fully defined in the policies. To supplement the missing relations, as in prior work [9, 10, 5], we use ontologies; however, we redefine and use them as follows. First, we consider the subsumption relations extracted from each individual policy as the local ontology that the policy defines. Next, we also define additional subsumption relations that encode external knowledge, beyond what is stated in the text of an individual policy; we refer to these as global ontologies. They can be defined by domain experts, by merging information from multiple policies, or by taking into account privacy laws (e.g., the data ontology derived from the CCPA [2] in Section 2.2).

Third, we present PoliGraph-er, a methodology and implementation that uses NLP to extract and build PoliGraph from the text of a policy. To that end, we have to overcome several technical challenges, including co-reference resolution, list parsing, phrase normalization, and purpose phrase classification, to extract and link more information than previously possible. We discuss these further in Section 3.

Performance and Applications

The proposed PoliGraph framework and PoliGraph-er tool improve over existing policy analyzers and enable new applications.

We evaluate PoliGraph-er on a public dataset from PoliCheck [10, 17], consisting of over 5K policies from over 11K Android mobile apps. Our manual validation shows that PoliGraph-er has over 90% precision in building PoliGraph edges. In terms of coverage, we show that PoliGraph-er can capture and analyze 61% more collection statements than PolicyLint [9]. This improvement is enabled by the PoliGraph representation, which can analyze statements spanning multiple sentences and sections.

PoliGraph enables two new types of automated analysis, which were not previously possible. First, PoliGraph is used to automatically summarize policies in our dataset to reveal common patterns across them. This is possible because PoliGraph represents each policy as a whole, rather than as isolated pieces of information; this allows inferences about collection statements. We find that 64% of policies disclose the collection of software identifiers and, in particular, cookies. Advertisers and analytics providers are major entities that collect such data. This is further reinforced by the finding that about half of the policies disclose data usage for non-core purposes, namely for advertising and analytics. However, we also find that the use of generic terms for data types (such as "personal information") is prevalent, often without more precise definitions. This reduces transparency and leaves the specific data types being collected unknown. Second, we observe that different policies may have different definitions of the same terms. By clearly separating local ontologies from global ones, PoliGraph allows us to assess the correctness of the definitions of these terms. For example, we find that many policies declare the collected data as "non-personal information", which contradicts common knowledge and our CCPA-based global data ontology (see Sections 2.2 and 4.3). We also find that non-standard terms are widely used, with varied definitions across different policies.

Furthermore, we apply PoliGraph to revisit two applications previously explored by prior work. First, we extend PoliGraph to include negative statements, and we show that the majority of candidate contradictions between statements in a policy, as identified by prior work, are most likely ambiguous or false alarms. Second, we analyze data flow-to-policy consistency and we find that sensitive data types are, in fact, clearly disclosed across different sentences and sections, in over half of the policies. These are missed by prior work.

Overview

The rest of the paper is structured as follows. Section 2 defines the proposed PoliGraph framework and the ontologies used with it. Section 3 describes the implementation of PoliGraph-er that uses NLP to build PoliGraph from the text of a policy. Section 4 presents the evaluation and applications of PoliGraph in policy analysis. Section 5 discusses related work. Finally, Section 6 concludes the paper.

2 The PoliGraph Framework

In this section, we introduce PoliGraph, our proposed representation of the entire text of a policy as a knowledge graph. We also revisit the related notion of ontologies, and we propose a new definition and use it with PoliGraph.

2.1 Defining PoliGraph

We define PoliGraph as a knowledge graph that captures disclosures in a policy considered as a whole. Throughout this section, we will use Figure 1 as our running example to illustrate the terminology and definitions.

Privacy laws, such as the GDPR [1] and the CCPA [2], require that organizations clearly disclose their practices regarding data collection, sharing, and use in their policies. To capture these three types of disclosures in the policy, we represent the corresponding three kinds of terms, which are also used in prior work [9, 10, 5], in PoliGraph: what data types are collected, with what entities they are shared, and for what purposes they are used.

  • Data type: This kind of term refers to the type of data being collected. In Figure 1(a), "location" is a specific collected data type. General terms can be used as well, e.g., "personal information" and "device information".

  • Entity: This kind of term refers to the organization that receives the collected data. It can be the first party, i.e., the owner of the system (e.g., website, mobile app, etc.) that writes the policy, namely "we" in Figure 1(a); or, otherwise, a third party such as "travel partners", also in Figure 1(a).

  • Purpose: Policies may also specify purposes. (In this paper, we refer to the purposes of processing of personal data as specified in the GDPR, namely the purposes of collection, use, and sharing. US laws often distinguish among the three, e.g., the CCPA appears to require a policy to disclose the purposes of collecting/using and the purposes of sharing personal information.) In Figure 1(a), purposes include "provide services", "authenticate your account", and "provide features".

In PoliGraph, we represent data types and entities as two different types of nodes. Furthermore, we encode the following relations between them as edges.

  • COLLECT edge: An entity e may collect a data type d. In Figure 1(a), "personal information" is collected by the first-party entity "we", but it is also shared with the entity "travel partners" (a third party). More formally, a COLLECT edge from an entity e to a data type d represents that d is collected by e.

  • SUBSUME edge: A generic term (hypernym) may subsume a more specific term (hyponym). For example, "personal information" subsumes "device information" and "location", and "device information" in turn subsumes "IP address". More formally, a SUBSUME edge connects nodes t1 and t2, where both nodes are data types or both are entities, and it represents that the more generic term t1 subsumes the more specific term t2.

  • Purposes as edge attributes: We represent purposes by assigning them as a list of attributes to each COLLECT edge. This is a natural choice that fits how policies are written: one or more purposes are typically associated with a data type and an entity. For example, in Figure 1(a), the entity "we" (i.e., KAYAK) collects "this information", which refers to "location", for the purpose "to provide features". This purpose is captured in the list of attributes assigned to the COLLECT edge from "we" to "location".
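The node and edge structure described above can be sketched as a small data structure. The following is a minimal illustration (not the authors' implementation) that encodes the Figure 1(c) example:

```python
# Minimal sketch of the PoliGraph structure, encoding the KAYAK excerpt
# of Figure 1(c). The PoliGraph class itself is illustrative.

class PoliGraph:
    def __init__(self):
        self.node_type = {}   # term -> "DATA" or "ENTITY"
        self.subsume = set()  # SUBSUME edges: (hypernym, hyponym)
        self.collect = {}     # COLLECT edge (entity, data) -> list of purposes

    def add_node(self, term, kind):
        self.node_type[term] = kind

    def add_subsume(self, hypernym, hyponym):
        self.subsume.add((hypernym, hyponym))

    def add_collect(self, entity, data, purposes=()):
        self.collect[(entity, data)] = list(purposes)

g = PoliGraph()
for t in ["personal information", "device information", "location", "IP address"]:
    g.add_node(t, "DATA")
for t in ["we", "travel partners", "social networking services"]:
    g.add_node(t, "ENTITY")
g.add_subsume("personal information", "device information")
g.add_subsume("personal information", "location")
g.add_subsume("device information", "IP address")
g.add_collect("we", "personal information",
              ["provide services", "authenticate your account"])
g.add_collect("we", "location", ["provide features"])
g.add_collect("travel partners", "personal information")
```

Here purposes live on COLLECT edges rather than as separate nodes, matching the edge-attribute design above.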

In summary, we define PoliGraph, representing all knowledge within a particular policy, as follows.

Definition 2.1.

PoliGraph. A PoliGraph G is a directed acyclic graph. Each node represents a term that is either a data type d or an entity e. Each edge is either a SUBSUME edge or a COLLECT edge, as defined above.

Furthermore, a list of attributes, Purposes, can be assigned to each COLLECT edge from an entity e to a data type d.

Figure 1(c) shows the PoliGraph representation of the policy text in Figure 1(a). The technical details about building the graph from verbatim text, such as how to map the co-reference term “this information” to “location”, are provided in Section 3. Next, we define relations that can be inferred from PoliGraph about policy text.

Definition 2.2.

Subsumption Relation. In a PoliGraph G, we say that a term t1 (hypernym) subsumes another term t2 (hyponym), denoted as t1 ≥ t2, iff there exists a path from t1 to t2 in G in which every edge is a SUBSUME edge. (A subsumption relation is naturally transitive. To simplify other definitions, we also make it reflexive, i.e., every term subsumes itself.)

Definition 2.3.

Collection Relation. In a PoliGraph G, we say that an entity e collects a data type d, denoted as collect(e, d), iff there exist an entity e' and a data type d' such that e' ≥ e, d' ≥ d, and the COLLECT edge from e' to d' exists. (That is, a policy may disclose data collection using general terms. For instance, in Figure 1(c), collect(we, IP address) holds because "IP address" is also "personal information".)

Definition 2.4.

Set of Purposes. Following Definition 2.3, if a purpose p appears in the Purposes attribute of such a COLLECT edge from e' to d', we say that e collects d for the purpose p. We denote the set of all such purposes p in G as purposes(e, d).

Beyond what is captured by individual nodes, edges, and attributes, the strength of PoliGraph is that it allows us to make inferences. In Figure 1(c), there is no direct edge from "travel partners" to "location", but we can still infer that "location" may be shared with "travel partners" and "social networking services". Furthermore, we can also infer that collect(we, location) holds and that purposes(we, location) includes "provide features". Such data practices that are implied, but not explicitly stated, would be missed by prior work that only processes individual sentences/statements, and possibly by human readers as well.
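The inference rules of Definitions 2.2-2.4 reduce to simple graph reachability. The sketch below (our own dict-based encoding of the Figure 1(c) graph, not the authors' code) shows how the implied practices follow:

```python
# Pure-Python sketch of Definitions 2.2-2.4 over the Figure 1(c) example.

SUBSUME = {  # hypernym -> list of hyponyms
    "personal information": ["device information", "location"],
    "device information": ["IP address"],
}
COLLECT = {  # COLLECT edge (entity, data type) -> Purposes attribute
    ("we", "personal information"): ["provide services", "authenticate your account"],
    ("we", "location"): ["provide features"],
    ("travel partners", "personal information"): [],
}

def subsumes(a, b):
    """Definition 2.2: a >= b iff a SUBSUME-path from a to b exists (reflexive)."""
    if a == b:
        return True
    return any(subsumes(c, b) for c in SUBSUME.get(a, []))

def collects(e, d):
    """Definition 2.3: e collects d iff some COLLECT edge (e', d') exists
    with e' >= e and d' >= d."""
    return any(subsumes(e2, e) and subsumes(d2, d) for (e2, d2) in COLLECT)

def purposes(e, d):
    """Definition 2.4: all purposes attached to qualifying COLLECT edges."""
    return {p for (e2, d2), ps in COLLECT.items()
            if subsumes(e2, e) and subsumes(d2, d) for p in ps}
```

With this, collects("travel partners", "location") holds even though no direct edge exists, mirroring the inference described above.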

Prior state-of-the-art work would have extracted a list of tuples, as depicted in the example of Figure 1(b). PolicyLint [9] and follow-up works [10, 12] extract 2-tuples ⟨entity, data type⟩. Purposes can be extracted independently and appended to form a longer 3-tuple ⟨entity, data type, purpose⟩, as in OVRseen [6], or put in a nested tuple, as in PurPliance [5]. In all cases, those tuples are extracted from individual sentences that are disconnected from each other. As a result, prior work would fail to identify implied practices. In contrast, PoliGraph connects terms with the same semantics in different sentences, allowing inferences and improving coverage.

Another major strength of PoliGraph is that its modular design makes it easy to extend to capture additional relations, if so desired. For example, in future work, one could further distinguish among verbs that indicate data collection (e.g., "sell" for profit vs. general collection) by introducing additional types of edges beyond COLLECT, as well as by adding other edge attributes (e.g., consent type: opt-in vs. opt-out).

2.2 Ontologies

Policies refer to data types and entities at different semantic granularities. For example, “device information” in Figure 1(a) is a generic data type that subsumes “IP address” and maybe other more specific data types. Prior work [9, 10, 6] has introduced hierarchies of terms, namely ontologies, to define the subsumption relations between data types or entities. They typically define the data and entity ontologies heuristically and manually, by considering a combination of information found in the network traffic and in the policy text, as well as using domain expertise to organize terms into hierarchies.

We revisit the notion of ontologies under the PoliGraph framework. First, PoliGraph naturally captures subsumption relations described in an individual policy, which form the local ontology. Ideally, if a policy is written in a clear and complete way, it should either use specific terms, or clearly define general terms that will be captured by the corresponding local ontology. In practice, policies are not perfectly written and parts of the ontology may be missing. For example, in Figure 1(a), the term “social networking services” is not further explained. Furthermore, some policies may provide misleading definitions, e.g., “geolocation” is described as non-personal information, whereas it is widely considered personal by the public and privacy laws (see Section 4.3). Second, we define and design global ontologies that encode external knowledge or ground truth, as in prior work. For the first time, the distinction between local and global ontologies provides a principled way to summarize an individual policy, as well as to assess the completeness and correctness of definitions by comparing the local against the global ontologies.

2.2.1 Local Ontologies

In PoliGraph, SUBSUME edges between data types or entities induce a directed acyclic graph, which we refer to as a local ontology, capturing the relations between more generic and more specific terms, as defined within a particular policy. We define local data and entity ontologies as follows.

Definition 2.5.

Local Ontology. A local ontology is either a data ontology or an entity ontology: a directed acyclic graph that is a subgraph of a PoliGraph G, in which every node is a data type (respectively, an entity), and every edge is a SUBSUME edge.

In Figure 1(c), the four blue nodes containing data types form the local data ontology: the root node is “personal information” and the leaf nodes are “location” and “IP address”. The local entity ontology, which contains the three green nodes, does not have a nontrivial hierarchical structure because the policy does not further explain the terms “travel partners” and “social networking services”.
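Extracting a local ontology amounts to restricting the SUBSUME edges to nodes of one kind. A sketch over the Figure 1(c) example (our own encoding, not the authors' code):

```python
# Sketch: deriving the local data ontology of Figure 1(c) as the
# SUBSUME-edge subgraph restricted to DATA nodes (Definition 2.5).

NODE_TYPE = {
    "personal information": "DATA", "device information": "DATA",
    "location": "DATA", "IP address": "DATA",
    "we": "ENTITY", "travel partners": "ENTITY",
    "social networking services": "ENTITY",
}
SUBSUME_EDGES = [
    ("personal information", "device information"),
    ("personal information", "location"),
    ("device information", "IP address"),
]

def local_ontology(kind):
    """Keep only SUBSUME edges whose endpoints are both of the given kind."""
    return [(a, b) for a, b in SUBSUME_EDGES
            if NODE_TYPE[a] == kind and NODE_TYPE[b] == kind]

data_ont = local_ontology("DATA")
hypernyms = {a for a, _ in data_ont}
hyponyms = {b for _, b in data_ont}
roots = hypernyms - hyponyms    # most generic terms
leaves = hyponyms - hypernyms   # most specific terms
```

For this policy, the root is "personal information" and the leaves are "location" and "IP address", matching the description above; the entity ontology comes out empty of edges, reflecting its trivial hierarchy.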

2.2.2 Global Ontologies

We define a global ontology to encode external knowledge, i.e., outside a particular policy, which we consider as ground truth in that context. It provides a reference against which we can compare and evaluate individual policies, as well as a complement to missing definitions in policies.

Definition 2.6.

Global Ontology. A global ontology is either a data ontology or an entity ontology: a directed acyclic graph in which every node is a data type (respectively, an entity), and every edge is a SUBSUME edge.

Prior work [9, 10, 6] has implicitly and heuristically defined such global ontologies, by taking into account and combining the union of all subsumption relations extracted from policies in their corpus, and the data types and entities observed in the actual system’s output (e.g., network traffic). However, such global ontologies have not been universal: they may include subjective judgement, and they typically do not apply across application domains. For example, PoliCheck’s data ontology does not assume “personal information” to include “device information”: this contradicts the content of the policy depicted in Figure 1(a). Although we recognize that there is no single way to define perfect global ontologies, we propose that we rely on authoritative sources, such as privacy laws, to define them. An example is described next, but other designs can be used with PoliGraph as well.


Figure 2: Global Data Ontology based on the CCPA.
Global Data Ontology Based on the CCPA

As a concrete, illustrative example, we propose a global data ontology that is based on the CCPA [2]. The CCPA governs the collection, use, and sharing of personal information, as defined therein, by companies that do business in California. To build the CCPA-based global data ontology, we start with the definition of "personal information" in CCPA Section 1798.140(v)(1), which includes, but is not limited to, specific data types such as a person's name, social security number, postal address, email address, and customer IP address. We place such specific data types into the ontology as leaf nodes. Then, since policies often disclose the collection of categories of these specific data types, e.g., "contact information" instead of "email address" and "postal address", we organize these specific data types into categories delineated by subsumption relations. The CCPA's definition of personal information also includes categories for which it does not list specific data types, e.g., "biometric information". In such cases, we include the categories in the global data ontology and augment it with common specific data types, e.g., "biometric information" includes "voiceprint" and "fingerprint". Similarly, the CCPA uses the term "device identifier" but does not define it, so we include it as a category in the global data ontology and place specific device identifiers in that category. Figure 2 shows the CCPA-based global data ontology. The above is meant as a concrete example of a global ontology based on a privacy law (CCPA). Different privacy laws (e.g., GDPR) can lead to different global ontologies.
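A fragment of such a CCPA-based ontology can be encoded as hypernym-to-hyponym lists. The snippet below is illustrative and includes only the categories and examples named in the text (placing "ip address" under "device identifier" is our assumption):

```python
# Illustrative fragment of a CCPA-based global data ontology,
# encoded as hypernym -> hyponym lists. Not the full Figure 2 ontology.

CCPA_DATA_ONTOLOGY = {
    "personal information": ["contact information", "biometric information",
                             "device identifier", "person's name",
                             "social security number"],
    "contact information": ["email address", "postal address"],
    "biometric information": ["voiceprint", "fingerprint"],
    # Assumption: IP address treated as a device identifier here.
    "device identifier": ["ip address"],
}

def subsumes(a, b, ont=CCPA_DATA_ONTOLOGY):
    """Reflexive-transitive subsumption check over the ontology."""
    if a == b:
        return True
    return any(subsumes(c, b, ont) for c in ont.get(a, []))
```

Leaf nodes ("email address", "fingerprint", etc.) are the specific data types; interior nodes are the categories derived from the law.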

Global Entity Ontology

Privacy laws give examples of the types of entities, but not an exhaustive list of entities, with whom an organization may share personal information. For example, CCPA regulations [18], which accompany the CCPA, give examples of such entities: "advertising networks", "internet service providers", "data analytics providers", "government entities", "operating systems and platforms", "social network", "data brokers", etc. However, policies often categorize entities by service types and use different terms. We obtain a list of entities and their categories from the DuckDuckGo Tracker Radar dataset [19] and a CrunchBase-based dataset [20]. Based on these sources, containing 4,709 entities in total, we propose a simple two-level ontology that classifies entities into six categories, as shown in Figure 3.


Figure 3: Global Entity Ontology based on [19, 20].


Figure 4: Overview of PoliGraph-er implementation. First, PoliGraph-er preprocesses the HTML document to produce a simplified document tree structure. Second, the NLP pipeline takes the document tree and labels sentences with linguistic labels. Third, the labeled sentences are annotated by the annotators to produce a phrase graph containing all the annotations. Finally, the graph building stage deploys term normalization and purpose classification to transform the phrase graph into a PoliGraph.
Extended PoliGraph

When analyzing a policy, it is sometimes desired to merge a PoliGraph, extracted from a single policy, with the global ontologies to form its extended version.

Definition 2.7.

Ext-PoliGraph. An Ext-PoliGraph extends a PoliGraph G by merging in the nodes and SUBSUME edges of the global data ontology and the global entity ontology, s.t. the node set of the Ext-PoliGraph is the union of the node sets of G and the two global ontologies, and its edge set is the union of their edge sets.

This is useful when comparing policies with other sources that use terms at different semantic levels, e.g., when performing analyses on negative statements (see Section 4.4) and flow-to-policy consistency (see Section 4.5).
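Since the merge in Definition 2.7 reduces to set unions, it can be sketched in a few lines; the edge tuples below are illustrative examples, not taken from a real policy:

```python
# Minimal sketch of Definition 2.7: merging a PoliGraph's SUBSUME edges
# with those of the global ontologies via set union.

def extend_poligraph(policy_edges, *global_ontologies):
    """Return the Ext-PoliGraph SUBSUME edge set (union of all edge sets);
    the merged node set follows implicitly from the edges."""
    merged = set(policy_edges)
    for edges in global_ontologies:
        merged |= set(edges)
    return merged

# Illustrative edges only:
local = {("personal information", "location")}
global_data = {("personal information", "geolocation"),
               ("geolocation", "precise location")}
global_entity = {("advertiser", "ad network")}

ext = extend_poligraph(local, global_data, global_entity)
```

The merged graph lets subsumption queries bridge terms that the policy itself never relates, which is what the consistency analyses in Sections 4.4 and 4.5 rely on.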

3 PoliGraph-er Implementation

We present PoliGraph-er, the NLP-based system that we implement to generate PoliGraph from the text of a policy. Figure 4 gives an overview of its implementation.

3.1 NLP on Structured Documents

HTML Preprocessing

Policies are usually published online as structured documents, mainly in HTML format, while NLP models expect plain text input. Simply stripping HTML tags, such as headings and lists, would result in a loss of semantics. Unlike prior work that simply flattened the HTML structure, our HTML preprocessing converts each HTML document to a simplified document tree which preserves three important document structures: heading, list item, and general text. The document tree helps to generate complete sentences as input for NLP. Please see Appendix A for details.
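As a rough illustration of preserving headings, list items, and general text rather than flattening the HTML (the actual preprocessing is described in Appendix A), here is a sketch built on Python's stdlib HTMLParser:

```python
# Illustrative sketch (not the authors' code): convert HTML into an
# ordered sequence of (kind, text) segments, keeping the three document
# structures named in the text: heading, list item, and general text.

from html.parser import HTMLParser

class DocTreeBuilder(HTMLParser):
    BLOCK_KINDS = {"h1": "heading", "h2": "heading", "h3": "heading",
                   "li": "listitem", "p": "text"}

    def __init__(self):
        super().__init__()
        self.segments = []  # (kind, text) pairs, in document order
        self._stack = []    # currently open block kinds
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_KINDS:
            self._flush()
            self._stack.append(self.BLOCK_KINDS[tag])

    def handle_endtag(self, tag):
        if tag in self.BLOCK_KINDS and self._stack:
            self._flush()
            self._stack.pop()

    def handle_data(self, data):
        if self._stack and data.strip():
            self._buf.append(data.strip())

    def _flush(self):
        if self._buf and self._stack:
            self.segments.append((self._stack[-1], " ".join(self._buf)))
        self._buf = []

builder = DocTreeBuilder()
builder.feed("<h2>Information We Collect</h2>"
             "<p>We collect personal information:</p>"
             "<ul><li>Device information</li><li>Location</li></ul>")
```

Keeping the segment kinds makes it possible to later rejoin a list's preceding sentence with its items into complete sentences for NLP input.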

NLP Pipeline

PoliGraph-er utilizes state-of-the-art RoBERTa-based NLP models [21]. Specifically, we build PoliGraph-er on the spaCy library [22] and use its en_core_web_trf NLP pipeline to label each sentence with linguistic features, including word lemmas, part-of-speech tags, sentence segmentation, and syntactic dependencies captured in the dependency tree (see the output of "NLP Pipeline" in Figure 4). These features are syntactic and thus require no domain adaptation.

To identify noun phrases as data types and entities in a policy, PoliGraph-er uses named entity recognition (NER), a standard NLP technique that classifies noun phrases into a given set of labels. In our case, we use two labels: DATA for data types and ENTITY for entities. We develop a new methodology that combines spaCy's pretrained NER model, a custom NER model, and rule-based NER for optimal performance. We generate a synthetic corpus for NER model training to avoid the burden of building a dataset. Please see Appendix B for details on PoliGraph-er's NER methodology.
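Of the three NER components, only the rule-based one can be sketched without trained models. The phrase lists below are illustrative, not the ones used by PoliGraph-er:

```python
# Sketch of a rule-based NER component with the two labels DATA and
# ENTITY. Greedy longest-match tagging over token spans; phrase lists
# are illustrative examples only.

DATA_PHRASES = {"ip address", "email address", "device information",
                "personal information", "location"}
ENTITY_PHRASES = {"advertising networks", "travel partners", "google"}

def rule_based_ner(tokens):
    """Tag the longest matching span at each position as DATA or ENTITY."""
    labels, i = [], 0
    while i < len(tokens):
        for n in range(len(tokens) - i, 0, -1):  # longest span first
            span = " ".join(tokens[i:i + n]).lower()
            if span in DATA_PHRASES:
                labels.append((span, "DATA")); i += n; break
            if span in ENTITY_PHRASES:
                labels.append((span, "ENTITY")); i += n; break
        else:
            i += 1
    return labels
```

In the full system, such rules complement the learned models rather than replace them, covering domain phrases the pretrained model misses.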

3.2 Annotators

In PoliGraph-er, we refer to the modules that identify relations between phrases as annotators. (We use "phrase" to refer to verbatim words and phrases in the policy text, and "term", which appears in previous sections as well, to refer to the normalized forms (see Section 3.3) of phrases that appear in PoliGraphs.) The relations are stored as edges in a graph structure, which we call a phrase graph. The phrase graph is still an intermediate step, in which phrases referring to the same thing have not yet been merged.

To extract relations between phrases, annotators take advantage of linguistic features labeled by the NLP pipeline to match specific syntactic patterns, especially patterns within the dependency tree [23]. For example, for the sentence “We collect device information…” in Figure 4, the collection annotator identifies the verb “collect”, and traverses down the dependency tree to find the nsubj element (nominal subject) “we” as the entity name and the dobj element (direct object) “device information” as the data type.
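The traversal can be illustrated with a toy dependency tree. The Token class below stands in for spaCy's token objects, and the tree hard-codes the parse of "We collect device information" (with the noun phrase merged into one node for brevity):

```python
# Toy dependency-tree walk mirroring the collection annotator's pattern:
# verb "collect" -> nsubj (entity) and dobj (data type).

from dataclasses import dataclass, field

@dataclass
class Token:
    text: str
    dep: str                                 # dependency label
    children: list = field(default_factory=list)

def match_collection(verb):
    """Return (entity phrase, data phrase) if the pattern matches."""
    if verb.text.lower() != "collect":
        return None
    subj = next((c for c in verb.children if c.dep == "nsubj"), None)
    obj = next((c for c in verb.children if c.dep == "dobj"), None)
    if subj and obj:
        return (subj.text, obj.text)
    return None

# Hand-built stand-in for the parse of "We collect device information".
root = Token("collect", "ROOT", [
    Token("We", "nsubj"),
    Token("device information", "dobj"),
])
```

The real annotator matches many such verb-specific patterns over spaCy's parse, including passive and composite constructions.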

We divide the linguistic analysis tasks among five annotators, each of which focuses on a specific set of patterns. Table 1 outlines the patterns and relations that each annotator tries to identify. Note that some edge types (COREF and PURPOSE) exist only in the phrase graph and will be converted when building the final PoliGraph. Next, we discuss each annotator in detail.

Annotator            Relation                         Example (based on the policy in Figure 1(a))
Collection Annot.    Entity → Data                    "We collect … personal information"
Subsumption Annot.   Hypernym → Hyponym               "Device information … such as IP address"
Purpose Annot.       Data → Purpose                   "We use your personal information … to: Provide the Services"
Coreference Annot.   Reference → Main mention         "We collect … personal information: …" / "We use this information to …"
List Annot.          Preceding sentence → List item   TEXT "We collect … personal information:" / LISTITEM "- Device information" / LISTITEM "- Location"

Table 1: Overview of annotators in PoliGraph-er.
Collection Annotator

The collection annotator finds sentences that disclose data collection or sharing practices, extracts the corresponding entities and data types, and adds COLLECT edges from entities to data types. It starts by matching verbs that indicate data collection, sharing, or use. Then, it uses additional patterns, depending on the verb, to extract entities and data types from the dependency tree. The annotator matches around 40 verbs and 20 sets of syntactic patterns. Table 5 in Appendix C lists some of the patterns in the active voice. Due to space limitations, we do not list patterns in the passive voice (e.g., “this information is shared with…”) or composite patterns (e.g., “allow us to collect”), but they are all handled by the annotator. We gather these patterns from actual policies in a public dataset which we use to evaluate PoliGraph (see Section 4). Compared to prior work [9, 10, 5], which heuristically extracts entities and data types associated with specific verbs, we create a configurable list of patterns. Thus, we can write the syntactic patterns more precisely and capture more diverse patterns than prior work.

The collection annotator also checks negative modifiers (e.g., not, never) in the dependency tree to identify negative sentences. Negative statements are discussed in Section 4.4.
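To make the two steps above concrete, here is a minimal, self-contained sketch (not PoliGraph-er's actual code) of how a collection annotator can walk a dependency tree from a matched verb to its nsubj (entity), dobj (data type), and any negative modifier. The toy Token class stands in for a real parser's output (e.g., spaCy's); the verb list is an illustrative subset.

```python
from dataclasses import dataclass, field

@dataclass
class Token:
    text: str
    dep: str                      # dependency label, e.g., "nsubj", "dobj", "neg"
    children: list = field(default_factory=list)

COLLECT_VERBS = {"collect", "share", "use", "receive"}  # abbreviated list

def annotate_collection(verb: Token):
    """Extract (entity, data_type, negated) from a collection verb's subtree."""
    if verb.text not in COLLECT_VERBS:
        return None
    entity = next((c.text for c in verb.children if c.dep == "nsubj"), None)
    data = next((c.text for c in verb.children if c.dep == "dobj"), None)
    negated = any(c.dep == "neg" for c in verb.children)
    if entity and data:
        return (entity, data, negated)
    return None

# "We collect device information."
sent = Token("collect", "ROOT", [
    Token("We", "nsubj"),
    Token("device information", "dobj"),
])
print(annotate_collection(sent))      # ('We', 'device information', False)

# "We do not collect device information."
neg_sent = Token("collect", "ROOT", [
    Token("We", "nsubj"),
    Token("not", "neg"),
    Token("device information", "dobj"),
])
print(annotate_collection(neg_sent))  # ('We', 'device information', True)
```

In the real system, the traversal patterns vary per verb and also cover passive-voice and composite constructions; this sketch shows only the simplest active-voice case.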

Subsumption Annotator

It identifies subsumption relations between phrases and adds SUBSUME edges from a hypernym to its hyponyms. It matches 11 syntactic patterns of subsumption as shown in Table 6 in Appendix C. These extend the patterns used in prior work [9, 10, 5].

Purpose Annotator

It identifies phrases that describe purposes associated with data collection and links the purpose phrases to corresponding data types with PURPOSE edges. PURPOSE edges are not part of PoliGraph and will be eventually converted into attributes on the corresponding COLLECT edges in PoliGraph. Purpose phrases are not noun phrases, so they are not identified by NER: they are directly matched by syntactic patterns. As in prior work [5], the annotator considers three forms of purpose phrases: (1) “in order to <verb> …”; (2) “to <verb> …”; (3) “for … purpose(s)”. For the linked data types, the annotator simply looks for the phrases from which the collection annotator has linked COLLECT edges. For example, in the sentence “We use this information to provide ads”, the annotator identifies the purpose phrase “to provide ads” and links it to the data type “this information”.
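The three purpose-phrase forms can be approximated with surface patterns. The sketch below uses plain regexes for illustration; PoliGraph-er itself matches these forms via syntactic (dependency-tree) patterns, so the regexes here are a simplification, not the tool's implementation.

```python
import re

# Illustrative regexes for the three purpose-phrase forms described above.
PURPOSE_PATTERNS = [
    re.compile(r"\bin order to \w+[^.,;]*"),   # (1) "in order to <verb> ..."
    re.compile(r"\bto \w+[^.,;]*"),            # (2) "to <verb> ..."
    re.compile(r"\bfor [^.,;]*purposes?\b"),   # (3) "for ... purpose(s)"
]

def find_purpose_phrase(sentence: str):
    """Return the first matching purpose phrase, or None."""
    for pat in PURPOSE_PATTERNS:
        m = pat.search(sentence)
        if m:
            return m.group(0)
    return None

print(find_purpose_phrase("We use this information to provide ads"))
# "to provide ads"
print(find_purpose_phrase("We collect data for advertising purposes"))
# "for advertising purposes"
```

A regex like pattern (2) would over-match in practice (any infinitive, not just purposes), which is one reason the paper relies on dependency patterns instead.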

Coreference Annotator

A coreference occurs when two or more phrases refer to the same thing; coreferences are common in everyday language as well as in policies. For example, in Figure 1(a), the phrase “this information” refers to “location”. The two phrases are coreferences, and “location” is the main mention, namely the main phrase that the other phrases refer to. Prior work [9, 10, 5] could not handle coreferences properly, resulting in a loss of semantics and the misinterpretation of many collection statements.

Resolving that “this information” refers to “location” turns out to be a non-trivial NLP task. We find that well-known coreference resolution models [24] do not handle non-personal references well, even though such references are common in policies. To address this challenge, we devise our own coreference resolution approach that is specifically optimized for policies. Since most coreferences in policies follow similar patterns, we use a heuristic-based coreference annotator to resolve them. Coreference phrases are linked to their main mentions with COREF edges, which are part of the phrase graph but not of the PoliGraph; they are later used to map coreferences to the terms they refer to.

The coreference annotator handles three forms of coreferences. First, for a phrase starting with a determiner “this”, “that”, “these”, or “those” (e.g., “these providers”), the annotator looks backward for the nearest phrase with the same root word (e.g., “ad providers”) in the same or previous sentence. Specifically, if the root word is “data” or “information” (e.g., “this information”), the annotator looks backward for any data types as the main mention. Second, for a pronoun like “it”, “this”, “they”, or “these”, the annotator tries to infer whether it refers to a data type or an entity based on existing SUBSUME edges, and looks backward for the nearest data type or entity labeled by NER. Third, phrases in the form “some / all / examples / … of noun phrase” are treated as a special form of coreference, where the prefix word “some / all / examples / …” is resolved into the noun phrase.
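The first heuristic above can be sketched in a few lines. This is an illustrative reimplementation, not the annotator's code: a “this/that/these/those <noun>” phrase resolves to the nearest preceding phrase with the same root word, and a generic root like “information” falls back to the nearest preceding data-type phrase.

```python
DETERMINERS = {"this", "that", "these", "those"}
GENERIC_ROOTS = {"data", "information"}

def resolve_coref(phrase: str, preceding_phrases: list, data_types: set):
    """Resolve a determiner-led phrase to its main mention, or None."""
    words = phrase.split()
    if len(words) < 2 or words[0].lower() not in DETERMINERS:
        return None
    root = words[-1].lower()
    for prev in reversed(preceding_phrases):        # nearest phrase first
        if root in GENERIC_ROOTS:
            if prev in data_types:                  # any data type qualifies
                return prev
        elif prev.split()[-1].lower() == root:      # same root word
            return prev
    return None

prior = ["ad providers", "location"]
print(resolve_coref("these providers", prior, {"location"}))   # "ad providers"
print(resolve_coref("this information", prior, {"location"}))  # "location"
```

The real annotator works over parsed tokens and NER labels rather than raw strings, and additionally handles bare pronouns and “some/all/examples of …” prefixes as described above.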

List Annotator

This is a special annotator which uses the document tree structure to discover relations. If there is any list item that starts with a dangling noun phrase, for example, “device information” and “location” in Figure 1(a), the annotator checks the preceding sentence before the list, and adds new edges in the following two cases. First, if the preceding sentence has a noun phrase modified by the word “following” or “below” (e.g., “the following personal information”), this noun phrase will be linked to the dangling noun phrase with a SUBSUME edge. Second, if concatenating the preceding sentence and a list item forms a complete sentence, we assume that previous annotators have already annotated any edges in between. If this is the case, i.e., there is an edge from a phrase in the preceding sentence to a list item, the annotator simply duplicates the edge to connect the phrase to each list item.
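The first rule of the list annotator can be sketched as follows. This is an illustrative simplification (operating on strings rather than the document tree): a “following”/“below” noun phrase in the preceding sentence gets a SUBSUME edge to each dangling list item.

```python
def annotate_list(preceding_sentence: str, noun_phrases: list, list_items: list):
    """Link a 'following'/'below' noun phrase to each dangling list item."""
    edges = []
    for np in noun_phrases:
        if "following" in np or "below" in np:
            for item in list_items:
                edges.append((np, "SUBSUME", item))
    return edges

edges = annotate_list(
    "We collect the following personal information:",
    ["the following personal information"],
    ["device information", "location"],
)
print(edges)
# [('the following personal information', 'SUBSUME', 'device information'),
#  ('the following personal information', 'SUBSUME', 'location')]
```

The second rule (concatenating the preceding sentence with each list item and duplicating any edge found by earlier annotators) requires re-running the pipeline and is omitted from this sketch.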

3.3 From Phrase Graph to PoliGraph

The final step of PoliGraph-er is to build a PoliGraph from a phrase graph.

Normalizing Data Types and Entities

To merge phrases with the same meaning (i.e., synonyms and coreferences) into one node in PoliGraph, PoliGraph-er maps phrases of data types and entities to their normalized forms. This process, known as phrase normalization [25], is necessary for automated analysis, because it allows us to deal with consistent normalized forms instead of many synonymous terms, e.g., “contact information” is the normalized form for “contact details” and “contact info”.

PoliGraph-er uses a combination of NLP and manual rules for phrase normalization. First, it removes stop words and lemmatizes every phrase of data types and entities to obtain the normalized forms. Second, for the terms in our global ontologies (see Section 2.2), which we see as standard terms, we write regular expressions to capture more synonyms and normalize them to the terms used in the ontologies. Notably, for entities, PoliGraph-er normalizes company names using regular expressions automatically built based on public datasets. We leverage the DuckDuckGo Tracker Radar dataset [19] and a CrunchBase-based public dataset [20] to obtain variants of company names. Please see Appendix D for details on the phrase normalization strategies.
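A minimal sketch of this two-step normalization is shown below. The stop-word set and synonym regexes here are illustrative stand-ins for PoliGraph-er's actual lists (Appendix D); real lemmatization (e.g., via spaCy) is also omitted for brevity.

```python
import re

# Illustrative subset: drop stop words, lowercase, then map known synonyms
# onto the global ontology's standard terms via regexes.
STOP_WORDS = {"the", "a", "an", "your", "our"}
SYNONYM_PATTERNS = [
    (re.compile(r"^contact (details|info)$"), "contact information"),
    (re.compile(r"^(geo)?location$"), "geolocation"),
    (re.compile(r"^e-?mail( address)?$"), "email address"),
]

def normalize(phrase: str) -> str:
    """Return the normalized form of a data-type or entity phrase."""
    words = [w for w in phrase.lower().split() if w not in STOP_WORDS]
    base = " ".join(words)
    for pattern, standard in SYNONYM_PATTERNS:
        if pattern.match(base):
            return standard
    return base

print(normalize("your contact details"))  # "contact information"
print(normalize("Location"))              # "geolocation"
```

For company names, the same idea applies, except that the regexes are generated automatically from the DuckDuckGo Tracker Radar and CrunchBase-based datasets rather than hand-written.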

For coreferences, the coreference annotator has linked each coreference phrase to its main mention. PoliGraph-er does not apply phrase normalization to phrases with COREF edges. Instead, it follows the COREF edges to find the main phrase and uses the same normalized form. For example, in Figure 1(a), “this information” would be normalized to “geolocation”, the same as the phrase “location” that it refers to.

Unspecified Data and Unspecified Third Party

Some phrases of data types or entities have very generic meanings, such as “information” and “third party”. Many policies use such phrases independently without further specifying the meaning, making it hard to associate them with any specific data types and entities. For example, in the sentence “we collect information to provide services”, the word “information” can be interpreted as unspecified (or all possible) data types, or some specific data types mentioned around the sentence. PoliGraph-er uses two special nodes, “unspecified data” for data types and “unspecified third party” for entities, to capture such generic terms that are not further specified. The first party, namely “we”, is always specific. Specifically, as a special case of phrase normalization, if a data type or entity is lemmatized into a single word listed in Table 4 in Appendix B, or “third party”, and it is not bounded by subsumption relations, it will be normalized to “unspecified data / third party”. Please see Appendix B for more details.
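The “unspecified” rule can be expressed as a small special case of normalization. The generic-word set below is an illustrative guess at the flavor of Table 4, not its actual content.

```python
GENERIC_DATA_WORDS = {"information", "data"}   # illustrative subset of Table 4

def normalize_generic(lemma: str, is_entity: bool, has_subsume_edge: bool) -> str:
    """Collapse an unbounded generic phrase into a special 'unspecified' node."""
    if has_subsume_edge:
        return lemma                     # bounded by subsumption: keep as-is
    if not is_entity and lemma in GENERIC_DATA_WORDS:
        return "unspecified data"
    if is_entity and lemma == "third party":
        return "unspecified third party"
    return lemma

print(normalize_generic("information", False, False))  # "unspecified data"
print(normalize_generic("third party", True, False))   # "unspecified third party"
print(normalize_generic("information", False, True))   # "information"
```

The third call shows the exception: a generic word that is bounded by subsumption relations (e.g., “information such as your IP address”) keeps its own node.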


Figure 5: Example of PoliGraph generated from “Puzzle 100 Doors” app’s policy [26].
Classifying Purpose Phrases

PoliGraph-er’s purpose annotator identifies purpose phrases. To allow automated analysis of purpose, we coarsely group purpose phrases into five categories: services, security, legal, advertising, and analytics. These categories are derived from the business purposes defined in the CCPA [2] and partly aligned with prior work [7, 6, 5]. As in [6], we distinguish between core (i.e., services, security, and legal) and non-core (i.e., advertising and analytics) purposes of data collection.

PoliGraph-er uses zero-shot classification [27] to classify each purpose phrase into one or more categories. Zero-shot classification is the task of classifying text into any given set of categories without retraining the model on new data. This is possible because the model is trained on a large real-world corpus and can estimate the semantic similarity between the input text and each label. In particular, we use Facebook’s pre-trained bart-large-mnli model provided by the HuggingFace library [28]. For example, the purpose phrase “to provide features” is classified as services, whereas the phrase “for advertising purposes” is classified as advertising. Note that a purpose phrase can be classified into multiple labels if it mentions more than one purpose.

Building PoliGraph

Finally, PoliGraph-er builds PoliGraph from the phrase graph by merging phrases with the same normalized form into one node, keeping COLLECT and SUBSUME edges, and inferring the attributes from PURPOSE edges in the phrase graph.
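The conversion above can be sketched in pure Python. This is an illustrative reimplementation (data structures and function names are ours): COREF edges are followed to main mentions, nodes are merged by normalized form, and PURPOSE edges become purpose attributes on the matching COLLECT edges.

```python
def build_poligraph(edges, coref, normalize):
    """edges: list of (src, relation, dst); coref: phrase -> main mention.
    Returns (collect edges with purpose attributes, subsume edges)."""
    def node(phrase):
        # Follow COREF to the main mention, then normalize.
        return normalize(coref.get(phrase, phrase))

    collect, subsume, purposes = set(), set(), {}
    for src, rel, dst in edges:
        if rel == "COLLECT":
            collect.add((node(src), node(dst)))
        elif rel == "SUBSUME":
            subsume.add((node(src), node(dst)))
        elif rel == "PURPOSE":                  # data-type phrase -> purpose phrase
            purposes.setdefault(node(src), set()).add(dst)

    # Fold purposes into attributes on the corresponding COLLECT edges.
    return {(e, d): sorted(purposes.get(d, set())) for e, d in collect}, subsume

edges = [
    ("we", "COLLECT", "location"),
    ("this information", "PURPOSE", "to provide ads"),
]
coref = {"this information": "location"}
normalize = lambda p: {"location": "geolocation"}.get(p, p)
graph, _ = build_poligraph(edges, coref, normalize)
print(graph)   # {('we', 'geolocation'): ['to provide ads']}
```

Note how the purpose attached to “this information” ends up on the COLLECT edge to “geolocation”: the COREF edge and phrase normalization together merge the two phrases into one node.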

Figure 5 shows an example of a PoliGraph generated from a simple policy [26] for demonstration purposes. A typical PoliGraph from the policies that we have analyzed (see Section 4) can contain up to hundreds of nodes and edges (e.g., [4]). It is common to see vague phrases like “information about activity on Site” and “statistical user data” that are not further clarified, and misleading definitions that include data types that are likely personal and identifiable. However, PoliGraph does capture data collection, along with its purposes, and subsumptions for further analysis.

4 PoliGraph Evaluation and Applications

In Section 4.1, we evaluate the performance of PoliGraph-er. Next, we report novel applications enabled by PoliGraph: Section 4.2 presents policy summarization, which reveals common patterns across different policies; Section 4.3 looks into how similar terms are defined across different policies. Finally, we show that PoliGraph can improve applications that have been explored in prior work, such as negative statement analysis (Section 4.4) and flow-to-policy consistency analysis [9, 10, 5] (Section 4.5).

The PoliCheck Dataset

Throughout this section, we use the public dataset provided by PoliCheck [10]. We choose this dataset because it is among the largest public datasets for policies. It contains policy URLs for 22,422 Android apps, which is large enough to necessitate automated (not manual) analysis. Furthermore, it also comes with the apps’ network traffic data that facilitate flow-to-policy consistency analysis [10]. We wrote a crawler script based on the Playwright library [29] to download the policy text from each URL.

4.1 PoliGraph-er Performance

We report PoliGraph-er’s performance in analyzing policies and generating PoliGraphs.

Validation of PoliGraph-er’s Results

First, we obtain the most recent version of the policies from August 2022. After excluding non-English, invalid, and duplicated webpages, we obtain 5,518 policies used by 11,403 apps. Using PoliGraph-er, we successfully generate PoliGraphs for 5,039 policies. The remaining policies cannot be processed because they either do not claim to collect data or use irregular HTML tags that cannot be correctly parsed.

Then, we validate PoliGraph-er’s performance in extracting information from the policies in the dataset and building PoliGraphs. We compute the precision of the generated edges: 91.8% for COLLECT edges and 94.0% for SUBSUME edges. Thus, PoliGraph-er’s overall precision is comparable to, or better than, prior work: PolicyLint and PurPliance report 82% and 91% precision, respectively [9, 5]. Next, we evaluate PoliGraph-er’s performance on purpose classification. Overall, the macro-averaged precision and recall are 82.5% and 79.3%, respectively, for this multi-label, multi-class classification task.

PoliGraph-er’s Coverage

We also measure the coverage of PoliGraph-er, i.e., how many collection statements can be analyzed, and compare it against the coverage of prior work, namely PolicyLint [9], which is open-source. (Unfortunately, we could not compare PoliGraph-er with other similar systems [11, 5] as they were not available at the time of this evaluation.) We use the Wayback Machine to obtain 4,729 of the 2019 (historical) versions of the policies in the PoliCheck dataset to facilitate a fair comparison with PolicyLint, which was published in 2019. Since PolicyLint extracts tuples (see Figure 1(b)), we convert the corresponding nodes and edges in PoliGraphs into the same tuple form. Together, both tools find a total of 8,581 unique comparable tuples. Among them, 7,797 tuples are found by PoliGraph-er, while 4,840 are found by PolicyLint. Thus, PoliGraph-er extracts 61.1% (2,957 / 4,840) more tuples. When we run PolicyLint on the current version of the policies, its coverage drops even further, as it overfits its original set of policies. This aligns with the observation reported in [14] that state-of-the-art policy analysis tools incorrectly reason about more than half of the policies in their dataset. For details on PoliGraph-er’s performance, please see Appendix E.

4.2 Policies Summarization


Figure 6: Statistics of COLLECT edges. The numbers of edges with attributes are shown in parentheses. For example, there are 2,894 COLLECT edges from the entity “we” to the data type “personal information”, where 1,726 of them have attributes.


Figure 7: Statistics of SUBSUME edges between data types. For example, there are 615 SUBSUME edges that connect a “personal information” node and a “person name” node.

We use PoliGraph to summarize all policies in our corpus and to reveal common patterns across them. While previous tools can also analyze and extract data types, entities, and purposes from a policy, two limitations may have hindered them from being used for this application: (1) the coverage (or recall) was low; and (2) there was no local ontology to capture the definitions of terms within each individual policy.

4.2.1 Characterization of PoliGraph Edges

First, we characterize the COLLECT and SUBSUME edges in the PoliGraphs generated by PoliGraph-er.

Collect Edges

PoliGraph-er extracts 96,720 COLLECT edges from 89,088 sentences in the dataset. Among them, 33,196 edges have attributes from 37,495 purpose phrases. Figure 6 shows the common COLLECT edges found in the dataset. Generic terms, such as “personal information” and “personal identifier”, are commonly used to express data types in policies. Some specific terms, such as “cookie / pixel tag”, “email address”, and “ip address” are also found in many policies. Furthermore, we find that policies disclose data collection by first-party (i.e., “we”) more frequently than by third-party entities. Major third-party entity categories are “advertiser” and “analytic provider”. Google, as the platform, is also frequently mentioned in the policies.

Subsume Edges

PoliGraph-er extracts 66,971 SUBSUME edges from 21,021 sentences in the dataset. Figure 7 shows common SUBSUME edges that connect data type nodes. “Personal information” and “personal identifier” are the most frequently used generic terms to represent data types. Notably, we find that many policies declare the collected data as “non-personal information”: this conflicts with our CCPA-based global data ontology (see Section 4.3).

“Unspecified”

The nodes “unspecified data” and “unspecified third party” (see Section 3.3) are found in 74.0% (3,728) of PoliGraphs. This is because many policies discuss data collection, sharing, and use in separate sections. When they discuss sharing, precise data types are often omitted. When they discuss purposes of use, both data types and entities can be unspecified terms. For example, KAYAK’s policy states: “To protect rights and property, we may disclose your information to third parties” [4]. Without further details on “information” and “third parties”, the statement is captured in PoliGraph as a COLLECT edge unspecified third party → unspecified data with security as the purpose.


Figure 8: Statistics of policies declaring the collection of eight categories of data types. For example, 3,217 policies disclose the collected data types as “software identifier”.

4.2.2 Common Patterns

Next, we present further analyses that facilitate inferences on data types, third-party entities, and purposes that reveal common patterns across policies in our dataset.

Data Types

In this analysis, we use the eight parent nodes of the leaf nodes in the data ontology shown in Figure 2 to group the data types into eight categories: “government identifier”, “contact information”, “software identifier”, “hardware identifier”, “protected classification”, “biometric information”, “geolocation”, and “internet activity”. Figure 8 shows the numbers of policies that disclose the collection of data types in the eight categories. Overall, 77.2% (3,890) of policies disclose the collection of at least one of these data categories.

Finding 1.

The most frequently collected data category is “software identifier”, which mostly originates from “cookie” as the specific data type being collected. 63.8% (3,217) of policies disclose the collection of “software identifier”. Among the specific data types, “cookie / pixel tag” is the most common and found in 84.1% (2,705 / 3,217) of these policies. On the other hand, identifiers specific to mobile apps, mainly “advertising ID” and “Android ID”, are found in only 25.8% (830) and 2.9% (92) of policies, respectively. Many developers simply write one policy for various products, including mobile apps and web-based services. Furthermore, some developers seem to use “cookie” as a generic term for all kinds of device identifiers for tracking.


Figure 9: Statistics of policies that disclose the collection of data types (per category) by third-party entities. For example, 745 policies disclose that data types in the “software identifier” category are collected by “advertiser” as the third-party entity.
Third-Party Entities

We use the six parent nodes of the leaf nodes in the entity ontology shown in Figure 3 to group entities into six entity categories: “advertiser”, “analytic provider”, “social media”, “content provider”, “auth provider”, and “email service provider”. Figure 9 reports how each data type is disclosed to be shared with or collected by third-party entities (i.e., COLLECT edges with third-party entities).

Finding 2.

“Software identifier” is frequently shared with advertisers and analytic providers. Third-party sharing of other data categories (e.g., “geolocation”, “protected classification”, and “internet activity”) is also non-negligible. Advertisers and analytic providers are major third parties with whom an app shares data. We find that in 28.3% (913 / 3,217) of policies, the collection of “software identifier” involves third parties. 12-16% of policies declare that their apps also share data in other categories (e.g., “geolocation”, “protected classification”, and even “internet activity”) with third parties. While data in these categories may be sensitive, it is unclear that they would be needed for the functionality of the app.

Finding 3.

Many policies disclose data sharing using generic terms, e.g., “personal information”. This leads to the inference that the app may share all data types that a generic term subsumes. This often happens when policies disclose data collection, sharing, and use separately. For example, Figure 1 shows that “personal information” is shared with entities such as “social networking services”. This may be alarming to users since “personal information” subsumes sensitive data types, such as “location” and “IP address”. The use of generic terms reduces transparency. Users are left wondering which, if not all, “personal information” is shared. We find that 327 policies declare third-party sharing using generic terms that subsume data types in more than one category.


Figure 10: Statistics of policies that disclose the purposes for the collection of data types (per category). For example, 1,722 policies disclose the collection of data types in the “software identifier” category for “services” as the purpose.
Purposes

Figure 10 reports the statistics of policies that disclose purposes of data collection, as discussed in Section 3.3, per data type category.

Finding 4.

46.8% (2,359) of policies disclose that their apps collect data for non-core purposes. In over 75% of them (1,833), the main non-core purpose is advertising. We find that, while “software identifier” remains the most common data type (in 1,410 policies) used for non-core purposes, the potential use of other data types for non-core purposes is concerning. For instance, the collection of “geolocation”, “protected classification”, and “internet activity” for non-core purposes is declared in 21-28% of policies. The CCPA [2] defines “government identifiers”, “precise geolocation”, and certain protected classifications as sensitive personal information—the law limits the usage of these data (e.g., they cannot be used for personalized advertising without explicit consent).

4.3 Correct Definitions of Terms

The second novel application enabled by PoliGraph is assessing the correctness of terms and their definitions. In addition to summarizing policies in their own right, we can check whether a policy defines terms in ways that are consistent with external knowledge as captured by the global ontology. This is necessary because policies often provide their own definitions of terms. This is not a problem if the definitions align with external knowledge (e.g., privacy laws in Section 2.2), but it may be misleading if they do not. For example, many policies define “geolocation” as “non-personal information”. In this section, we focus on terms representing data types that are commonly used in policies, and we check whether their definitions in individual policies (as captured by their PoliGraphs’ local ontologies) align with our global data ontology derived from the CCPA (see Figure 2).

Overall, we find such divergent definitions in 21.0% (1,056 / 5,039) of the policies in our dataset, as listed in Table 2.

Hypernym → Hyponym (# policies)
non-personal info. → geolocation (112), ip address (112), device identifier (97), gender (76), application installed (68), age (67), advertising id (55), imei (20), cookie / pixel tag (20), coarse geolocation (19), android id (19), internet activity (18), mac address (14), date of birth (14)
aggregate/deidentified/pseudonymized info. → ip address (88), device identifier (84), geolocation (74), browsing / search history (14)
contact information → gender (12)
internet activity → ip address (18)
geolocation → ip address (70), router ssid (12), postal address (10)
personal identifier → advertising id (76), cookie / pixel tag (52), device identifier (45), age (27), geolocation (25), date of birth (21)
Table 2: Different definitions found in PoliGraphs with respect to the global data ontology. For example, “geolocation” is defined as “non-personal information” in 112 policies.
Finding 5.

Many policies define data types that they collect to be “non-personal”, “aggregated”, “deidentified”, or “pseudonymized”. However, this can be inconsistent with the definitions in the CCPA. Indeed, in the CCPA, “deidentified information” is defined as information that “cannot reasonably identify, relate to, describe … to a particular consumer”. Although entities technically can deidentify personal information, some of the data types we observe in Table 2, notably “geolocation”, “gender”, “age”, and “date of birth”, are generally considered personal information by the public and according to the CCPA. Declaring these data types as non-personal or deidentified can be misleading. For example, Paleblue declares in its policy [30] that “Paleblue may also invite you to share non-personal information about yourself which may include… (1) your age or date of birth; (2) your gender…”.

Finding 6.

Many policies use non-standard terms. They can have broad or varied definitions across different policies. For example, it is not surprising that the definition of “profile information” is application-specific. One policy from the Manager Readme app [31] defines “profile information” to include “name” and “location”, while another policy from Armor Game Inc. [32] defines it to include “gender” or “birthday”. In these cases, the use of non-standard terms is acceptable as the policies clearly explain what they mean by the terms. However, we also find many policies that do not clearly define their non-standard terms. Particularly, while “profile information” is found in 157 policies, subsumption relationships are found in only 17 of them in their corresponding PoliGraphs. Table 3 presents examples of non-standard terms and their possible definitions found in the policies.

Term (# Policies) Possible definitions found in policies
technical info. (238) From 99 policies: advertising id, age, android id, biometric information, cookie / pixel tag, device identifier, fingerprint, geolocation, imei, ip address, mac address, phone number, precise geolocation
profile info. (157) From 17 policies: age, contact information, date of birth, email address, gender, geolocation, person name, phone number, postal address
demographic info. (282) From 93 policies: age, date of birth, email address, gender, geolocation, internet activity, ip address, personal characteristic, postal address, precise geolocation, race / ethnicity, router ssid
usage data (192) From 56 policies: advertising id, android id, cookie / pixel tag, device identifier, geolocation, ip address, precise geolocation
tracking technology (219) From 44 policies: advertising id, android id, browsing / search history, cookie / pixel tag, device identifier, geolocation, imei, ip address, mac address
Table 3: Examples of non-standard terms found in policies, e.g., “technical information” is used in 238 policies but its detailed definition is only found in 99 policies.

4.4 Negative Statements Analysis

As in prior work [9, 5], we use PoliGraph to conduct negative statements analysis. We extend PoliGraph with NOT_COLLECT edges (see Appendix F for details) and use PoliGraph-er to identify candidate contradictions as defined in [9]. From the 4,729 historical policies, we extract 7,547 pairs of candidate contradictions from 47.9% (2,266) of the policies. We observe that many of the candidate contradictions turn out to be ambiguous or false alarms. We confirm these findings by manually checking many randomly chosen statements containing candidate contradictions, and we identify the following three common cases.

Case 1.

Contradiction caused by the global ontologies: 47.6% (3,592) of candidate contradictions are found because we extend the PoliGraphs into Ext-PoliGraphs with the help of our global ontologies. This indicates that corresponding policies do not literally contradict themselves. For example, SYSAPP TOOLS STUDIO’s policy [33] states: (1) “With your permission, we may collect additional information about the applications installed in the system …”, and (2) “We do not collect your personal information, including information such as your location, name, address …”. The two sentences form a contradiction in Ext-PoliGraph because we set “personal information” to include “applications installed” (see Figure 2). The developer might have a different opinion on what personal information includes. The negative statement, namely statement (2), can be considered misleading, but it does not directly contradict the other statement.

Case 2.

Contradiction caused by limited context: We find that many false alarms are caused by the inability of NLP-based approaches, including PoliGraph, to capture the full context of the sentences. Using regular expressions, we identify that in 58.7% (4,430) of candidate contradictions, the negative statements apply to a specific age group. For example, this kind of disclosure is very common in policies: “We do not collect personal information from minors”. Another common pattern is that the negative statements contain exception clauses, e.g., “We do not collect personal information without your approval”, which accounts for 14.0% (1,053) of candidate contradictions.

Case 3.

Contradiction caused by the inability to understand fine-grained semantic distinctions: Some policies declare that the apps collect and share data, but do not sell it. In 5.9% (449) of candidate contradictions, the negative statements use “sell”, “lease”, or similar verbs that indicate data sharing for profit. Unfortunately, prior work did not take such fine-grained semantic distinctions into account, and neither does PoliGraph-er, since understanding contradictions is not our focus. We leave it to future work.

4.5 Data Flow-to-Policy Consistency Analysis

Setup

We use the PoliGraph representation of a policy to check its consistency with the actual collection practices of the mobile app, i.e., the data flows ⟨data type, entity⟩ observed in the network traffic generated by the app. This application has been previously explored by PoliCheck [10] and subsequent works that apply PoliCheck to the mobile, VR, and voice assistant app ecosystems [12, 5, 6, 13]. For a fair comparison, we use the historical version of the policies in the PoliCheck dataset [10] (see Section 4.1). In addition to policy URLs for apps, this dataset also contains data flows extracted from the apps’ network traffic. We have 9,332 apps with both data flows and policies available. For the purpose of comparison to prior work, we define data flow consistency under the PoliGraph framework in a way that is compatible with PoliCheck’s (see Definition G.2 in Appendix G). (Recall that data types and entities are organized in ontologies.) In PoliCheck’s terminology, a clear disclosure means that the data flow ⟨data type, entity⟩ matches exactly a collection statement in the text of the policy; a vague disclosure means that at least one of the data type or entity matches a more general term in its respective ontology. However, ontologies in PoliCheck are global and heuristically defined. Instead, we distinguish between global and local ontologies, where the local ontology captures relations across multiple sentences within the app’s policy. Thus, we can capture more data flows that are consistent with the local ontology, i.e., within the context of their own policy.
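The clear/vague distinction can be sketched as follows. This is an illustrative simplification (not the paper's Definition G.2): a flow is a clear disclosure if the policy contains exactly that COLLECT edge, and a vague disclosure if a more general term subsuming the flow's data type or entity is disclosed. A transitive closure over SUBSUME edges stands in for the combined global + local ontology.

```python
def ancestors(term, subsume_edges):
    """All hypernyms of `term` under (hypernym, hyponym) SUBSUME edges."""
    found, frontier = set(), {term}
    while frontier:
        frontier = {h for h, c in subsume_edges if c in frontier} - found
        found |= frontier
    return found

def classify_flow(flow, collect_edges, subsume_edges):
    """Classify a (entity, data) flow as clear, vague, or omitted."""
    entity, data = flow
    if (entity, data) in collect_edges:
        return "clear"
    gen_entities = {entity} | ancestors(entity, subsume_edges)
    gen_data = {data} | ancestors(data, subsume_edges)
    if any((e, d) in collect_edges for e in gen_entities for d in gen_data):
        return "vague"
    return "omitted"

subsume = {("personal information", "ip address"),
           ("third party", "advertiser")}
collect = {("we", "personal information")}
print(classify_flow(("we", "ip address"), collect, subsume))        # "vague"
print(classify_flow(("advertiser", "ip address"), collect, subsume))  # "omitted"
```

Under this sketch, adding a local-ontology SUBSUME edge (e.g., a policy-specific definition of “personal information”) can turn an "omitted" flow into a "vague" or "clear" one, which is precisely the effect the local ontology has on coverage.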

Findings

First, we find that, compared to PoliCheck [10], PoliGraph-er captures more clear disclosures, namely statements in the policy that agree exactly with the data flows. For example, the statements in Figure 1(a) clearly disclose that “we” (first party) collects “IP address” and “location”. PoliCheck underestimates the number of clear disclosures for all data types due to its limited coverage, inherited from PolicyLint (see Section 4.1). We find that the collection of “contact information”, including “phone number” and “email address”, is clear in more than 50% of the policies. Second, we adopt the methodology in [6] to extract purposes of data collection. We find that out of 1,079 apps that PoliGraph-er is able to extract purposes from, 731 apps send data for core purposes and 686 apps send data for non-core purposes (see Section 3.3). Please refer to Appendix G for more details.

5 Related Work

Formalizing Policies

A body of related work focuses on standardizing or formalizing policies. The W3C P3P standard [34] proposed an XML schema to describe policies. The Contextual Integrity (CI) framework [35] expresses policies as information flows with parameters including the senders, recipients, and subjects of information, data types, and transmission principles that describe the contexts of data collection. Neither replaces text-format policies, but both give insights into defining policies and serve as analysis frameworks. PoliGraph builds on the CI framework by extracting entities, data types, and part of the transmission principle (i.e., purposes) from the policy text.

Policy Analysis

Another body of work analyzes policy text. OPP-115 [7] is a policy dataset with manual annotations for fine-grained data practices labeled by experts. Shvartzshnaider et al. [36], with the help of crowdsourced workers, analyze CI information flows extracted from policies to identify writing issues, such as incomplete context and vagueness. This manual approach is difficult to scale up to hundreds or thousands of policies due to the significant human effort required.

Automated Policy Analysis

The progress in NLP has made it possible to automate the analysis of unstructured text, such as policy text. Privee [37] uses binary text classifiers to answer whether a policy specifies certain privacy practices, such as data collection, encryption, and ad tracking. Polisis [11], trained on the OPP-115 dataset, uses 10 multi-label text classifiers to identify data practices, such as the category of data types being discussed and purposes. Classifier-based methods use pre-defined labels, which cannot capture finer-grained semantics in the text. PolicyLint [9] first uses NLP linguistic analysis to extract data types and entities in collection statements. PurPliance [5], built on top of PolicyLint, further extracts purposes. Conceptually, both works focus on analyzing one sentence at a time, extracting a tuple ⟨entity, collect, data type⟩ (PurPliance additionally extracts a purpose, in a separate nested tuple ⟨data type, for / not_for, entity, purpose⟩). Unlike PoliGraph, these works are limited in that they view extracted tuples individually and cannot infer data practices disclosed across multiple sentences.

Knowledge Graphs

Graphs are routinely used to integrate knowledge bases as relationships between terms [15]. Google has used a knowledge graph built from crawled data to show suggestions in search results [38]. OpenIE [39] and T2KG [16] use NLP to build knowledge graphs from a large corpus of unstructured text. In PoliGraph, we use knowledge graphs, for the first time, to represent policies.

6 Conclusion

Summary

In this paper, we present PoliGraph, a framework that represents a policy as a knowledge graph that (1) connects information about data collection practices across different parts of the policy text, and (2) clearly distinguishes between local and global ontologies. We show that PoliGraph facilitates new policy analyses, such as policy summarization and analysis of term definitions. We also empirically evaluate the effectiveness of PoliGraph-er w.r.t. state-of-the-art tools, such as PolicyLint and PoliCheck, and show that PoliGraph-er generates PoliGraphs that enable much more effective policy analysis. We plan to make PoliGraph’s source code and dataset publicly available.

Future Directions

A strength of PoliGraph is that it can be easily extended to analyze additional aspects of a policy and incorporate them into the knowledge graph. For example, one can differentiate among data collection, sharing vs. selling, and use of information; and encode them using new types of edges. We can also encode different purposes according to the type of actions (collection vs. usage purposes) as edge attributes. One can also define additional global ontologies based on privacy laws beyond the CCPA, such as the GDPR. Ultimately, we envision that PoliGraph’s global ontologies and data structure template can be proactively defined by policymakers, and implemented by systems and their policies.

References

  • [1] Publications Office of the European Union, “General Data Protection Regulation (GDPR).” https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:02016R0679-20160504, 2016.
  • [2] State of California Department of Justice, “California Consumer Privacy Act of 2018 (amended by the California Privacy Rights Act of 2020).” https://leginfo.legislature.ca.gov/faces/codes_displayText.xhtml?lawCode=CIV&division=3.&title=1.81.5.&part=4., 2020.
  • [3] S. Jordan, S. Narasimhan, and J. Hong, “Deficiencies in the disclosures of privacy policies,” in Proceedings of the 49th Research Conference on Communication, Information and Internet Policy (TPRC), 2021.
  • [4] KAYAK, “KAYAK - Privacy Policy.” https://www.kayak.com/privacy, July 2022.
  • [5] D. Bui, Y. Yao, K. G. Shin, J.-M. Choi, and J. Shin, “Consistency analysis of data-usage purposes in mobile apps,” in Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security, CCS ’21, Association for Computing Machinery, 2021.
  • [6] R. Trimananda, H. Le, H. Cui, J. T. Ho, A. Shuba, and A. Markopoulou, “OVRseen: Auditing network traffic and privacy policies in oculus VR,” in 31st USENIX Security Symposium (USENIX Security 22), USENIX Association, Aug. 2022.
  • [7] S. Wilson, F. Schaub, A. A. Dara, F. Liu, S. Cherivirala, P. G. Leon, M. S. Andersen, S. Zimmeck, K. M. Sathyendra, N. C. Russell, et al., “The creation and analysis of a website privacy policy corpus,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Aug. 2016.
  • [8] S. Wilson, F. Schaub, R. Ramanath, N. Sadeh, F. Liu, N. A. Smith, and F. Liu, “Crowdsourcing annotations for websites’ privacy policies: Can it really work?,” in Proceedings of the 25th International Conference on World Wide Web, WWW ’16, Apr. 2016.
  • [9] B. Andow, S. Y. Mahmud, W. Wang, J. Whitaker, W. Enck, B. Reaves, K. Singh, and T. Xie, “PolicyLint: Investigating internal privacy policy contradictions on google play,” in 28th USENIX Security Symposium (USENIX Security 19), USENIX Association, Aug. 2019.
  • [10] B. Andow, S. Y. Mahmud, J. Whitaker, W. Enck, B. Reaves, K. Singh, and S. Egelman, “Actions speak louder than words: Entity-Sensitive privacy policy and data flow analysis with PoliCheck,” in 29th USENIX Security Symposium (USENIX Security 20), USENIX Association, Aug. 2020.
  • [11] H. Harkous, K. Fawaz, R. Lebret, F. Schaub, K. G. Shin, and K. Aberer, “Polisis: Automated analysis and presentation of privacy policies using deep learning,” in 27th USENIX Security Symposium (USENIX Security 18), USENIX Association, Aug. 2018.
  • [12] C. Lentzsch, S. J. Shah, B. Andow, M. Degeling, A. Das, and W. Enck, “Hey alexa, is this skill safe?: Taking a closer look at the alexa skill ecosystem,” in Network and Distributed Systems Security (NDSS) Symposium 2021, Feb. 2021.
  • [13] U. Iqbal, P. N. Bahrami, R. Trimananda, H. Cui, A. Gamero-Garrido, D. Dubois, D. Choffnes, A. Markopoulou, F. Roesner, and Z. Shafiq, “Your echos are heard: Tracking, profiling, and ad targeting in the amazon smart speaker ecosystem,” arXiv preprint arXiv:2204.10920, 2022.
  • [14] S. Manandhar, K. Kafle, B. Andow, K. Singh, and A. Nadkarni, “Smart home privacy policies demystified: A study of availability, content, and coverage,” in 31st USENIX Security Symposium (USENIX Security 22), USENIX Association, Aug. 2022.
  • [15] G. A. Miller, “Wordnet: A lexical database for english,” Communications of the ACM, vol. 38, p. 39–41, Nov. 1995.
  • [16] N. Kertkeidkachorn and R. Ichise, “T2kg: A demonstration of knowledge graph population from text and its challenges,” in Workshop and Poster Proceedings of the 8th Joint International Semantic Technology Conference, 2018.
  • [17] B. Andow, “benandow/PrivacyPolicyAnalysis: The code for PolicyLint and PoliCheck.” https://github.com/benandow/PrivacyPolicyAnalysis, 2020.
  • [18] State of California Department of Justice, “CCPA Regulations.” https://oag.ca.gov/privacy/ccpa/regs, 2018.
  • [19] Duck Duck Go, Inc., “duckduckgo/tracker-radar: Data set of top third party web domains with rich metadata about them.” https://github.com/duckduckgo/tracker-radar, 2022.
  • [20] Peter Tripp (@notpeter), “notpeter/crunchbase-data: 2015 CrunchBase Data Export as CSV.” https://github.com/notpeter/crunchbase-data, 2016.
  • [21] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “Roberta: A robustly optimized bert pretraining approach,” arXiv preprint arXiv:1907.11692, 2019.
  • [22] M. Honnibal, I. Montani, S. Van Landeghem, and A. Boyd, “spaCy: Industrial-strength Natural Language Processing in Python.” https://spacy.io/, 2020.
  • [23] M.-C. de Marneffe, T. Dozat, N. Silveira, K. Haverinen, F. Ginter, J. Nivre, and C. D. Manning, “Universal Stanford dependencies: A cross-linguistic typology,” in Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), European Language Resources Association (ELRA), May 2014.
  • [24] R. P. Hudson, “msg-systems/coreferee: Coreference resolution for English, French, German and Polish, optimised for limited training data and easily extensible for further languages.” https://github.com/msg-systems/coreferee, 2022.
  • [25] A. T. Arampatzis, T. van der Weide, P. van Bommel, and C. Koster, “Linguistically-motivated information retrieval,” in Technical Report CSI-R9918, Sept. 1999.
  • [26] Pixel Tale Games, ““Puzzle 100 Doors” Privacy Policy.” https://proteygames.github.io/, July 2022.
  • [27] W. Yin, J. Hay, and D. Roth, “Benchmarking zero-shot text classification: Datasets, evaluation and entailment approach,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Nov. 2019.
  • [28] Hugging Face, Inc., “Zero Shot Topic Classification.” https://huggingface.co/zero-shot/, 2022.
  • [29] Microsoft, “Fast and reliable end-to-end testing for modern web apps | Playwright.” https://playwright.dev/, 2022.
  • [30] Paleblue Corp., “Paleblue Corp. Privacy Policy.” http://nalst.cafe24.com/policy.html, 2015.
  • [31] O. Ellenbogen, “Manager Readme Privacy Policy.” https://managerreadme.com/privacy, 2022.
  • [32] Armor Games Inc., “Armor Games Inc. Privacy Policy.” https://armorgamesstudios.com/privacy/, Jan. 2020.
  • [33] SYSAPP TOOLS STUDIO, “Privacy Policy – SYS APP TOOLS STUDIO.” https://sysapptools.wordpress.com/privacy-policy/, Mar. 2016.
  • [34] World Wide Web Consortium and others, “The Platform for Privacy Preferences 1.1 (P3P1.1) Specification.” https://www.w3.org/TR/2018/NOTE-P3P11-20180830/, 2018.
  • [35] H. Nissenbaum, Privacy in Context - Technology, Policy, and the Integrity of Social Life. Stanford University Press, 2009.
  • [36] Y. Shvartzshnaider, N. Apthorpe, N. Feamster, and H. Nissenbaum, “Going against the (appropriate) flow: A contextual integrity approach to privacy policy analysis,” in Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, Association for the Advancement of Artificial Intelligence, Oct. 2019.

  • [37] S. Zimmeck and S. M. Bellovin, “Privee: An architecture for automatically analyzing web privacy policies,” in 23rd USENIX Security Symposium (USENIX Security 14), USENIX Association, Aug. 2014.
  • [38] Google, “Introducing the Knowledge Graph: things, not strings.” https://blog.google/products/search/introducing-knowledge-graph-things-not/, 2012.
  • [39] G. Angeli, M. J. J. Premkumar, and C. D. Manning, “Leveraging linguistic structure for open domain information extraction,” in Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, 2015.
  • [40] Arc90 Inc and Mozilla, “mozilla/readability: A standalone version of the readability lib.” https://github.com/mozilla/readability, 2022.
  • [41] Wikidata contributors, “Wikidata.” https://www.wikidata.org/, 2022.
  • [42] D. Bui, Y. Yao, K. G. Shin, J.-M. Choi, and J. Shin, “ducalpha/PurPlianceOpenSource: Source code of PurPliance analysis tool.” https://github.com/ducalpha/PurPlianceOpenSource, 2022.
  • [43] H. Jin, M. Liu, K. Dodhia, Y. Li, G. Srivastava, M. Fredrikson, Y. Agarwal, and J. I. Hong, “Why are they collecting my data? inferring the purposes of network traffic in mobile apps,” in Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, Association for Computing Machinery, Dec. 2018.

Appendix A HTML Preprocessing

In Section 3.1, we explain that PoliGraph-er converts each policy into a simplified document tree structure for subsequent NLP tasks. We provide more details in this appendix.

Document Tree

Given a policy in HTML format, PoliGraph-er starts by converting it into a document tree, a simplified HTML DOM tree. Each node in the document tree corresponds to a fragment of text, referred to as a segment. PoliGraph-er uses three kinds of segments: (1) a heading segment, corresponding to an HTML heading, whose parent, if it exists, is the next higher-level heading segment; (2) a list-item segment, corresponding to an item in an HTML list, whose parent is the text or heading segment immediately before the list; and (3) a text segment, corresponding to a general HTML text container that is neither a heading nor a list item, whose parent is the closest heading.

A hypothetical document tree, modified from the policy example in Figure 1(a), looks like this:

Example A.1.

HEADING Data Collection
  TEXT We collect the following personal information:
    LISTITEM Device information, such as IP address…
    LISTITEM Location
  TEXT We disclose the personal information as follows…

Technically, PoliGraph-er first uses the Readability.js library [40] to extract the main content of a webpage (i.e., without the sidebar, footer, and other unrelated widgets). Then it parses the HTML markup to generate the document tree. In some cases, webpages use plain text lists (e.g., prefixing each paragraph with numbers) instead of HTML lists. PoliGraph-er has a fallback mechanism to identify such lists and convert them into list-item segments.
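A minimal sketch of this data structure (the class and helper names are hypothetical, not PoliGraph-er's actual code) that rebuilds the tree of Example A.1:

```python
from dataclasses import dataclass, field

@dataclass
class Segment:
    kind: str                         # "heading", "text", or "listitem"
    text: str
    parent: "Segment | None" = None
    children: list = field(default_factory=list)

def attach(child, parent):
    """Link a segment under its parent, per the three parenting rules above."""
    child.parent = parent
    parent.children.append(child)
    return child

# Rebuilding the tree of Example A.1 by hand:
heading = Segment("heading", "Data Collection")
intro = attach(Segment("text", "We collect the following personal information:"), heading)
attach(Segment("listitem", "Device information, such as IP address"), intro)
attach(Segment("listitem", "Location"), intro)
attach(Segment("text", "We disclose the personal information as follows"), heading)
```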

Text Concatenation

The NLP pipeline expects text input, and to produce correct results it needs complete sentences. Some segments, like the two list items in Example A.1, are not self-contained enough to form full sentences. Each of them has to be concatenated after its parent text segment to form a complete sentence, for example, “We collect the following personal information: Name”. To do this, PoliGraph-er leverages the document tree to find the context of each segment.

In reality, nested headings and lists are prevalent, and many policies are not written in a way that allows sentence boundaries to be easily determined. It is hard to determine at which level a complete sentence starts without another NLP model. To work around this issue, PoliGraph-er iteratively concatenates each segment with its parents up to different levels to get multiple versions of the input text. For each LISTITEM in Example A.1, PoliGraph-er generates three versions of the text for NLP: (1) LISTITEM; (2) TEXT+LISTITEM; and (3) HEADING+TEXT+LISTITEM. The NLP pipeline receives valid input as long as any of these versions aligns with sentence boundaries.
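A sketch of this fallback (function name hypothetical), operating on the chain of a segment's text followed by its ancestors' texts:

```python
def context_versions(chain):
    """chain holds a segment's text followed by its ancestors' texts,
    innermost first, e.g. [LISTITEM, TEXT, HEADING]. Returns candidate
    inputs for the NLP pipeline, from least to most context."""
    return [" ".join(reversed(chain[:depth]))
            for depth in range(1, len(chain) + 1)]

versions = context_versions([
    "Location",                                        # LISTITEM
    "We collect the following personal information:",  # parent TEXT
    "Data Collection",                                 # grandparent HEADING
])
```

Here `versions[0]` is the list item alone, `versions[1]` prepends its parent text segment, and `versions[2]` additionally prepends the heading.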

NLP Pipeline

PoliGraph-er uses the spaCy library [22] and its en_core_web_trf NLP pipeline, a set of NLP models built upon the RoBERTa architecture [21], to annotate common linguistic features. These features, including word lemmas, part-of-speech tags, sentence segmentation, and syntactic dependencies, are purely syntactic and thus require no domain adaptation. The RoBERTa-based pipeline achieves state-of-the-art performance on these tasks. We only train our own NER model for identifying data types and entities (see Appendix B). PoliGraph-er relies on these linguistic features to build PoliGraphs. Although spaCy’s NLP pipeline processes plain text and does not keep the document tree structure, the analyzer internally records which segment each word comes from, so it can map back to the document tree whenever needed.

Appendix B Identifying Data Types and Entities

In Section 3.1, we explain that PoliGraph-er uses named entity recognition (NER) to identify data types (labeled as DATA) and entities (labeled as ENTITY) in the policy text. We provide details on our NER methodology in this appendix.

Pretrained NER Model

We use the standard NER model in spaCy’s NLP pipeline to identify company, person, and product names, which are mapped to the ENTITY label. However, the model recognizes neither data types nor generic entity categories like “advertising provider”. To cover these, PoliGraph-er adds a custom NER model on top of spaCy’s existing one.

Custom NER Model

Training a new NER model requires a labeled corpus. To avoid the burden of building a dataset from scratch, we programmatically generate a synthetic corpus.

We remove data types and entities from real sentences in the policy text to generate 110 sentence templates. We obtain an entity phrase list of about 12K names by crawling Wikidata [41] items that are related to companies or products (including generic terms like category names of entities). We manually write a phrase list of about 250 data types. The training corpus is generated by filling in the templates with random entities and data types in the phrase lists. The corpus generation script applies a few rules to transform data type phrases to increase the diversity of phrases. For example, there are “timezone setting”, “language setting”, and other “… setting”s in the phrase list. The script has a rule to randomly replace the word “setting” with its synonyms “preference” or “configuration” to generate more phrases.
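The generation procedure can be sketched as follows; all lists below are toy stand-ins for the real 110 templates, ~12K entity names, and ~250 data type phrases, and the function names are ours:

```python
import random

# Toy stand-ins for the real template and phrase lists.
TEMPLATES = ["{ENTITY} may collect your {DATA}.",
             "We share {DATA} with {ENTITY}."]
ENTITIES = ["Google", "advertising provider"]
DATA_TYPES = ["timezone setting", "IP address"]
SYNONYMS = {"setting": ["preference", "configuration"]}

def vary(phrase, rng):
    """Randomly swap the last word for a synonym to diversify phrases."""
    words = phrase.split()
    if words[-1] in SYNONYMS and rng.random() < 0.5:
        words[-1] = rng.choice(SYNONYMS[words[-1]])
    return " ".join(words)

def generate(n, seed=0):
    """Fill templates with random entities and data types; emit character
    spans (start, end, label) as NER training annotations."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        entity = rng.choice(ENTITIES)
        data = vary(rng.choice(DATA_TYPES), rng)
        text = rng.choice(TEMPLATES).format(ENTITY=entity, DATA=data)
        spans = [(text.index(entity), text.index(entity) + len(entity), "ENTITY"),
                 (text.index(data), text.index(data) + len(data), "DATA")]
        samples.append((text, spans))
    return samples
```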

We train the custom NER model with the same configuration as the one in the en_core_web_trf pipeline, which is RoBERTa-based and achieves state-of-the-art performance. (In our validation, precision and recall are both above 99%; this is not indicative of real-world performance due to the limited diversity of the synthetic data.)

Table 4: Root words used for rule-based NER.
  DATA: identifier, address, preference, number, datum, data, setting, information
  ENTITY: party, service, operator, corporation, broker, processor, publisher, analytic, platform, app, analytics, carrier, organization, business, product, advertiser, software, vendor, provider, site, affiliate, application, distributor, partner, website, subsidiary, company, network
Rule-based NER

In addition to neural NER models, PoliGraph-er also leverages some rules or heuristics to label noun phrases that clearly refer to data types or entities.

First, PoliGraph-er searches for noun phrases with the words in Table 4 as their root words and assigns the corresponding labels. The root word of a phrase is determined from its syntactic dependency tree, so this is not mere string matching. For example, the phrase “service information” has the root word “information” and is labeled DATA, while the phrase “information service(s)” has the root word “service” and is labeled ENTITY.

Second, PoliGraph-er assumes that coordinating noun phrases (or conjuncts) have the same NER label. For example, in the sentence “We collect Android ID and IDFV”, if the NER model labels “Android ID” as DATA, but misses “IDFV”, this rule will copy the label to “IDFV”.
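The two rules can be sketched as follows (root word lists abridged from Table 4; function names are ours, and the label propagation assumes the conjuncts have already been grouped by the dependency parse):

```python
# Abridged root lemma lists from Table 4.
DATA_ROOTS = {"identifier", "address", "preference", "number",
              "datum", "data", "setting", "information"}
ENTITY_ROOTS = {"party", "service", "provider", "company", "advertiser",
                "platform", "network", "affiliate", "partner"}

def label_by_root(root_lemma):
    """The root word comes from the syntactic dependency parse of the noun
    phrase (e.g. "information" in "service information"), not from string
    position, so "information service" is still labeled ENTITY."""
    if root_lemma in DATA_ROOTS:
        return "DATA"
    if root_lemma in ENTITY_ROOTS:
        return "ENTITY"
    return None

def propagate_conjuncts(labels):
    """Copy a known NER label to unlabeled coordinated noun phrases,
    e.g. ["DATA", None] for "Android ID and IDFV" becomes ["DATA", "DATA"]."""
    known = next((label for label in labels if label is not None), None)
    return [label if label is not None else known for label in labels]
```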

Appendix C Syntactic Patterns

In Section 3.2, we explain that PoliGraph-er’s collection annotator depends on a set of syntactic patterns to match and extract entities and data types from the dependency tree. Table 5 shows a representative set of examples of the syntactic patterns in the active voice.

Furthermore, PoliGraph-er’s subsumption annotator also relies on a set of patterns to identify and extract subsumption relations. Table 6 shows such syntactic patterns.

Table 5: Syntactic patterns (in the active voice) used by the collection annotator. Each row lists root verbs, an example sentence, and the dependency pattern that extracts the ENTITY and DATA arguments.

  • share, trade, exchange (“We share your device IDs with Google.”): ENTITY:nsubj, DATA:dobj, with → ENTITY:pobj
  • collect, gather, obtain, get, receive, solicit, acquire (“Google may collect your device IDs.”): ENTITY:nsubj, DATA:dobj
  • provide, supply (“We provide Google with your device IDs.”): ENTITY:nsubj, ENTITY:dobj, with → DATA:pobj
  • provide, supply, release, disclose, transfer, transmit, sell, give, pass, divulge (“We may transmit device IDs to Google.”): ENTITY:nsubj, DATA:dobj, to → ENTITY:pobj
  • use, keep, access, analyze, process, store, save, log, utilize, record, retain, preserve, need (“Google may use your device IDs.”): ENTITY:nsubj, DATA:dobj
  • have, get (“Google has access to your device IDs.”): ENTITY:nsubj, access → to → DATA:pobj
  • make (“Google makes use of device IDs.”): ENTITY:nsubj, use:dobj, of → DATA:pobj
  • enable, allow, permit, authorize, ask, require (“This enables Google to collect your device IDs.”; “You authorize Google to collect your device IDs.”): compounded with the above patterns
Table 6: Syntactic patterns used by the subsumption annotator, where H denotes the hypernym phrase and D the hyponym phrases.

  Phrases: “H such as D”; “H, for example, D”; “H, e.g. / i.e., D”; “H, which includes D”; “H including / like D”; “H, especially / particularly, D”; “H, including but not limited to, D”; “D (collectively H)”
  Sentences: “H includes D”; “H includes but is not limited to D”

Appendix D Phrase Normalization

In Section 3.3, we explain that PoliGraph-er performs phrase normalization using a few strategies. We provide more details on these strategies in this appendix.

Data types and entities are both noun phrases. PoliGraph-er uses three strategies to normalize noun phrases.

First, for every noun phrase, PoliGraph-er builds a normalized form by lemmatizing every word and removing stop words from the phrase. The stop word list includes typical English stop words and certain adjectives that are unimportant in the context of policies, such as “other”, “certain”, and “various”. For example, “your account details” and “other account details” are both normalized into “account detail”.
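A minimal sketch of this first strategy, assuming a toy lemma table in place of spaCy's lemmatizer (all names here are ours):

```python
# Stop words include ordinary English stop words plus adjectives that carry
# no meaning in policy phrases.
STOP_WORDS = {"a", "an", "the", "your", "our", "other", "certain", "various"}
LEMMAS = {"details": "detail", "settings": "setting", "addresses": "address"}

def normalize_phrase(phrase):
    """Lemmatize each word, then drop stop words."""
    words = [LEMMAS.get(w, w) for w in phrase.lower().split()]
    return " ".join(w for w in words if w not in STOP_WORDS)
```

Both “your account details” and “other account details” normalize to “account detail”, so they map to the same PoliGraph node.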

Second, for a set of standard terms listed in Table 7, we manually write regular expressions to capture more synonyms. These terms come from the global data and entity ontologies (see Section 2.2). For example, the regular expression contact.*(information|data|detail|method) maps “contact data”, “contact details”, “contact method”, and other synonyms to “contact information”, which is considered the standard, normalized term.

To normalize company names, we build regular expressions from public datasets. We use the DuckDuckGo Tracker Radar dataset [19] and a CrunchBase-based public dataset [20] to obtain variants of company names, e.g., “Alphabet Inc.” (group company) and “Firebase” (product) as alternative names for Google (the normalized term). We extract n-grams that are uniquely found in each company’s list of alternative names to build regular expressions for the company. For example, the two-grams “alphabet inc” and “google inc”, and the one-grams “firebase” and “google”, are all normalized to “Google”.
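A sketch of the unique-n-gram construction (function names are ours; the real implementation works over the full Tracker Radar and CrunchBase name lists):

```python
import re

def ngrams(name, n):
    words = name.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def build_patterns(alt_names):
    """alt_names maps a normalized company name to its alternative names.
    Only n-grams unique to one company enter its pattern, so a shared token
    like "inc" never matches any company."""
    grams = {}
    for company, names in alt_names.items():
        gs = set()
        for name in names:
            gs |= ngrams(name, 1) | ngrams(name, 2)
        grams[company] = gs
    patterns = {}
    for company, gs in grams.items():
        shared = set()
        for other, other_gs in grams.items():
            if other != company:
                shared |= other_gs
        unique = gs - shared
        if unique:
            alternation = "|".join(re.escape(g) for g in
                                   sorted(unique, key=len, reverse=True))
            patterns[company] = re.compile(rf"\b(?:{alternation})\b")
    return patterns

def normalize_company(phrase, patterns):
    for company, pattern in patterns.items():
        if pattern.search(phrase.lower()):
            return company
    return None
```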

Third, as a special case of the first strategy, a phrase that is lemmatized into a single word listed in Table 4, or into “third party”, is handled differently. These words have context-dependent meanings, and we do not assume consistent semantics across a policy. If such a phrase is bounded by SUBSUME edges, e.g., “information” in “we collect information such as your name”, it is mapped to a unique numbered node (e.g., “information_18”) to prevent it from being merged with other occurrences of “information” while maintaining the subsumption relationship. Otherwise, if the phrase is not bounded, e.g., “information” in “we collect information to provide ads”, it is mapped to a special node “unspecified data”, since the phrase is not further specified by the policy. The counterpart for entities is “unspecified third party”, used when the policy refers vaguely to the entities that collect data as “third party”, “partners”, or similar words. For example, in “we share the information we collect with third parties”, “third parties” is mapped to “unspecified third party”.

Table 7: Standard terms for noun phrase normalization. Synonymous terms in policies are normalized into a standard term, e.g., “contact details” and “contact info” are normalized into “contact information”.
  Data types: personal information, personal identifier, identifier, government identifier, SSN, passport number, driver’s license number, age, gender, race / ethnicity, date of birth, personal characteristics, protected classification, biometric information, voice print, fingerprint, contact information, person name, phone number, email address, postal address, device identifier, software identifier, hardware identifier, IMEI, MAC address, advertising ID, Android ID, SIM serial number, router SSID, IP address, cookie / pixel tag, internet activity, browsing / search history, package dump, geolocation, precise geolocation, coarse geolocation, non-personal information, aggregate / deidentified / pseudonymized information
  Entities: we (first party), advertiser, auth provider, analytic provider, social media, content provider (and entities in the Tracker Radar and CrunchBase datasets)

Appendix E PoliGraph-er Performance in Detail

In Section 4.1, we report our validation of PoliGraph-er’s performance. We provide more details in this appendix.

Validation of PoliGraph-er’s results

First, we manually evaluate whether PoliGraph-er extracts the correct edges from the PoliCheck dataset, as follows. To evaluate the precision of PoliGraph edges, we sample five edges from each of 100 randomly selected PoliGraphs in the dataset and manually read the corresponding policy text to validate whether each edge is correctly extracted from the text. To help with this evaluation, PoliGraph-er stores the sentences from which each edge is generated. To evaluate the recall of PoliGraph edges, we randomly select 100 policies to read and manually extract five edges from each of them, and check the corresponding PoliGraphs to see whether each relationship is captured by PoliGraph-er. We find that the precision and recall for COLLECT edges are 91.8% and 62.4%, respectively, and for SUBSUME edges 94.0% and 66.2%. Thus, PoliGraph-er’s overall precision is comparable to, or better than, prior work [11, 9, 5]: PolicyLint and PurPliance report 82% and 91% precision, respectively [9, 5].

Next, we evaluate PoliGraph-er’s purpose classification. We select five purpose phrases from each of 100 randomly selected policies, manually assign labels to the phrases, and compare them to the ones generated by PoliGraph-er. Overall, the macro-averaged precision and recall are 82.5% and 79.3%, respectively, for this multi-label, multi-class classification task. (The precision is not as high as that of PurPliance’s purpose classifier [5], which was specifically tailored to policies, whereas we use zero-shot classification based on a generic language model [28]; one advantage is that the model works reasonably well without requiring further training.)

PoliGraph-er’s coverage

We compare PoliGraph-er’s coverage for policy analysis, namely how many of the collection statements in a policy can be analyzed by our tool, against the state of the art, i.e., PolicyLint [9]. (We contacted the authors of PurPliance, but its source code was still in the release process [42], whereas Polisis is offered as a web-based service that is not open-sourced and often unavailable due to server issues. Neither tool was available when we performed this comparison, around September 2022; since we cannot obtain their source code, it is unfeasible to reason about and compare these tools with PoliGraph-er.) For a fair comparison between PoliGraph-er and PolicyLint, we use the historical versions (around January 1, 2019) of the policies obtained from the Wayback Machine, so as to match the publication date of PolicyLint [9] as closely as possible. We are able to obtain 4,729 unique policies in total.

We convert all pairs of relations in PoliGraphs into PolicyLint tuples ⟨entity, collect, data type⟩. We cannot compare terms from the two tools directly because they use different phrase normalization techniques that result in different normalized forms. To work around this, we select a subset of terms to compare. For data types, we only consider the following precise data types in PolicyLint, since they are comparable to the same data types extracted by PoliGraph-er: “mac address”, “router ssid”, “android id”, “gsf id”, “sim serial number”, “serial number”, “imei”, “advertising identifier”, “email address”, “phone number”, “person name”, and “geographical location”. (We map “coarse geolocation”, “precise geolocation”, and “geolocation” in PoliGraph all to “geographical location” in the tuples because PolicyLint does not distinguish between them.) For entities, we only distinguish between first party and third party, i.e., all tuples are converted to either ⟨we, collect, data type⟩ or ⟨third party, collect, data type⟩; “unspecified third party” is considered a third party.
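As an illustration of this conversion (the function and constant names are ours, not from either tool), collapsing entities to first vs. third party and merging the geolocation variants can be sketched as:

```python
FIRST_PARTY = {"we"}
GEO_TYPES = {"geolocation", "coarse geolocation", "precise geolocation"}

def to_comparable_tuple(entity, data_type):
    """Map a PoliGraph collect relation to a PolicyLint-style tuple for the
    comparison: merge geolocation variants, collapse entities to two classes."""
    d = "geographical location" if data_type in GEO_TYPES else data_type
    e = "we" if entity in FIRST_PARTY else "third party"
    return (e, "collect", d)
```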

In total, the two tools find 8,581 unique comparable tuples. Among them, 90.9% (7,797) are found by PoliGraph-er, while 56.4% (4,840) are found by PolicyLint. Thus, although recall has not been reported in previous work, based on the 62.4% recall of PoliGraph COLLECT edges, we roughly estimate that the recall of PolicyLint is less than 40%. While, in general, many policies are written in a non-standard way, which poses a great challenge for NLP analysis, we argue that PoliGraph gives much better coverage and recall due to its inherent graph structure, which allows it to infer more collection statements by looking across multiple sentences in a policy, e.g., via coreference resolution.

When we run PolicyLint on the current versions of the policies, its coverage becomes even worse. Upon investigating this issue further, we find that PolicyLint’s phrase normalization (i.e., its synonym list [9]) overfits the original version of the policies used by the authors and does not generalize across different versions due to subtle changes in the text. This aligns with the observation by Manandhar et al. that state-of-the-art policy analysis tools incorrectly reason about more than half of the policies in their dataset [14].

Appendix F Negative Statements Analysis and Candidate Contradictions

In Section 4.4, we report our findings on negative statements analysis. We provide more details in this appendix.

By default, and unlike prior work [9, 5], we choose not to analyze negative statements, for two reasons. First, privacy laws do not require policies to disclose what is not collected. Second, we find that the semantics of negative statements cannot be reliably determined through NLP-based linguistic analysis. While contradictions are discussed in previous works [9, 10, 5], our findings show that the majority of the previously reported contradictions could be ambiguous or false alarms. To demonstrate this, in this section we extend PoliGraph-er to analyze negative statements and integrate them into the generated PoliGraph. Specifically, we add NOT_COLLECT edges and not_collect(e, d) relations to represent negative statements in PoliGraph. Next, we define a candidate contradiction in a policy, compatible with the same concept used in PolicyLint [9], as follows.

Definition F.1.

Candidate Contradiction. Let Ext-PoliGraph G_ext contain data types d1, d2, and entities e1 and e2 (possibly e1 = e2). We say that two relations collect(e1, d1) and not_collect(e2, d2) form a candidate contradiction in G_ext iff both data types have a subsumption relation, i.e., subsume(d1, d2) ∨ subsume(d2, d1), and both entities also have a subsumption relation, i.e., subsume(e1, e2) ∨ subsume(e2, e1).

Based on the Ext-PoliGraphs of the 4,729 historical policies, we extract 7,547 pairs of candidate contradictions from 47.9% (2,266) of the policies. (Definition F.1 aligns with the contradictions defined in [9]; as PolicyLint always uses its global ontologies to analyze contradictions, analyzing contradictions in Ext-PoliGraphs better facilitates the comparison.) The percentage is higher than previously reported in [9] because PoliGraph-er captures more statements, and we do not use heuristic workarounds (e.g., regular expressions) to reduce false alarms, as PolicyLint does [9]. Many of these candidate contradictions turn out to be ambiguous contradictions or false alarms, as reported in Section 4.4.
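Definition F.1 can be realized as a short check over the extracted relations. The sketch below assumes a toy transitive-closure view of subsumption and illustrative ontology edges and tuples; it is not the paper's implementation.

```python
# Hedged sketch of Definition F.1: flag candidate contradictions between
# collect(e1, d1) and not_collect(e2, d2) relations. The ontology edges
# and relation tuples below are illustrative, not from the paper.

def subsume(onto, a, b):
    """True iff a subsumes b (reflexively) under the given ontology edges."""
    if a == b:
        return True
    return any(subsume(onto, c, b) for c in onto.get(a, []))

def candidate_contradictions(collects, not_collects, data_onto, entity_onto):
    out = []
    for e1, d1 in collects:
        for e2, d2 in not_collects:
            data_related = subsume(data_onto, d1, d2) or subsume(data_onto, d2, d1)
            entity_related = subsume(entity_onto, e1, e2) or subsume(entity_onto, e2, e1)
            if data_related and entity_related:
                out.append(((e1, d1), (e2, d2)))
    return out

# Hypothetical subsumption edges: parent term -> list of narrower terms.
data_onto = {"personal information": ["email address"]}
entity_onto = {"third party": ["advertiser"]}

pairs = candidate_contradictions(
    collects=[("advertiser", "email address")],
    not_collects=[("third party", "personal information")],
    data_onto=data_onto, entity_onto=entity_onto,
)
print(pairs)
```

In this toy case, "we do not share personal information with third parties" clashes with "advertisers collect your email address" only because both the data types and the entities are related by subsumption, exactly the two conditions in the definition.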

Appendix G Data Flow-to-Policy Consistency Definitions and Detailed Results

In Section 4.5, we summarize our results on the consistency between data flows and collection statements. Due to space constraints, details are deferred to this appendix.

In this analysis, we use the PoliGraph representation of a policy to check its consistency with the actual data collection practices of the mobile app, as observed in the network traffic it generates. This application has been previously explored by PoliCheck for mobile apps [10] and by other works that adapt PoliCheck to different app ecosystems and platforms, e.g., smart speakers [12, 13] and VR headsets [6]. As in prior work [10, 5], we define the concepts of a data flow and its consistency with the policy text under the PoliGraph framework.

Definition G.1.

Data Flow. A data flow f is a tuple ⟨e, d⟩, where d is the data type that is sent to an entity e.

Given a PoliGraph (or Ext-PoliGraph), we check whether the data flow is disclosed in it.

Definition G.2.

Consistent Data Flow. Following Definitions 2.1 and 2.7, let G be a PoliGraph and G_ext its respective Ext-PoliGraph. A data flow ⟨e, d⟩ is consistent with the collection statements made in the policy represented by G or G_ext iff the graph contains an entity e′ (where e′ = e or subsume(e′, e)) and a data type d′ (where d′ = d or subsume(d′, d)), and collect(e′, d′) is true in G or G_ext.
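A rough sketch of this check, under the same toy subsumption model and illustrative inputs as before (not the paper's implementation), might look as follows.

```python
# Illustrative sketch of Definition G.2: a data flow (entity, data_type) is
# consistent if the graph has a COLLECT edge between some equal-or-broader
# entity and some equal-or-broader data type. Inputs are hypothetical.

def subsume(onto, a, b):
    """True iff a subsumes b (reflexively) under the given ontology edges."""
    if a == b:
        return True
    return any(subsume(onto, c, b) for c in onto.get(a, []))

def flow_is_consistent(flow, collect_edges, entity_onto, data_onto):
    e, d = flow
    return any(
        subsume(entity_onto, e_prime, e) and subsume(data_onto, d_prime, d)
        for e_prime, d_prime in collect_edges
    )

# Exact match: the policy names the entity and data type directly.
edges = {("we", "email address")}
print(flow_is_consistent(("we", "email address"), edges, {}, {}))  # True

# Match via subsumption: the policy only names a broader data category.
edges = {("we", "contact information")}
data_onto = {"contact information": ["email address"]}
print(flow_is_consistent(("we", "email address"), edges, {}, data_onto))  # True
```

The first case needs only the policy's own text, while the second relies on an ontology edge; this distinction is what separates clear from vague disclosures in the comparison below.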

To facilitate a comparison with PoliCheck, we use its dataset [10] (see Section 4.1). In addition to policy URLs for apps, this dataset contains data flows extracted from the apps’ network traffic. Note that to perform the flow-to-policy consistency analysis, PoliCheck compares these data flows against the collection tuples extracted from the policy text (see Figure 1(b)); using PoliGraph, we compare the data flows against the similar data collection information captured in COLLECT relationships. For a fair comparison, we use the historical version of the policies in the dataset (see Appendix E). In total, we have 9,332 apps with both data flows and policies available.

We note that a consistent data flow in a PoliGraph corresponds to a clear disclosure in PoliCheck’s terminology, since both mean that the policy clearly declares the collection of the data type by the entity within the policy text. Similarly, a consistent data flow in an Ext-PoliGraph is similar to PoliCheck’s vague disclosure, as both indicate the use of external knowledge, namely the global ontologies, to complement the definitions of generic terms.


Figure 11: Flow-to-policy consistency comparison of PoliGraph vs. PoliCheck. The numbers in parentheses represent the number of apps that collect the data type, e.g., “phone number (85)” means 85 apps collect “phone number”.

Figure 11 compares the flow-to-policy consistency results per data type for PoliGraph and PoliCheck. An app may send one data type to multiple entities, resulting in multiple data flows per data type. In this case, we report the worst disclosure type for the app, e.g., if at least one of the data flows of the data type is inconsistent, the disclosure type is reported as inconsistent. We present the results for nine data types: the remaining three of the 12 data types analyzed in [10] are excluded because fewer than 10 apps exhibit them in their data flows.
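The worst-of aggregation described above can be sketched in a few lines. The severity ordering below is an assumption drawn from the text (inconsistent is worst, clear is best); the "omitted" label and its rank are hypothetical placeholders for any intermediate disclosure types.

```python
# Sketch of the per-app aggregation: when an app sends one data type to
# multiple entities, report the worst disclosure among its flows.
# Severity ranks are assumed from the text; "omitted" is a placeholder.

SEVERITY = {"clear": 0, "vague": 1, "omitted": 2, "inconsistent": 3}

def worst_disclosure(flow_labels):
    """flow_labels: per-flow disclosure labels for one (app, data type)."""
    return max(flow_labels, key=SEVERITY.get)

print(worst_disclosure(["clear", "vague"]))                  # vague
print(worst_disclosure(["clear", "inconsistent", "vague"]))  # inconsistent
```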

Clear disclosures

We find that PoliGraph-er performs better than PoliCheck in terms of capturing clear disclosures. PoliCheck underestimates the number of clear disclosures for all data types due to its limited recall. Notably, the collection of contact information, including “phone number” and “email address”, is clearly disclosed in more than 50% of the apps.

Vague disclosures

Here, PoliGraph-er extracts fewer vague disclosures than PoliCheck. Further investigation reveals that this is because our global data ontology has a different design from PoliCheck’s. In PoliCheck’s data ontology, “personal information”, a term commonly seen in policies, subsumes “device identifiers”. While this effectively increases PoliCheck’s coverage, because the collection of any data type related to “device identifiers” found in the data flows is then categorized as a vague disclosure, it is unclear whether “device identifiers” can be strictly categorized as “personal information”. Indeed, many policies do not consider “device identifiers” to be “personal information”.


Figure 12: Purposes inferred from Ext-PoliGraphs. Some apps disclose the collection of the data types for core or non-core functionality, or both. We ignore apps whose policies do not disclose purposes, so the total number of apps (listed in parentheses) is less than the number of apps in Figure 11.

Purposes

Inferring purposes from the network traffic would be limited to only a few purposes [43] (for example, while advertising as a purpose can be inferred from a network packet’s destination entity, it would not be straightforward to infer legal as a purpose), and the inferred purposes often do not align well with the definitions given in the policies [5]. Thus, instead of directly checking the consistency between the purposes extracted from the collection statements in the policies and those inferred from the data flows in the network traffic, as in PurPliance [5], we adopt the methodology used in [6], which maps each data flow to its corresponding collection statement that has a purpose. This is captured in the corresponding PoliGraph as a COLLECT edge assigned a list of attributes that contain purposes (see Section 2).

We are able to extract purposes for the data flows of 1,079 apps based on the PoliGraphs of their policies. We classify the data flows into the five categories of purposes discussed in Section 3.3, and further group them into core and non-core purposes: services, security, and legal are grouped as core purposes, whereas analytics and advertising are grouped as non-core purposes. Figure 12 shows the purpose classification results for the data flows. In total, 731 apps declare that their data flows are for core purposes, whereas 686 apps declare that their data flows are for non-core purposes. While the collection of “advertising id” (and other similar identifiers) for non-core purposes (i.e., advertising) can be acceptable, Figure 12 also shows that many apps collect sensitive data types, e.g., “phone number”, “email address”, and “geolocation”, for non-core purposes, which is alarming.
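The core vs. non-core grouping above is straightforward; a minimal sketch, using the five category names from the text, could look like this. The function name and return convention are our own.

```python
# Sketch of the core vs. non-core grouping of purpose categories.
# The category-to-group mapping follows the text; everything else
# (function name, return type) is an illustrative choice.

CORE = {"services", "security", "legal"}
NON_CORE = {"analytics", "advertising"}

def group_purposes(purposes):
    """Return which purpose groups a data flow's disclosed purposes hit."""
    groups = set()
    if CORE & set(purposes):
        groups.add("core")
    if NON_CORE & set(purposes):
        groups.add("non-core")
    return groups

print(sorted(group_purposes(["services", "advertising"])))  # ['core', 'non-core']
```

A flow disclosed for both "services" and "advertising" lands in both groups, which is why the core (731) and non-core (686) app counts above can overlap.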