Privacy Policy Question Answering Assistant: A Query-Guided Extractive Summarization Approach

by Moniba Keymanesh, et al.
The Ohio State University

Existing work on making privacy policies accessible has explored new presentation forms, such as color-coding based on risk factors or summarization, to assist users with conscious agreement. To facilitate a more personalized interaction with the policies, in this work we propose an automated privacy policy question answering assistant that extracts a summary in response to the input user query. This is a challenging task because users articulate their privacy-related questions in a language very different from the legal language of the policy, making it difficult for the system to understand their inquiry. Moreover, existing annotated data in this domain are limited. We address these problems by paraphrasing to bring the style and language of the user's question closer to the language of privacy policies. Our content scoring module uses the existing in-domain data to find relevant information in the policy and incorporates it in a summary. Our pipeline is able to find an answer for 89% of the user queries in the PrivacyQA dataset.




1 Introduction and Related Work

Online users often do not read or understand privacy policies due to the length and complexity of these unilateral contracts Cranor et al. (2006). This problem can be addressed by utilizing a presentation form that does not result in cognitive fatigue Doan et al. (2018); Wurman et al. (2001) and satisfies the information need of users. To assist users with understanding the content of privacy policies and with conscious agreement, previous computational work on privacy policies has explored using information extraction and natural language processing to create better presentation forms Ebrahimi et al. (2020). For example, PrivacyGuide Tesfay et al. (2018) and PrivacyCheck Zaeem et al. (2018) present an at-a-glance description of a privacy policy by defining a set of privacy topics and assigning a risk level to each topic. Harkous et al. (2018) and Nejad et al. (2019) used information extraction and text classification to create a structured and color-coded view of the risk factors in the privacy policy. Manor and Li (2019) and Keymanesh et al. (2020) explored presenting the risky data practices in privacy policies in the form of a natural language summary. While great progress has been made in creating more user-friendly presentation forms for the policies, users often care about only a subset of these issues or have a personal view of what is considered risky. Instead of presenting an overview or summary of privacy policies, an alternative approach is to allow users to ask questions about the issues that they care about and present an answer extracted from the content of the policies Ravichander et al. (2019a). This facilitates a more personal approach to privacy and enables users to review only the sections of the policy that they are most concerned about.

In this work, we take a step toward building an automated privacy policy question-answering assistant. We propose to extract an output summary in response to the user query. Our task is related to guided and controllable text summarization Kryściński et al. (2019); Dang and Owczarzak (2008); Keymanesh et al. (2021); Fan et al. (2017); Sarkhel et al. (2020) as well as reading comprehension He et al. (2020). However, a few application-imposed constraints make this task more challenging than the traditional evaluation setup of reading comprehension systems. First, users tend to pose questions to the privacy policy question-answering system that are irrelevant or out-of-scope (e.g. ‘how many data breaches did you have in the past?’), subjective (e.g. ‘how do I know this app is legit?’), or too specific to answer using the privacy policy (e.g. ‘does it have access to financial apps I use?’) Ravichander et al. (2019a). Moreover, even answerable user questions can have a very different style and language in comparison to the legal language used in privacy policies Ravichander et al. (2019b), making it difficult for the automated assistant to identify the user’s intent and find the relevant information in the document. This issue of domain shift is exacerbated by the difficulty of annotating data for this domain. Because the existing datasets for this task are fairly small Ahmad et al. (2020), the problems cannot be solved by simply training a supervised model.

We address the language mismatch by using query expansion and paraphrasing to bring the style, language, and specificity of the user’s question closer to the language of privacy policies. To do so, we use lexical substitution and back-translation. Next, using the expanded query set, we compute a relevance score and an informativeness score for each segment of the privacy policy using a transformer-based language representation model fine-tuned on in-domain data. Finally, we incorporate the top-scored segments in the form of a summary. We show that, using a few in-domain datasets annotated for slightly different tasks, we are able to extract a relevant summary for 89% of the user queries in the PrivacyQA dataset. We discuss our proposed hybrid summarization pipeline in Section 2. We introduce the datasets we use for training and testing different modules of our model in Section 3. Finally, we present our experiments and results in Section 4.

2 Proposed Query-Guided Extractive Summarization Pipeline

Figure 1: Overview of the proposed pipeline

In this section, we discuss our proposed pipeline. Our query-guided extractive summarization pipeline includes three main components. Given a privacy policy document and a user query, the first component, the query expansion module, processes the user query and generates a set of paraphrases whose language, style, and specificity are closer to the content of the privacy policy. Next, given the query and the paraphrase set, the content scoring module computes two scores, relevance and informativeness, for each segment of the policy. Lastly, the two scores are combined over the expanded query set to obtain the final answerability score for each segment. Segments are then ranked based on the answerability score, and the top-ranked items are shown to the user in the form of a summary. An overview of the pipeline as well as an input-output example is shown in Figure 1. Next, we discuss the motivation behind including each component in the pipeline and explain them in more detail.

2.1 Query Expansion

Question-answering systems are very sensitive to the many different ways in which the same information need can be articulated Dong et al. (2017). As a result, small variations in semantically similar queries can yield different answers. This is especially a challenge in building a question-answering assistant for privacy policies. Often, users are not very good at articulating their privacy-related inquiries and use a style and language that is very different from the legal language used in the privacy policies Ravichander et al. (2019a).

Query expansion by paraphrasing has been used in the past to improve the performance of QA-based information retrieval Zukerman and Raskutti (2002); Riezler et al. (2007); Azad and Deepak (2019). We employ several methods from the literature, testing their applicability to this domain and in particular to the issues caused by the mismatch between external training resources, user queries, and the privacy policies themselves. To increase the diversity and coverage of the generated paraphrases, we employ methods based on lexical substitution McCarthy and Navigli (2009); Jin et al. (2018) and neural machine translation (NMT) Mallinson et al. (2017); Sutskever et al. (2014). Note that the paraphrase generation module is independent of the neural content scoring module and thus any method can be used to generate paraphrases. Below, we discuss the three methods used for generating query paraphrases.

2.1.1 Lexical Substitution

Lexical substitution can be done by simply replacing a word with an appropriate synonym or paraphrase in a way that does not change the meaning. For example, the sentence ‘what information is collected about me?’ can be written as ‘what information is collected about the user?’. For generating paraphrases we employ two lexical substitution methods: 1) replacement with similar words based on Word2Vec representations Le and Mikolov (2014) and 2) a collection of hand-crafted lexical replacement rules aimed at bringing the style and language of user queries closer to the legal language in the privacy policies. (Footnote 1: We also tried using WordNet Miller (1998) for lexical substitution. However, our preliminary experiments suggest that the majority of paraphrases generated using this method are not meaningful, and thus a more rigorous filtering mechanism would be needed to identify useful paraphrases. We therefore decided not to use WordNet.)

Word2Vec: we train the Word2Vec model on a corpus of 150 privacy policies collected by Keymanesh et al. (2020) to learn word representations. To create paraphrases, we substitute nouns and verbs in user queries with the top 5 most similar words in the embedding space that have the same part of speech.
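The substitution step can be illustrated with a minimal sketch. It assumes toy two-dimensional vectors and hand-assigned part-of-speech tags in place of the trained Word2Vec model and a real tagger, and it assumes that one word is replaced at a time; all names (`VECTORS`, `POS`, `similar_same_pos`, `substitute`) are hypothetical and chosen only for illustration.

```python
import math

# Toy 2-d "embeddings" and part-of-speech tags; both are made up for illustration
# and stand in for a Word2Vec model trained on privacy policies and a POS tagger.
VECTORS = {
    "collect": (1.0, 0.1),
    "gather": (0.9, 0.2),
    "store": (0.7, 0.6),
    "information": (0.1, 1.0),
    "data": (0.2, 0.9),
}
POS = {"collect": "VERB", "gather": "VERB", "store": "VERB",
       "information": "NOUN", "data": "NOUN"}


def cosine(u, v):
    """Cosine similarity between two 2-d vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def similar_same_pos(word, k=5):
    """Return up to k vocabulary words with the same POS tag, most similar first."""
    if word not in VECTORS:
        return []
    candidates = [w for w in VECTORS if w != word and POS[w] == POS[word]]
    candidates.sort(key=lambda w: cosine(VECTORS[word], VECTORS[w]), reverse=True)
    return candidates[:k]


def substitute(query, k=5):
    """Generate paraphrases by replacing one known noun or verb at a time."""
    tokens = query.split()
    paraphrases = []
    for i, tok in enumerate(tokens):
        for alt in similar_same_pos(tok.lower(), k):
            paraphrases.append(" ".join(tokens[:i] + [alt] + tokens[i + 1:]))
    return paraphrases
```

With these toy vectors, `substitute("does it collect data")` yields one paraphrase per substitute word; words outside the toy vocabulary are left untouched.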

Lexical replacement rules: to bring the language and style of user queries closer to the language of privacy policies, we manually create a collection of 50+ lexical substitution rules. For example, our rules can replace the word "my" with "user’s" and "phone" with "device". We test two variations of this approach: (i) single-replacement, in which we only apply a single replacement rule to generate a paraphrase and (ii) all-replacement, in which we apply all possible lexical substitution rules to generate a paraphrase.
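The two rule-based variations can be sketched as follows. Only the "my" and "phone" rules come from the text; the third rule, the whole-word regular-expression matching, and the function names are assumptions made for illustration.

```python
import re

# The first two rules come from the text; the third is a made-up example.
RULES = [("my", "user's"), ("phone", "device"), ("app", "application")]


def _apply(rule, text):
    """Apply one whole-word, case-insensitive replacement rule."""
    source, target = rule
    return re.sub(rf"\b{re.escape(source)}\b", target, text, flags=re.IGNORECASE)


def single_replacement(query):
    """One paraphrase per applicable rule, each applying only that rule."""
    out = []
    for rule in RULES:
        candidate = _apply(rule, query)
        if candidate != query:
            out.append(candidate)
    return out


def all_replacement(query):
    """One paraphrase with every applicable rule applied.

    Returns an empty list when nothing changes or when the result would
    duplicate a single-replacement paraphrase (only one rule applies).
    """
    candidate = query
    for rule in RULES:
        candidate = _apply(rule, candidate)
    if candidate == query or candidate in single_replacement(query):
        return []
    return [candidate]
```

For a query such as "is my location shared", only one rule applies, so in this sketch the all-replacement variation returns no new paraphrase.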

2.1.2 NMT-based Paraphrase Generation

One of the well-known approaches for paraphrase generation is bilingual pivoting Bannard and Callison-Burch (2005); Jia et al. (2020); Jin et al. (2018); Dong et al. (2017). In this approach, a bilingual parallel corpus is used for learning paraphrases based on techniques from phrase-based statistical machine translation Koehn et al. (2003). Intuitively, two sentences in a source language that translate to the same sentence in a target language can be assumed to have the same meaning. Mallinson et al. (2017) show how the bilingual pivoting method can be ported into NMT and present a paraphrasing method purely based on neural networks. In our work, we use German as our pivot language following Mallinson et al. (2017), who suggest that it outperforms other languages in several paraphrasing experiments. We employ a simple back-translation method to automatically create paraphrases for user queries: we use Google Translate, a mature and publicly available online service, to translate user queries from English to German and back from German to English.
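The back-translation step can be sketched independently of any particular translation service. In the sketch below, `translate(text, source, target)` is a hypothetical callable, not a real API, and `toy_translate` is a lookup-table stand-in used only to make the example runnable.

```python
def back_translate(query, translate, pivot="de"):
    """Round-trip a query through a pivot language.

    `translate(text, source, target)` is a hypothetical callable standing in
    for whatever translation service is used; it is not a real API.
    Returns the paraphrase, or None when the round trip reproduces the query.
    """
    pivoted = translate(query, "en", pivot)
    paraphrase = translate(pivoted, pivot, "en")
    if paraphrase.strip().lower() == query.strip().lower():
        return None  # the round trip produced no novel text
    return paraphrase


# Toy stand-in translator backed by a lookup table, for demonstration only.
_TABLE = {
    ("en", "de", "who can see my data?"): "wer kann meine daten sehen?",
    ("de", "en", "wer kann meine daten sehen?"): "who can view my data?",
}


def toy_translate(text, source, target):
    """Return the table entry if present, otherwise the text unchanged."""
    return _TABLE.get((source, target, text.lower()), text)
```

Returning `None` for an unchanged round trip mirrors the observation that back-translation does not always produce a new paraphrase.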

2.2 Content Scoring

Given the segment set $S$ of the privacy policy, a user query $q$, and the paraphrase set $P_q$ obtained using the methods explained in Section 2.1, we aim to extract the segments of the privacy policy that fully or partially answer at least one of the paraphrased questions and incorporate them in a summary. To do so, for each paraphrase-segment pair $(p, s)$ with $p \in P_q$ and $s \in S$, we compute two scores that we call the relevance score $R(p, s)$ and the informativeness score $I(p, s)$. Both scores are computed using BERT Devlin et al. (2018), but we employ different problem formulations, discussed below. We combine these two scores to get the answerability score $A(p, s)$ for the paraphrase-segment pair. Finally, we take the maximum answerability score over the paraphrase set $P_q$ to get the answerability score of segment $s$ for query $q$, which we denote $A(q, s)$. Next, we discuss how these scores are computed.

Relevance Score: To compute the relevance score $R(p, s)$ for a paraphrase-segment pair $(p, s)$, we formulate a sentence-pair classification task: given a question $p$ and a segment $s$, the goal is to predict whether $s$ is relevant to $p$. We rely on a transformer-based language representation model Devlin et al. (2018) pretrained on legal contracts called legal-bert Chalkidis et al. (2020); we use the base BERT variant pretrained on contracts. We fine-tune this model for the sentence-pair classification task on the train set of the PrivacyQA dataset proposed in Ravichander et al. (2019a) for 3 epochs. PrivacyQA is a corpus of privacy policy segments annotated as "relevant" or "irrelevant" for a set of user queries. We further discuss this dataset in Section 3. During fine-tuning, we pass each question-segment pair as a single sequence in which the question and the segment are separated by the special token [SEP] and use different segment embeddings. We also add the special token [CLS] at the beginning and a [SEP] token at the end of the sequence. We use weighted binary cross-entropy as our loss function and update the encoder weights during fine-tuning. The final hidden vector for the first input token, [CLS], is fed to the output layer for the relevance classification task. We use the fine-tuned model to get the posterior probability of relevance for each paraphrase-segment sequence.
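Because relevant segments are rare relative to irrelevant ones (Table 1), a class-weighted loss is a natural choice. The following is a minimal sketch of a weighted binary cross-entropy, assuming per-class weights `pos_weight` and `neg_weight`; the text does not state the weight values used, so the ones in the example are illustrative only.

```python
import math


def weighted_bce(probs, labels, pos_weight=1.0, neg_weight=1.0, eps=1e-12):
    """Mean weighted binary cross-entropy.

    probs  : predicted probability that each pair is relevant
    labels : 1 for "relevant", 0 for "irrelevant"
    The weights are illustrative; the text does not state the values used.
    """
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
        if y == 1:
            total += -pos_weight * math.log(p)
        else:
            total += -neg_weight * math.log(1.0 - p)
    return total / len(labels)
```

Setting `pos_weight` above `neg_weight` penalizes a missed relevant segment more heavily than a false alarm, which fits a setting where recall matters more than precision.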

Informativeness Score: Even if a segment of the policy is relevant to a question, it might not fully answer it. To account for this, we also train a span-detection QA system similar to those used for SQuAD question answering Rajpurkar et al. (2016). In preliminary experiments, we find that this system does not always extract spans that are legible enough on their own for presentation to the user; this is partly due to the complex, contextually sensitive language used in the contracts. However, we do find that the system’s ability to find a promising span provides another indication that the text segment contains a potential answer. Thus, for each paraphrase-segment pair $(p, s)$ we compute an informativeness score $I(p, s)$, which measures how informative $s$ is in answering $p$. To compute this score, we fine-tune legal-bert Chalkidis et al. (2020) for the question-answering task on the train set of the PolicyQA dataset Ahmad et al. (2020). This dataset contains reading-comprehension style question and answer pairs from a corpus of privacy policies. We further discuss this dataset in Section 3. We refer to the legal-bert model fine-tuned for question answering as the "answer-detector" module in our experiments in Section 4.1. During fine-tuning, we feed a query and a segment of the policy as a packed sequence separated by the special token [SEP], with the question and the segment using different segment embeddings. In addition, a start vector $V_{start}$ and an end vector $V_{end}$ are introduced during the fine-tuning process. For each token $i$ in the sequence, with final hidden vector $T_i$, two probabilities are computed: (i) the probability of word $i$ being the start of the answer span and (ii) the probability of word $i$ being the end of the answer span. To compute the start-of-answer-span probability we compute the dot product of the token vector $T_i$ and the start vector $V_{start}$, followed by a softmax over all tokens in the segment. A similar formula, using $V_{end}$, is used to compute the end-of-answer-span probability for each token. The training objective during fine-tuning is to maximize the sum of the log-likelihoods of the correct start and end positions. The informativeness score of the span from position $i$ to position $j$ is defined as $V_{start} \cdot T_i + V_{end} \cdot T_j$, where $j \geq i$. We represent the informativeness score of the segment $s$ with respect to the paraphrase $p$ with $I(p, s)$ and compute it by taking the maximum score of the spans within the segment:

$$I(p, s) = \max_{i, j} \left( V_{start} \cdot T_i + V_{end} \cdot T_j \right)$$

where $j \geq i$.
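The maximum over spans can be found in a single pass. The sketch below assumes, following the standard BERT span-prediction formulation, that a span's score is the sum of a per-token start score and a per-token end score; `start_scores` and `end_scores` are hypothetical precomputed lists, one entry per token of the segment.

```python
def best_span(start_scores, end_scores):
    """Return (score, i, j) maximizing start_scores[i] + end_scores[j] with i <= j."""
    best = (float("-inf"), -1, -1)
    best_start, best_start_idx = float("-inf"), -1
    for j, end in enumerate(end_scores):
        # Track the best start position seen so far, which guarantees i <= j.
        if start_scores[j] > best_start:
            best_start, best_start_idx = start_scores[j], j
        score = best_start + end
        if score > best[0]:
            best = (score, best_start_idx, j)
    return best
```

Keeping a running best start position avoids enumerating all pairs of positions while still respecting the constraint that a span cannot end before it starts.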

2.3 Answer Ranking and Selection

Finally, to compute the answerability score $A(p, s)$ for each paraphrase-segment pair $(p, s)$, we simply sum the relevance score $R(p, s)$ and the informativeness score $I(p, s)$:

$$A(p, s) = R(p, s) + I(p, s)$$

(Footnote 4: We also tried training a regression model using $R(p, s)$ and $I(p, s)$ as inputs and the relevance labels from PrivacyQA as the target variable. However, reusing PrivacyQA labels seems to result in over-fitting. Thus, we decided to combine the scores by simply summing them.)

$A(p, s)$ is computed for all paraphrase-segment pairs $(p, s)$. However, both lexical substitution and back-translation can generate paraphrases that are semantically or syntactically incorrect. To discard the less useful paraphrases, for each query-segment pair $(q, s)$ we take the maximum answerability score obtained in the previous step by the paraphrases of user query $q$:

$$A(q, s) = \max_{p \in P_q} A(p, s)$$

(Footnote 5: We also tried a variation in which we compute the average answerability score over the paraphrase set $P_q$. However, our experiments indicated that computing the maximum is more effective, as it discards the less useful paraphrases.)

Here $P_q$ represents the set of paraphrases generated for user query $q$. Finally, we rank segments based on their answerability score $A(q, s)$ with respect to the input user query $q$. The final query-guided summary is built by concatenating the top-ranked segments. In our experiments in Section 4 we show the results of including the top 5 and top 10 ranked segments in the summary. Our query-guided extractive summarization pipeline is shown in Figure 1. In the next section, we introduce the datasets used for fine-tuning and testing our proposed pipeline.
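The combination and ranking steps can be sketched as follows, assuming the relevance and informativeness scores have already been computed for every paraphrase-segment pair. The input layout (a dictionary from segment id to a list of score pairs) and the function name are hypothetical.

```python
def rank_segments(scores, top_k=5):
    """Rank segments for one query.

    scores maps segment id -> list of (relevance, informativeness) pairs,
    one pair per paraphrase of the query.
    Returns the top_k segment ids and the per-segment answerability scores.
    """
    # Sum the two scores per paraphrase, then keep the best paraphrase per segment.
    answerability = {
        seg: max(rel + inf for rel, inf in pairs)
        for seg, pairs in scores.items()
    }
    ranked = sorted(answerability, key=answerability.get, reverse=True)
    return ranked[:top_k], answerability
```

Taking the maximum rather than the mean means a single well-formed paraphrase is enough to surface a segment, while malformed paraphrases with low scores are effectively ignored.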

3 Datasets

Split | #Questions | #Policies | #Out-of-scope questions | Avg. passages per question | Avg. relevant passages per question
Train | 1350 | 27 | 425 | 137.1 | 5.2
Test | 400 | 8 | 34 | 155.3 | 15.5
Table 1: Statistics of the PrivacyQA dataset, where # denotes the number of questions, policies, and out-of-scope questions. Out-of-scope questions are questions for which no segment in the policy is annotated as relevant. We also report the average number of annotated passages/segments and the average number of relevant segments for each question.
Split | #Questions | #Policies | #Q&A pairs | Avg. question length | Avg. passage length | Avg. answer length
Train | 693 | 75 | 17,056 | 11.2 | 106.0 | 13.3
Valid | 568 | 20 | 3,809 | 11.2 | 96.6 | 12.8
Test | 600 | 20 | 4,152 | 11.2 | 119.1 | 14.1
Table 2: Statistics of the PolicyQA dataset, where # denotes the number of questions, policies, and Q&A pairs. We also report the average number of words in questions, passages/segments, and answer spans.

We rely on three publicly available datasets for training and testing different modules in our proposed pipeline. As mentioned earlier, we train the Word2Vec model used for lexical substitution on the set of 150 privacy policies collected by Keymanesh et al. (2020). In addition, we employ two datasets called PrivacyQA Ravichander et al. (2019a) and PolicyQA Ahmad et al. (2020) for fine-tuning the legal-bert model for the sentence-pair classification and question-answering tasks, respectively. PrivacyQA is a sentence-selection style question-answering dataset in which each question is answered with a list of sentences. PolicyQA, on the other hand, is a reading-comprehension style question-answering dataset in which a question is answered with a sequence of words. Next, we introduce these datasets in more detail.

PrivacyQA: Ravichander et al. (2019a) asked each crowd worker in their study to formulate 5 privacy questions about the privacy policies of a set of 35 mobile applications. The crowd workers were only exposed to the public information about each company. In addition, they were not required to read the privacy policies to formulate their questions. Thus, this dataset presents a more realistic view of the types of questions that are likely to be posed to an automated privacy policy question-answering assistant. Given the questions formulated by Mechanical Turkers, four experts with legal training annotated paragraphs of the privacy policy as "relevant" or "irrelevant" with respect to each query. We consider a segment of the privacy policy relevant if at least one of the annotators marked it as relevant. The dataset statistics are shown in Table 1. We use the train portion of this dataset for fine-tuning the legal-bert model Chalkidis et al. (2020) for the sentence-pair classification task and computing the relevance score. We use the test set of the PrivacyQA dataset to evaluate our proposed pipeline. We share our results in Section 4.

PolicyQA: This dataset is curated by Ahmad et al. (2020) and contains 25,017 reading-comprehension style question and answer-span pairs extracted from a corpus of 115 privacy policies Wilson et al. (2016). The train portion of this dataset contains 693 human-written questions with an average answer length of 13.3 words. To curate this dataset, two domain experts used the triple annotations {Practice, Attribute, Value} from the OPP-115 dataset Wilson et al. (2016) to come up with the questions. For instance, given the triple annotation {First Party Collection/Use, Personal Information Type, Contact} and the corresponding answer span “name, address, telephone number, email address”, the annotators formulated questions such as "What type of contact information does the company collect?" and "Will you use my contact information?". Note that during the annotation process, the domain experts were asked to formulate questions given the content of the privacy policy. Therefore, PolicyQA questions are less diverse than PrivacyQA questions and do not fully reflect a real-world user interaction with a privacy policy question-answering assistant. Thus, we only use this dataset for fine-tuning the legal-bert model Chalkidis et al. (2020) for the question-answering task and computing the informativeness score. We do not use this dataset for evaluation. The statistics of the PolicyQA dataset are shown in Table 2.

4 Experiments and Results

Metric | Rule-based (one) | Rule-based (all) | Back-translation | Word2Vec | All
Average #paraphrases | 1.4 | 0.4 | 0.9 | 2.9 | 5.7
%Retrieved relevant segments | 34.5 | 19.8 | 25.4 | 54.0 | 67.5
%Answerable paraphrases | 24.2 | 31.3 | 25.4 | 28.9 | 27.3
Table 3: The average number of paraphrases generated per query by each method; the percentage of previously non-answerable query-segment pairs for which at least one paraphrase generated by the method was answerable (%Retrieved relevant segments); and the percentage of the paraphrases generated by each method that were answerable (%Answerable paraphrases).

In this section, we present our experimental results. As stated earlier, given a query and a privacy policy, the first module of our framework, the query expansion module, brings the style and language of the user queries closer to the language of the privacy policy by paraphrasing. Next, given the paraphrases for the input query, the content scoring and answer selection modules retrieve the most relevant snippets of the privacy policy in the form of a summary. In our experiments, we aim to answer the following questions: (i) Does the query expansion module generate paraphrases whose language is closer to the privacy policy than that of the input query? (ii) If so, what proportion of the generated paraphrases are more answerable than the input user query? (iii) Does the proposed pipeline succeed in retrieving the relevant sections of the privacy policy in answer to the user queries? (iv) Which modules in our pipeline are essential for finding relevant answers to user queries?

Our experiments presented in Section 4.1 answer questions one and two. Experiments in Section 4.2 answer questions three and four. For our experiments, we rely on the test set of the PrivacyQA dataset, as it presents a more realistic user interaction with a privacy policy assistant. This dataset is introduced in Section 3.

4.1 Query Expansion Results

Setup | F@5 | F@10 | P@5 | P@10 | MRR
Full Pipeline | 80.6 | 89.0 | 39.4 | 32.8 | 0.59
w/o Query Expansion | 80.2 | 87.9 | 38.4 | 32.7 | 0.59
w/o Query Expansion and Answer-detector | 78.3 | 86.6 | 41.1 | 33.8 | 0.63
Table 4: The performance of different variations of our model in retrieving the relevant segments of the policy in response to user queries. F@K represents the percentage of queries for which at least one relevant segment was found within the top K ranked items. P@K and MRR represent precision at K (in percent) and mean reciprocal rank.

To answer our first and second questions regarding the quality and answerability of the generated paraphrases, we use lexical substitution methods and back-translation for expanding user queries in the test set of the PrivacyQA dataset. These approaches are introduced in Section 2.1. The average number of paraphrases generated by each approach is presented in Table 3. On average, using these methods, we can create 5.7 paraphrases for each query. The two variations of the rule-based approach, single-replacement and all-replacement, generate 1.4 and 0.4 paraphrases on average. Note that for some queries only one substitution rule can be applied, and thus the all-replacement variation does not generate any new paraphrases. The back-translation method creates 0.9 paraphrases on average; this method does not always generate novel text. (Footnote 6: In this work we only use the NMT architecture used by Google Translate and the German language. Using more architectures or more target languages can expand the pool of generated paraphrases.) Word2Vec generates more paraphrases than the other methods, 2.9 on average.

To measure the language similarity between a paraphrase and a segment, we conduct the following experiment. We hypothesize that the answer-detector model introduced in Section 2.2 can successfully detect the answer span within the relevant segment of the policy if the query has a similar language and style to the privacy policy text. Note that in this problem recall is more important than precision: being able to extract all the relevant information from the policy is more crucial than avoiding the inclusion of a few irrelevant sentences in the summary. Our experimental design reflects this domain-imposed requirement. We pass each paraphrase-segment pair that is annotated as "relevant" to the answer-detector model and save the extracted answer span. (Footnote 7: We exclude 34 queries for which there was no relevant information in the policy, i.e. out-of-scope questions.) In cases where the paraphrase and segment do not have a similar language, the model typically returns no answer. (Footnote 8: This includes an empty sequence or a special token.) In our experiments, we observe that for 342 pairs of an initial user query and a relevant segment of the policy (around 5.5% of all pairs), the answer-detector model cannot find the answer span. We interpret this as the user queries having a different style and language from the policy text. To answer our first question, we measure the percentage of these non-answerable cases for which the expansion method generated an answerable paraphrase. This is shown in Table 3 as the percentage of retrieved relevant segments. As shown in the table, the rule-based approach (single-replacement) and Word2Vec are able to generate at least one answerable paraphrase for 34.5% and 54% of the previously non-answerable cases. Note that these two methods on average generate more paraphrases for each query, while back-translation cannot change the language and style of the user query in some cases. We also observe that 67.5% of all the non-answerable cases could be answered by at least one of the query expansion approaches. Thus, for better recall, we include all the expansion methods in our pipeline and filter out the non-answerable paraphrases in the next step.

To answer our second question regarding the quality of the generated paraphrases, we report in Table 3 the percentage of the paraphrases generated by each method that were answerable. Note that different expansion methods generate different numbers of paraphrases. As presented in the table, 27.3% of all the paraphrases were answerable. The all-replacement variation of the rule-based method and Word2Vec generate better-quality paraphrases than back-translation (31.3% and 28.9% in comparison to 25.4%). We conjecture that this is due to domain mismatch. The NMT architecture generates high-quality paraphrases, but it is trained on out-of-domain data and therefore has no bias to restate the query in a way that makes it match the privacy policies better. On the other hand, the domain-guided rule-based model and the word vectors may be less fluent in paraphrasing, but they are built from in-domain knowledge and data, which allows them to generate better matches.

4.2 Query-guided Summarization Results

To answer our third and fourth questions, we evaluate the performance of our pipeline in retrieving the relevant segments of the policy given a user query. Following query expansion, we use the content scoring module to generate the relevance score and informativeness score for each paraphrase-segment pair. This process is discussed in Section 2.2. Finally, the answer ranking module combines these scores for the entire paraphrase set and ranks the segments based on their final answerability score.

Since we do not have reference summaries for this dataset, we do not use conventional summarization metrics Lin (2004); Zhang et al. (2019); Sellam et al. (2020) to evaluate our results; rather, we evaluate the ability of the pipeline to retrieve the "relevant" segments of the policy within the top-ranked items. We rely on metrics used for the evaluation of information retrieval systems. We report the precision@k (P@K) and the mean reciprocal rank (MRR) of the retrieved relevant segments in Table 4. Precision@k indicates the fraction of the top k ranked items that are relevant. We report the average value across queries in the test set. The reciprocal rank of a query is the multiplicative inverse of the position of the first relevant segment in the resulting ranking, and MRR is its mean across queries. A perfect ranking system achieves an MRR of 1 by always ranking a relevant segment in the first position. We also report the percentage of queries in our test set for which at least one relevant snippet was listed in the top k retrieved items (shown as F@K in the table). We observe that our model is able to retrieve a relevant answer for 80.6% of the queries within the top 5 results and 89% of the queries within the top 10 results. Note that 8.5% of the queries in the test set of PrivacyQA are out-of-scope and cannot be answered solely based on the content of the policy. We also observe that on average, 39.4% of the top 5 ranked passages are actually relevant. Increasing the summarization budget to include the top 10 retrieved passages decreases the precision to 32.8%. The MRR of the full pipeline is 0.59, indicating that on average the first relevant item appears in the second position in the ranking or higher.
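The three metrics can be sketched as below, assuming each query is represented by a ranked list of 0/1 relevance flags (1 if the segment at that rank is annotated as relevant). The function names and the choice to divide precision by k even when fewer than k items are ranked are assumptions of this sketch.

```python
def found_at_k(ranked_relevance, k):
    """1 if any of the top-k items is relevant, else 0 (the F@k indicator)."""
    return int(any(ranked_relevance[:k]))


def precision_at_k(ranked_relevance, k):
    """Fraction of the top-k items that are relevant."""
    return sum(ranked_relevance[:k]) / k


def reciprocal_rank(ranked_relevance):
    """1 / rank of the first relevant item, or 0 if none is relevant."""
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0


def evaluate(queries, k):
    """Average the metrics over queries; F@k and P@k are reported in percent."""
    n = len(queries)
    return {
        "F@k": 100.0 * sum(found_at_k(q, k) for q in queries) / n,
        "P@k": 100.0 * sum(precision_at_k(q, k) for q in queries) / n,
        "MRR": sum(reciprocal_rank(q) for q in queries) / n,
    }
```

In this sketch an out-of-scope query (all flags 0) contributes 0 to every metric, so such queries lower the achievable maximum of F@k.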

To evaluate the contribution of each module in our pipeline, we conduct an ablation study. To test the effect of query expansion, we use the pipeline without this module for finding answers to user queries. We observe that the percentage of queries answered within the top 5 and top 10 ranked items slightly decreases (by 0.4 and 1.1 percentage points, respectively). We conclude that the query expansion module slightly boosts the performance of the model. However, for most queries in the test set the model is already able to find at least one relevant passage without using paraphrases. The advantage of using query expansion is more noticeable when coverage of all relevant information is more critical and the summarization budget is larger.

In our next experiment, in addition to the query-expansion module, we also remove the answer-detector from our pipeline. In this version of our model, passages are ranked based only on the relevance score. We notice that this further decreases F@5 and F@10. However, the precision of the top-ranked items and the MRR slightly improve. We conclude that both the query-expansion and answer-detector components contribute to the ability of our model to find relevant answers to user queries. However, in cases where short answers are desirable (low summarization budget), content scoring based only on the relevancy score would be better.

5 Conclusion and Future Work

In this work, we take a step toward building an automated privacy policy question-answering assistant. This presentation form provides a more personalized interaction with privacy policies than previous approaches. We address two main challenges in this domain: (i) the difference between the language and style of user queries and the legal language of privacy policies, and (ii) limited training resources. To do so, we propose a query-guided summarization pipeline that first uses lexical substitution and back-translation to bring the language and style of the user queries closer to the language of the policies. Next, we use a language representation model fine-tuned on existing in-domain data to compute a relevancy and informativeness score for each segment in the policy with respect to the user query. Finally, the top-ranked passages are presented to the user in the form of a summary.

Our proposed pipeline can successfully find the relevant information in the privacy policy for 89% of the queries in the PrivacyQA dataset. We observed that using a domain-inspired rule-based approach and training word vectors on in-domain data is more effective than an out-of-domain NMT-based paraphrase generation approach for bringing the language and style of user queries closer to the language of the privacy policy. However, for a high-recall retrieval system, it is better to combine several expansion methods. In addition, we observed that relying on existing in-domain resources for building a question-answering assistant provides a sufficiently high-recall retrieval system. However, more resources are required to increase the precision of the ranking system.

Several issues are left for future work. First, the evaluation of the system using Information Retrieval metrics is insufficient to determine its usefulness in practice. Collection of reference summaries and a user study comparing different methods of interaction with privacy policies could help us determine which methods best meet the needs of real-world users. Second, the proposed method answers user queries based solely on the text of the privacy policy, rendering many user queries unanswerable. Using additional legal resources might help to address these out-of-scope questions.


  • W. U. Ahmad, J. Chi, Y. Tian, and K. Chang (2020) PolicyQA: a reading comprehension dataset for privacy policies. arXiv preprint arXiv:2010.02557. Cited by: §1, §2.2, §3, §3.
  • H. K. Azad and A. Deepak (2019) Query expansion techniques for information retrieval: a survey. Information Processing & Management 56 (5), pp. 1698–1735. Cited by: §2.1.
  • C. Bannard and C. Callison-Burch (2005) Paraphrasing with bilingual parallel corpora. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05), pp. 597–604. Cited by: §2.1.2.
  • I. Chalkidis, M. Fergadiotis, P. Malakasiotis, N. Aletras, and I. Androutsopoulos (2020) LEGAL-bert: the muppets straight out of law school. arXiv preprint arXiv:2010.02559. Cited by: §2.2, §2.2, §3, §3.
  • L. F. Cranor, P. Guduru, and M. Arjula (2006) User interfaces for privacy agents. TOCHI. Cited by: §1.
  • H. T. Dang and K. Owczarzak (2008) Overview of the tac 2008 update summarization task. In TAC, Cited by: §1.
  • J. Devlin, M. Chang, et al. (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805. Cited by: §2.2, §2.2.
  • A. Doan, P. Konda, A. Ardalan, J. R. Ballard, S. Das, Y. Govind, H. Li, P. Martinkus, S. Mudgal, E. Paulson, et al. (2018) Toward a system building agenda for data integration (and data science). IEEE Data Eng. Bull. 41 (2). Cited by: §1.
  • L. Dong, J. Mallinson, S. Reddy, and M. Lapata (2017) Learning to paraphrase for question answering. arXiv preprint arXiv:1708.06022. Cited by: §2.1.2, §2.1.
  • F. Ebrahimi, M. Tushev, and A. Mahmoud (2020) Mobile app privacy in software engineering research: a systematic mapping study. Information and Software Technology, pp. 106466. Cited by: §1.
  • A. Fan, D. Grangier, and M. Auli (2017) Controllable abstractive summarization. arXiv preprint arXiv:1711.05217. Cited by: §1.
  • H. Harkous, K. Fawaz, R. Lebret, et al. (2018) Polisis: automated analysis and presentation of privacy policies using deep learning. In 27th USENIX Security Symposium (USENIX Security 18), Cited by: §1.
  • J. He, W. Kryściński, B. McCann, N. Rajani, and C. Xiong (2020) CTRLsum: towards generic controllable text summarization. arXiv preprint arXiv:2012.04281. Cited by: §1.
  • X. Jia, W. Zhou, S. Xu, and Y. Wu (2020) How to ask good questions? try to leverage paraphrases. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 6130–6140. Cited by: §2.1.2.
  • L. Jin, D. King, A. Hussein, M. White, and D. Danforth (2018) Using paraphrasing and memory-augmented models to combat data sparsity in question interpretation with a virtual patient dialogue system. In Proceedings of the thirteenth workshop on innovative use of NLP for building educational applications, pp. 13–23. Cited by: §2.1.2, §2.1.
  • M. Keymanesh, T. Berger-Wolf, M. Elsner, and S. Parthasarathy (2021) Fairness-aware summarization for justified decision-making. arXiv preprint arXiv:2107.06243. Cited by: §1.
  • M. Keymanesh, M. Elsner, and S. Parthasarathy (2020) Toward domain-guided controllable summarization of privacy policies. Natural Legal Language Processing Workshop at KDD. Cited by: §1, §2.1.1, §3.
  • P. Koehn, F. J. Och, and D. Marcu (2003) Statistical phrase-based translation. Technical report UNIVERSITY OF SOUTHERN CALIFORNIA MARINA DEL REY INFORMATION SCIENCES INST. Cited by: §2.1.2.
  • W. Kryściński, N. S. Keskar, B. McCann, C. Xiong, and R. Socher (2019) Neural text summarization: a critical evaluation. arXiv preprint arXiv:1908.08960. Cited by: §1.
  • Q. Le and T. Mikolov (2014) Distributed representations of sentences and documents. In International conference on machine learning, pp. 1188–1196. Cited by: §2.1.1.
  • C. Lin (2004) ROUGE: a package for automatic evaluation of summaries. In Text Summarization Branches Out, Barcelona, Spain, pp. 74–81. Cited by: §4.2.
  • J. Mallinson, R. Sennrich, and M. Lapata (2017) Paraphrasing revisited with neural machine translation. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pp. 881–893. Cited by: §2.1.2, §2.1.
  • L. Manor and J. J. Li (2019) Plain English summarization of contracts. In Proceedings of the Natural Legal Language Processing Workshop 2019, Minneapolis, Minnesota. Cited by: §1.
  • D. McCarthy and R. Navigli (2009) The english lexical substitution task. Language resources and evaluation 43 (2), pp. 139–159. Cited by: §2.1.
  • G. A. Miller (1998) WordNet: an electronic lexical database. MIT press. Cited by: footnote 1.
  • N. M. Nejad, D. Graux, and D. Collarana (2019) Towards measuring risk factors in privacy policies. In ICAIL, Cited by: §1.
  • P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250. Cited by: §2.2.
  • A. Ravichander, A. Black, E. Hovy, J. Reidenberg, N. C. Russell, and N. Sadeh (2019a) Challenges in automated question answering for privacy policies. Cited by: §1, §1, §2.1, §2.2, §3, §3.
  • A. Ravichander, A. W. Black, S. Wilson, T. Norton, and N. Sadeh (2019b) Question answering for privacy policies: combining computational and legal perspectives. arXiv preprint arXiv:1911.00841. Cited by: §1.
  • S. Riezler, A. Vasserman, I. Tsochantaridis, V. O. Mittal, and Y. Liu (2007) Statistical machine translation for query expansion in answer retrieval. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pp. 464–471. Cited by: §2.1.
  • R. Sarkhel, M. Keymanesh, A. Nandi, and S. Parthasarathy (2020) Interpretable multi-headed attention for abstractive summarization at controllable lengths. In Proceedings of the 28th International Conference on Computational Linguistics, pp. 6871–6882. Cited by: §1.
  • T. Sellam, D. Das, and A. P. Parikh (2020) BLEURT: learning robust metrics for text generation. arXiv preprint arXiv:2004.04696. Cited by: §4.2.
  • I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pp. 3104–3112. Cited by: §2.1.
  • W. B. Tesfay, P. Hofmann, T. Nakamura, S. Kiyomoto, and J. Serna (2018) Privacyguide: towards an implementation of the eu gdpr on internet privacy policy evaluation. In IWSPA, Cited by: §1.
  • S. Wilson, F. Schaub, A. A. Dara, F. Liu, et al. (2016) The creation and analysis of a website privacy policy corpus. In ACL, Cited by: §3.
  • R. S. Wurman, L. Leifer, D. Sume, and K. Whitehouse (2001) Information anxiety 2. Que. Cited by: §1.
  • R. N. Zaeem, R. L. German, and K. S. Barber (2018) Privacycheck: automatic summarization of privacy policies using data mining. TOIT. Cited by: §1.
  • T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi (2019) Bertscore: evaluating text generation with bert. arXiv preprint arXiv:1904.09675. Cited by: §4.2.
  • I. Zukerman and B. Raskutti (2002) Lexical query paraphrasing for document retrieval. In COLING 2002: The 19th International Conference on Computational Linguistics, Cited by: §2.1.