Retrieval Enhanced Data Augmentation for Question Answering on Privacy Policies

04/19/2022
by Md Rizwan Parvez, et al.

Prior studies on privacy policies frame the question answering (QA) task as identifying the most relevant text segment or a list of sentences from the policy document for a user query. However, annotating such a dataset is challenging because it requires specific domain expertise (e.g., law academics). Even if a small-scale dataset can be built, a bottleneck remains: the labeled data are heavily imbalanced (only a few segments are relevant), which limits the gain in this domain. Therefore, in this paper, we develop a novel data augmentation framework based on ensembling retriever models that captures the relevant text segments from unlabeled policy documents and expands the positive examples in the training set. In addition, to improve the diversity and quality of the augmented data, we leverage multiple pre-trained language models (LMs) and cascade them with noise reduction oracles. Using our augmented data on the PrivacyQA benchmark, we elevate the existing baseline by a large margin. Ablation studies provide further insights into the effectiveness of our approach.
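The augmentation idea in the abstract (several retrievers propose segments from unlabeled policies, then a filter removes noisy candidates) can be illustrated with a small sketch. This is not the authors' implementation: the two lexical scorers below are toy stand-ins for the pre-trained LM retrievers, the threshold filter is a stand-in for the noise reduction oracle, and every function name and parameter (`token_overlap`, `char_trigram_overlap`, `augment`, `k`) is hypothetical.

```python
from typing import Callable, List, Tuple

# A retriever is modeled here as a scoring function over (query, segment).
Scorer = Callable[[str, str], float]


def token_overlap(query: str, segment: str) -> float:
    """Toy retriever 1 (assumption): Jaccard overlap of whitespace tokens."""
    q, s = set(query.lower().split()), set(segment.lower().split())
    return len(q & s) / len(q | s) if q | s else 0.0


def char_trigram_overlap(query: str, segment: str) -> float:
    """Toy retriever 2 (assumption): Jaccard overlap of character trigrams."""
    def grams(text: str) -> set:
        text = text.lower()
        return {text[i:i + 3] for i in range(len(text) - 2)}

    q, s = grams(query), grams(segment)
    return len(q & s) / len(q | s) if q | s else 0.0


def top_k(scorer: Scorer, query: str, segments: List[str], k: int) -> List[str]:
    """Return the k segments that this scorer ranks highest for the query."""
    return sorted(segments, key=lambda seg: scorer(query, seg), reverse=True)[:k]


def augment(
    query: str,
    segments: List[str],
    retrievers: List[Scorer],
    noise_filter: Callable[[str, str], bool],
    k: int = 1,
) -> List[Tuple[str, str]]:
    """Union the top-k picks of every retriever, then drop filtered candidates."""
    candidates: List[str] = []
    for scorer in retrievers:
        for seg in top_k(scorer, query, segments, k):
            if seg not in candidates:  # keep each segment once
                candidates.append(seg)
    # Surviving (query, segment) pairs would be added as extra positive examples.
    return [(query, seg) for seg in candidates if noise_filter(query, seg)]
```

Taking the union of the retrievers' picks is what adds diversity in this sketch, while the filter trades some of that recall for precision. A real system would replace both toy scorers and the threshold filter with learned models.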


research
10/06/2020

PolicyQA: A Reading Comprehension Dataset for Privacy Policies

Privacy policy documents are long and verbose. A question answering (QA)...
research
05/25/2022

Intermediate Training on Question Answering Datasets Improves Generative Data Augmentation

Manually annotating datasets requires domain experts to read through man...
research
09/16/2023

PDFTriage: Question Answering over Long, Structured Documents

Large Language Models (LLMs) have issues with document question answerin...
research
06/08/2023

Improving Vietnamese Legal Question–Answering System based on Automatic Data Enrichment

Question answering (QA) in law is a challenging problem because legal do...
research
09/14/2023

CATfOOD: Counterfactual Augmented Training for Improving Out-of-Domain Performance and Calibration

In recent years, large language models (LLMs) have shown remarkable capa...
research
09/29/2021

Privacy Policy Question Answering Assistant: A Query-Guided Extractive Summarization Approach

Existing work on making privacy policies accessible has explored new pre...
