Sanitizing Synthetic Training Data Generation for Question Answering over Knowledge Graphs

09/10/2020
by   Trond Linjordet, et al.

Synthetic data generation is important to training and evaluating neural models for question answering over knowledge graphs. The quality of the data and the partitioning of the datasets into training, validation and test splits impact the performance of the models trained on this data. If the synthetic data generation depends on templates, as is the predominant approach for this task, there may be a leakage of information via a shared basis of templates across data splits if the partitioning is not performed hygienically. This paper investigates the extent of such information leakage across data splits, and the ability of trained models to generalize to test data when the leakage is controlled. We find that information leakage indeed occurs and that it affects performance. At the same time, the trained models do generalize to test data under the sanitized partitioning presented here. Importantly, these findings extend beyond the particular flavor of question answering task we studied and raise a series of difficult questions around template-based synthetic data generation that will necessitate additional research.
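The leakage described above can be illustrated with a minimal sketch of a template-aware partition. This is not the authors' procedure, only one plausible reading of "hygienic" partitioning: whole templates, rather than individual generated questions, are assigned to a split, so no template is shared between training, validation and test data. The `template_id` field, the function names and the 80/10/10 fractions are assumptions made for illustration.

```python
import random
from collections import defaultdict


def split_by_template(examples, fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign whole templates to train/valid/test so that no template is shared.

    `examples` is a list of dicts with a hypothetical "template_id" key.
    """
    by_template = defaultdict(list)
    for ex in examples:
        by_template[ex["template_id"]].append(ex)

    # Shuffle template ids (not individual examples) reproducibly.
    template_ids = sorted(by_template)
    random.Random(seed).shuffle(template_ids)

    n = len(template_ids)
    n_train = int(n * fractions[0])
    n_valid = int(n * fractions[1])
    groups = {
        "train": template_ids[:n_train],
        "valid": template_ids[n_train:n_train + n_valid],
        "test": template_ids[n_train + n_valid:],
    }
    return {
        name: [ex for t in ids for ex in by_template[t]]
        for name, ids in groups.items()
    }


def shared_templates(split_a, split_b):
    """Return the template ids that occur in both splits (the leakage check)."""
    ids_a = {ex["template_id"] for ex in split_a}
    ids_b = {ex["template_id"] for ex in split_b}
    return ids_a & ids_b


# Toy data: 10 hypothetical templates, 5 generated questions each.
examples = [
    {"template_id": t, "question": f"question {i} from template {t}"}
    for t in range(10)
    for i in range(5)
]
splits = split_by_template(examples)
```

Calling `shared_templates(splits["train"], splits["test"])` on the result returns an empty set, whereas a split made by shuffling individual questions would typically place questions from the same template on both sides.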


