Construction of Paired Knowledge Graph-Text Datasets Informed by Cyclic Evaluation

09/20/2023
by Ali Mousavi et al.

Datasets that pair Knowledge Graphs (KG) and text together (KG-T) can be used to train forward and reverse neural models that generate text from KGs and vice versa. However, models trained on datasets where the KG and text pairs are not equivalent can suffer from more hallucination and poorer recall. In this paper, we verify this empirically by generating datasets with different levels of noise and find that noisier datasets do indeed lead to more hallucination. We argue that the ability of forward and reverse models trained on a dataset to cyclically regenerate the source KG or text is a proxy for the equivalence between the KG and the text in the dataset. Using cyclic evaluation, we find that the manually created WebNLG is much better than the automatically created TeKGen and T-REx. Guided by these observations, we construct a new, improved dataset called LAGRANGE using heuristics meant to improve equivalence between KG and text, and show the impact of each heuristic on cyclic evaluation. We also construct two synthetic datasets using large language models (LLMs) and observe that these are conducive to models that perform well on cyclic generation of text, but less so on cyclic generation of KGs, probably because they lack a consistent underlying ontology.
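The cyclic-evaluation idea can be made concrete with a short sketch. The snippet below is a hypothetical illustration, not the paper's code: `forward_model` (KG-to-text), `reverse_model` (text-to-KG), the exact-match triple F1, and the text-similarity function are all assumptions standing in for whatever trained models and metrics the authors actually use.

```python
# Hypothetical sketch of cyclic evaluation for a paired KG-text dataset.
# forward_model (KG -> text) and reverse_model (text -> KG) stand in for
# models trained on the dataset; the triple-level F1 and text similarity
# used here are assumptions, not necessarily the paper's metrics.

from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


def triple_f1(predicted: List[Triple], gold: List[Triple]) -> float:
    """Exact-match F1 between predicted and gold triple sets."""
    pred, ref = set(predicted), set(gold)
    if not pred or not ref:
        return 0.0
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def cyclic_kg_score(kg: List[Triple],
                    forward_model: Callable[[List[Triple]], str],
                    reverse_model: Callable[[str], List[Triple]]) -> float:
    """KG -> text -> KG: how much of the source KG survives the round trip."""
    text = forward_model(kg)        # generate text from the KG
    kg_back = reverse_model(text)   # extract triples back from that text
    return triple_f1(kg_back, kg)


def cyclic_text_score(text: str,
                      forward_model: Callable[[List[Triple]], str],
                      reverse_model: Callable[[str], List[Triple]],
                      text_sim: Callable[[str, str], float]) -> float:
    """Text -> KG -> text: similarity of the regenerated text to the source.

    text_sim could be BLEU or ROUGE; the choice is an assumption here.
    """
    kg = reverse_model(text)          # extract triples from the text
    text_back = forward_model(kg)     # regenerate text from those triples
    return text_sim(text_back, text)
```

Under this framing, a high round-trip score in both directions suggests the dataset's KG and text sides carry equivalent information, while a low KG-side score (as the paper reports for the LLM-generated synthetic datasets) points to triples that the text does not consistently support.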


Related research

EventNarrative: A large-scale Event-centric Dataset for Knowledge Graph-to-Text Generation (10/30/2021)
We introduce EventNarrative, a knowledge graph-to-text dataset from publ...

WikiGraphs: A Wikipedia Text - Knowledge Graph Paired Dataset (07/20/2021)
We present a new dataset of Wikipedia articles each paired with a knowle...

Text2KGBench: A Benchmark for Ontology-Driven Knowledge Graph Generation from Text (08/04/2023)
The recent advances in large language models (LLM) and foundation models...

MixCE: Training Autoregressive Language Models by Mixing Forward and Reverse Cross-Entropies (05/26/2023)
Autoregressive language models are trained by minimizing the cross-entro...

Exploring Out-of-Distribution Generalization in Text Classifiers Trained on Tobacco-3482 and RVL-CDIP (08/05/2021)
To be robust enough for widespread adoption, document analysis systems i...

Finding Friends and Flipping Frenemies: Automatic Paraphrase Dataset Augmentation Using Graph Theory (11/03/2020)
Most NLP datasets are manually labeled, so suffer from inconsistent labe...
