Layout-Aware Information Extraction for Document-Grounded Dialogue: Dataset, Method and Demonstration

07/14/2022
by Zhenyu Zhang, et al.

Building document-grounded dialogue systems has received growing interest, as documents convey a wealth of human knowledge and are common in enterprises. A central challenge in this setting is how to comprehend and retrieve information from documents. Previous work ignores the visual properties of documents and treats them as plain text, which leaves the document modality incomplete. In this paper, we propose LIE, a Layout-aware document-level Information Extraction dataset, to facilitate the study of extracting both structural and semantic knowledge from visually rich documents (VRDs) so that dialogue systems can generate accurate responses. LIE contains 62k annotations for three extraction tasks from 4,061 pages of product and official documents, making it, to the best of our knowledge, the largest VRD-based information extraction dataset. We also develop benchmark methods that extend token-based language models to take layout features into account, as humans do. Empirical results show that layout is critical for VRD-based extraction, and a system demonstration verifies that the extracted knowledge helps locate the answers that users care about.


