PDFTriage: Question Answering over Long, Structured Documents

09/16/2023
by Jon Saad-Falcon, et al.

Large Language Models (LLMs) struggle with document question answering (QA) when the document does not fit within the limited context length of an LLM. To overcome this issue, most existing work focuses on retrieving the relevant context from the document and representing it as plain text. However, documents such as PDFs, web pages, and presentations are naturally structured, with distinct pages, tables, sections, and so on. Representing such structured documents as plain text is incongruous with the user's mental model of them as richly structured documents. When a system has to query the document for context, this incongruity is brought to the fore, and seemingly trivial questions can trip up the QA system. To bridge this fundamental gap in handling structured documents, we propose an approach called PDFTriage that enables models to retrieve context based on either structure or content. Our experiments demonstrate the effectiveness of the PDFTriage-augmented models across several classes of questions where existing retrieval-augmented LLMs fail. To facilitate further research on this fundamental problem, we release our benchmark dataset consisting of 900+ human-generated questions over 80 structured documents, spanning 10 different categories of question types for document QA.
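
To make the distinction between structure-based and content-based retrieval concrete, the following is a minimal Python sketch. It is not the PDFTriage implementation: the class `StructuredDocument` and the methods `fetch_pages`, `fetch_section`, `fetch_table`, and `search_text` are hypothetical names chosen for illustration, and the keyword-overlap scoring is a simple stand-in for whatever content retriever a real system would use.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Section:
    title: str
    page: int
    text: str


@dataclass
class Table:
    caption: str
    page: int
    rows: list[list[str]]


@dataclass
class StructuredDocument:
    """Hypothetical structured view of a document (illustration only)."""

    sections: list[Section] = field(default_factory=list)
    tables: list[Table] = field(default_factory=list)

    # Structure-based retrieval: select context by where it lives.
    def fetch_pages(self, pages: list[int]) -> list[str]:
        wanted = set(pages)
        return [s.text for s in self.sections if s.page in wanted]

    def fetch_section(self, title: str) -> str | None:
        for s in self.sections:
            if s.title.lower() == title.lower():
                return s.text
        return None

    def fetch_table(self, index: int) -> Table | None:
        if 0 <= index < len(self.tables):
            return self.tables[index]
        return None

    # Content-based retrieval: select context by what it says.
    # Assumption: plain keyword overlap stands in for a real retriever.
    def search_text(self, query: str, top_k: int = 1) -> list[str]:
        terms = set(query.lower().split())
        scored = [
            (len(terms & set(s.text.lower().split())), s.text)
            for s in self.sections
        ]
        scored = [item for item in scored if item[0] > 0]
        scored.sort(key=lambda item: item[0], reverse=True)
        return [text for _, text in scored[:top_k]]
```

Under these assumptions, a question such as "What does page 3 say?" maps to `fetch_pages([3])`, a question about the second table maps to `fetch_table(1)`, and a question with no structural cue falls back to `search_text`. A plain-text retriever has only the last of these options.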


