Schema-Driven Information Extraction from Heterogeneous Tables

by Fan Bai, et al.

In this paper, we explore whether large language models (LLMs) can support cost-efficient information extraction from complex tables. We introduce schema-driven information extraction, a new task that uses LLMs to transform tabular data into structured records following a human-authored schema. To assess various LLMs' capabilities on this task, we develop a benchmark composed of tables from three diverse domains: machine learning papers, chemistry tables, and webpages. Accompanying the benchmark, we present InstrucTE, a table extraction method based on instruction-tuned LLMs. This method requires only a human-constructed extraction schema and incorporates an error-recovery strategy. Notably, InstrucTE demonstrates competitive performance without task-specific labels, achieving F1 scores ranging from 72.3 to 95.7. Moreover, we validate the feasibility of distilling more compact table extraction models to minimize extraction costs and reduce API reliance. This study paves the way for the future development of instruction-following models for cost-efficient table extraction.
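As a rough illustration of what "structured records following a human-authored schema" could look like, here is a minimal sketch. It is not the paper's implementation: the attribute names (`value`, `metric`, `dataset`), the `RESULT_SCHEMA` dictionary, and the `check_record` helper are all hypothetical, and no LLM is called. The sketch only shows how a record produced by a model might be checked against a schema so that violations can be detected, which is one plausible starting point for an error-recovery step.

```python
import json

# Hypothetical schema: attribute name -> expected Python type.
# These attribute names are illustrative assumptions, not the paper's schema.
RESULT_SCHEMA = {"value": str, "metric": str, "dataset": str}


def check_record(raw_json, schema):
    """Parse one model-produced record and list its schema violations.

    Returns (record, errors); record is None if the text cannot be
    parsed into a JSON object at all.
    """
    try:
        record = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return None, ["record is not a JSON object"]

    errors = []
    # Every schema attribute must be present with the expected type.
    for name, expected in schema.items():
        if name not in record:
            errors.append(f"missing attribute: {name}")
        elif not isinstance(record[name], expected):
            errors.append(f"wrong type for attribute: {name}")
    # Attributes outside the schema are flagged as well.
    for name in record:
        if name not in schema:
            errors.append(f"unexpected attribute: {name}")
    return record, errors


if __name__ == "__main__":
    record, errors = check_record('{"value": "95.7", "metric": "F1"}', RESULT_SCHEMA)
    print(errors)  # ['missing attribute: dataset']
```

In a setup like this, a non-empty error list could be fed back to the model in a follow-up prompt or used to discard the record; which of these (if either) matches the paper's error-recovery strategy is not stated in the abstract.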



Schemaless Queries over Document Tables with Dependencies

Unstructured enterprise data such as reports, manuals and guidelines oft...

TabLeX: A Benchmark Dataset for Structure and Content Information Extraction from Scientific Tables

Information Extraction (IE) from the tables present in scientific articl...

DiSCoMaT: Distantly Supervised Composition Extraction from Tables in Materials Science Articles

A crucial component in the curation of KB for a scientific domain is inf...

Text-to-Table: A New Way of Information Extraction

We study a new problem setting of information extraction (IE), referred ...

InstructUIE: Multi-task Instruction Tuning for Unified Information Extraction

Large language models have unlocked strong multi-task capabilities from ...

Relation Extraction from Tables using Artificially Generated Metadata

Relation Extraction (RE) from tables is the task of identifying relation...