Structured information extraction from complex scientific text with fine-tuned large language models

12/10/2022
by Alexander Dunn, et al.

Intelligently extracting and linking complex scientific information from unstructured text is a challenging endeavor, particularly for those inexperienced with natural language processing. Here, we present a simple sequence-to-sequence approach to joint named entity recognition and relation extraction for complex hierarchical information in scientific text. The approach leverages a pre-trained large language model (LLM), GPT-3, that is fine-tuned on approximately 500 pairs of prompts (inputs) and completions (outputs). Information is extracted either from single sentences or across sentences in abstracts/passages, and the output can be returned as simple English sentences or in a more structured format, such as a list of JSON objects. We demonstrate that LLMs trained in this way are capable of accurately extracting useful records of complex scientific knowledge for three representative tasks in materials chemistry: linking dopants with their host materials, cataloging metal-organic frameworks, and general chemistry/phase/morphology/application information extraction. This approach represents a simple, accessible, and highly flexible route to obtaining large databases of structured knowledge extracted from unstructured text. An online demo is available at http://www.matscholar.com/info-extraction.
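
To make the prompt/completion setup more concrete, the following is a minimal sketch of how one fine-tuning pair for the dopant-host task might be assembled and how a JSON-formatted completion could be parsed back into records. It is not the authors' implementation: the separator strings `PROMPT_END` and `COMPLETION_END`, the field names `host` and `dopants`, the helper functions, and the example sentence are all assumptions made for illustration.

```python
import json

# Hypothetical separator and stop strings; the abstract does not specify them.
PROMPT_END = "\n\n###\n\n"
COMPLETION_END = "\n\nEND"


def make_training_pair(passage, records):
    """Build one prompt/completion pair, with the completion as a JSON list."""
    return {
        "prompt": passage.strip() + PROMPT_END,
        "completion": " " + json.dumps(records) + COMPLETION_END,
    }


def parse_completion(completion):
    """Recover the list of records from a completion; return [] if malformed."""
    text = completion.split(COMPLETION_END)[0].strip()
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return []
    return parsed if isinstance(parsed, list) else []


# Invented illustrative sentence, not taken from the paper's dataset.
passage = "ZnO films doped with Al showed improved conductivity."
# Assumed schema: one object per host material with its linked dopants.
records = [{"host": "ZnO", "dopants": ["Al"]}]

pair = make_training_pair(passage, records)
jsonl_line = json.dumps(pair)  # one line of a JSONL fine-tuning file
roundtrip = parse_completion(pair["completion"])
```

Under this sketch, roughly 500 such lines would form the fine-tuning file, and the same `parse_completion` helper would be applied to model outputs at inference time. Returning an empty list for malformed output is one possible design choice; it lets a downstream pipeline skip unparseable completions instead of failing.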


