Pipelines for Procedural Information Extraction from Scientific Literature: Towards Recipes using Machine Learning and Data Science

12/16/2019
by Huichen Yang, et al.

This paper describes a machine learning and data science pipeline for structured information extraction from documents, implemented as a suite of open-source tools and extensions to existing tools. It centers on a methodology for extracting procedural information from published scientific literature in the form of recipes: stepwise procedures for creating an artifact (in this case, synthesizing a nanomaterial). From our overall goal of producing recipes from free text, we derive the technical objectives of a system consisting of the following pipeline stages: document acquisition and filtering, payload extraction, recipe step extraction as a relationship extraction task, recipe assembly, and presentation through an information retrieval interface with question answering (QA) functionality. This system meets computational information and knowledge management (CIKM) requirements of metadata-driven payload extraction, named entity extraction, and relationship extraction from text. Functional contributions described in this paper include semi-supervised machine learning methods for the PDF filtering and payload extraction tasks, followed by structured extraction and data transformation tasks that begin with section extraction, proceed to recipe steps represented as information tuples, and end with assembled recipes. Measurable objective criteria for extraction quality include precision and recall of recipe steps, ordering constraints, and QA accuracy, precision, and recall. Results, key novel contributions, and significant open problems derived from this work center on the attribution of these holistic quality measures to specific machine learning and inference stages of the pipeline, each with its own performance measures. The desired recipes contain identified preconditions, material inputs, and operations, and constitute the overall output generated by our CIKM system.
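
As a rough illustration of one of the quality criteria named above, precision and recall of recipe steps, the sketch below scores a set of extracted steps against a reference set by exact tuple match. The `RecipeStep` fields (`operation`, `material`, `condition`), the exact-match criterion, and the example steps are all assumptions made for illustration; the abstract does not specify the tuple schema or the matching rule the authors use.

```python
from typing import Iterable, NamedTuple, Tuple


class RecipeStep(NamedTuple):
    """Hypothetical information tuple for one recipe step (schema assumed)."""
    operation: str
    material: str
    condition: str


def precision_recall(
    predicted: Iterable[RecipeStep], gold: Iterable[RecipeStep]
) -> Tuple[float, float]:
    """Exact-match precision and recall of predicted steps against gold steps."""
    pred_set, gold_set = set(predicted), set(gold)
    true_positives = len(pred_set & gold_set)
    # Guard against empty sets so the function never divides by zero.
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    return precision, recall


# Made-up example data, not taken from the paper.
gold = [
    RecipeStep("dissolve", "HAuCl4", "in water"),
    RecipeStep("heat", "solution", "to boiling"),
    RecipeStep("add", "sodium citrate", "while stirring"),
]
predicted = [
    RecipeStep("dissolve", "HAuCl4", "in water"),
    RecipeStep("add", "sodium citrate", "while stirring"),
    RecipeStep("cool", "solution", "to room temperature"),
    RecipeStep("filter", "solution", "through membrane"),
]

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Exact matching is the strictest possible choice; a real evaluation might instead credit partial matches per field, which would change both scores.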
