Semi-automatic staging area for high-quality structured data extraction from scientific literature

09/19/2023
by Luca Foppiano, et al.

In this study, we propose a staging area for ingesting new experimental data on superconductors, machine-collected from scientific articles, into SuperCon. Our objective is to make updating SuperCon more efficient while maintaining or improving data quality. We present a semi-automatic staging area driven by a workflow that combines automatic and manual processes on the extracted database. An automatic anomaly detection process pre-screens the collected data. Users can then manually correct any errors through a user interface tailored to simplify verifying the data against the original PDF documents. Additionally, when a record is corrected, its raw data is collected and used as training data to improve the machine learning models. Evaluation experiments demonstrate that our staging area significantly improves curation quality. We compare the interface with the traditional manual approach of reading PDF documents and recording information in an Excel document. Using the interface boosts the precision and recall by 6 average increase of 40
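The workflow described in the abstract (automatic pre-screening, manual correction, collection of corrected records as training data) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the `Record` fields, the status names, the rule-based `prescreen` check, and the 300 K plausibility threshold are all hypothetical stand-ins, and the actual anomaly detection process may work quite differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    """Hypothetical extracted record: a material and its critical temperature."""
    material: str
    tc_kelvin: Optional[float]
    status: str = "new"  # assumed lifecycle: new -> valid | flagged -> corrected


def prescreen(record: Record, max_tc: float = 300.0) -> Record:
    """Flag records that look implausible (assumed rule, not the paper's method)."""
    suspicious = (
        not record.material.strip()
        or record.tc_kelvin is None
        or not (0 < record.tc_kelvin <= max_tc)
    )
    record.status = "flagged" if suspicious else "valid"
    return record


def correct(record: Record, material: str, tc_kelvin: float,
            training_data: list) -> Record:
    """Apply a curator's correction and keep the before/after pair for retraining."""
    training_data.append({
        "before": (record.material, record.tc_kelvin),
        "after": (material, tc_kelvin),
    })
    record.material = material
    record.tc_kelvin = tc_kelvin
    record.status = "corrected"
    return record
```

In this sketch a curator would only need to open the records whose status is `flagged`, and each call to `correct` leaves a before/after pair that could later be fed back to the extraction models.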


