Chemical Names Standardization using Neural Sequence to Sequence Model

01/21/2019
by   Junlang Zhan, et al.
Chemical information extraction converts chemical knowledge in text into a true chemical database, a text processing task that relies heavily on identifying and standardizing chemical compound names. Once the systematic name of a chemical compound is known, it can be converted readily into the molecular formula that is ultimately required. However, many chemical substances appear under names other than their systematic names, which poses a great challenge for this task. In this paper, we propose a framework that automatically standardizes non-systematic names to their corresponding systematic names using spelling error correction, byte pair encoding tokenization, and a neural sequence-to-sequence model. Our framework is trained end to end and is fully data-driven. It achieves a standardization accuracy of 54.04 on the test dataset, a large improvement over the previous state-of-the-art result.
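To make the byte pair encoding step concrete, here is a minimal sketch of how BPE merges could be learned from chemical names and applied to an unseen one. It is an illustration only, not the authors' implementation: the function names (merge_pair, learn_bpe, apply_bpe), the toy name list, and the number of merges are all assumptions, and the abstract does not state the settings actually used.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules, starting from single characters."""
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs across the corpus.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken by the pair itself
        # so the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def apply_bpe(word, merges):
    """Tokenize `word` by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus; a real system would train on a large name dataset.
toy_names = ["ethanol", "methanol", "propanol"]
toy_merges = learn_bpe(toy_names, num_merges=3)
tokens = apply_bpe("butanol", toy_merges)
```

On this toy corpus the shared suffix is merged into a single subword, so an unseen name such as "butanol" is split into a few characters plus a reusable suffix token. That sharing of subwords across names is the property that makes BPE a plausible fit for a sequence-to-sequence model over chemical names.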


research
11/17/2018

Chemical Structure Elucidation from Mass Spectrometry by Matching Substructures

Chemical structure elucidation is a serious bottleneck in analytical che...
research
11/13/2017

"Found in Translation": Predicting Outcomes of Complex Organic Chemistry Reactions using Neural Sequence-to-Sequence Models

There is an intuitive analogy of an organic chemist's understanding of a...
research
03/27/2019

Multilevel Text Normalization with Sequence-to-Sequence Networks and Multisource Learning

We define multilevel text normalization as sequence-to-sequence processi...
research
02/19/2022

Image-to-Graph Transformers for Chemical Structure Recognition

For several decades, chemical knowledge has been published in written te...
research
04/28/2022

A Probabilistic Chemical Programmable Computer

The exponential growth of the power of modern digital computers is based...
research
07/25/2022

SFILES 2.0: An extended text-based flowsheet representation

SFILES is a text-based notation for chemical process flowsheets. It was ...
research
03/18/2020

TTTTTackling WinoGrande Schemas

We applied the T5 sequence-to-sequence model to tackle the AI2 WinoGrand...
