IMG2SMI: Translating Molecular Structure Images to Simplified Molecular-input Line-entry System

09/03/2021
by   Daniel Campos, et al.
0

Like many scientific fields, new chemistry literature has grown at a staggering pace, with thousands of papers released every month. A large portion of chemistry literature focuses on new molecules and reactions between molecules. Most vital information is conveyed through 2-D images of molecules, representing the underlying molecules or reactions described. In order to ensure reproducible and machine-readable molecule representations, text-based molecule descriptors like SMILES and SELFIES were created. These text-based molecule representations provide molecule generation but are unfortunately rarely present in published literature. In the absence of molecule descriptors, the generation of molecule descriptors from the 2-D images present in the literature is necessary to understand chemistry literature at scale. Successful methods such as Optical Structure Recognition Application (OSRA), and ChemSchematicResolver are able to extract the locations of molecules structures in chemistry papers and infer molecular descriptions and reactions. While effective, existing systems expect chemists to correct outputs, making them unsuitable for unsupervised large-scale data mining. Leveraging the task formulation of image captioning introduced by DECIMER, we introduce IMG2SMI, a model which leverages Deep Residual Networks for image feature extraction and an encoder-decoder Transformer layers for molecule description generation. Unlike previous Neural Network-based systems, IMG2SMI builds around the task of molecule description generation, which enables IMG2SMI to outperform OSRA-based systems by 163 MACCS Fingerprint Tanimoto Similarity. Additionally, to facilitate further research on this task, we release a new molecule prediction dataset. including 81 million molecules for molecule description generation

READ FULL TEXT

page 1

page 2

page 3

page 4

research
05/26/2021

Predicting Aqueous Solubility of Organic Molecules Using Deep Learning Models with Varied Molecular Representations

Determining the aqueous solubility of molecules is a vital step in many ...
research
05/01/2022

Molecular Identification from AFM images using the IUPAC Nomenclature and Attribute Multimodal Recurrent Neural Networks

Despite being the main tool to visualize molecules at the atomic scale, ...
research
09/03/2019

Data-Driven Approach to Encoding and Decoding 3-D Crystal Structures

Generative models have achieved impressive results in many domains inclu...
research
05/28/2020

Targeting SARS-CoV-2 with AI- and HPC-enabled Lead Generation: A First Data Release

Researchers across the globe are seeking to rapidly repurpose existing d...
research
08/09/2020

Augmenting Molecular Images with Vector Representations as a Featurization Technique for Drug Classification

One of the key steps in building deep learning systems for drug classifi...
research
06/13/2019

FPScreen: A Rapid Similarity Search Tool for Massive Molecular Library Based on Molecular Fingerprint Comparison

We designed a fast similarity search engine for large molecular librarie...
research
01/12/2021

AI- and HPC-enabled Lead Generation for SARS-CoV-2: Models and Processes to Extract Druglike Molecules Contained in Natural Language Text

Researchers worldwide are seeking to repurpose existing drugs or discove...

Please sign up or login with your details

Forgot password? Click here to reset