SMILES Transformer: Pre-trained Molecular Fingerprint for Low Data Drug Discovery

11/12/2019
by Shion Honda, et al.

In drug-discovery-related tasks such as virtual screening, machine learning is emerging as a promising way to predict molecular properties. Conventionally, molecular fingerprints (numerical representations of molecules) are calculated with rule-based algorithms that map molecules to a sparse, discrete space. However, these rule-based fingerprints perform poorly with shallow prediction models or on small datasets. To address this issue, we present SMILES Transformer. Inspired by the Transformer and pre-trained language models from natural language processing, SMILES Transformer learns molecular fingerprints through unsupervised pre-training of a sequence-to-sequence language model on a large corpus of SMILES, a text representation system for molecules. We benchmarked against existing fingerprints and graph-based methods on 10 datasets and demonstrated the superiority of the proposed method in small-data settings, where pre-training facilitated good generalization. Moreover, we define a novel metric that concurrently measures model accuracy and data efficiency.
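To make the pipeline concrete, below is a minimal PyTorch-style sketch of the idea described in the abstract: a sequence-to-sequence Transformer is pre-trained to reconstruct SMILES strings, and the pooled encoder states are reused as a fixed-length fingerprint for a shallow downstream model. The class and function names (SmilesAutoencoder, extract_fingerprint), the toy character vocabulary, and the hyperparameters are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the fingerprinting idea, assuming a PyTorch-style
# Transformer autoencoder over SMILES strings. All names below are
# illustrative, not taken from the paper's codebase.

import torch
import torch.nn as nn

# Toy character-level vocabulary; real SMILES tokenizers also handle
# multi-character atoms such as "Cl" and "Br".
VOCAB = {ch: i for i, ch in enumerate("^$#()=+-123456789cnoCNOFS[]@Hl")}

def encode_smiles(smiles: str, max_len: int = 64) -> torch.Tensor:
    """Map a SMILES string to a padded tensor of token ids."""
    ids = [VOCAB[c] for c in smiles if c in VOCAB][:max_len]
    ids += [0] * (max_len - len(ids))
    return torch.tensor(ids)

class SmilesAutoencoder(nn.Module):
    """Sequence-to-sequence Transformer trained to reproduce its input SMILES."""
    def __init__(self, vocab_size: int = len(VOCAB), d_model: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4,
            num_encoder_layers=4, num_decoder_layers=4,
            batch_first=True,
        )
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, src_ids: torch.Tensor, tgt_ids: torch.Tensor) -> torch.Tensor:
        # Reconstruction objective (target shifting and masks omitted for brevity).
        src = self.embed(src_ids)
        tgt = self.embed(tgt_ids)
        return self.out(self.transformer(src, tgt))

    @torch.no_grad()
    def extract_fingerprint(self, src_ids: torch.Tensor) -> torch.Tensor:
        """Mean-pool the encoder outputs into a fixed-length fingerprint."""
        memory = self.transformer.encoder(self.embed(src_ids))
        return memory.mean(dim=1)  # shape: (batch, d_model)

# Usage: after unsupervised pre-training on a large unlabeled SMILES corpus,
# the fingerprints feed a shallow model (e.g. logistic regression) trained on
# the small labeled dataset.
model = SmilesAutoencoder()
batch = torch.stack([encode_smiles("CCO"), encode_smiles("c1ccccc1")])
fingerprints = model.extract_fingerprint(batch)
print(fingerprints.shape)  # torch.Size([2, 256])
```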

