Stepping Back to SMILES Transformers for Fast Molecular Representation Inference

12/26/2021
by Wenhao Zhu, et al.

At the intersection of molecular science and deep learning, tasks such as virtual screening have driven the need for high-throughput molecular representation generation over large chemical databases. However, because SMILES strings are the most common storage format for molecules, using deep graph models to extract molecular features from raw SMILES data requires a SMILES-to-graph conversion, which significantly slows down the whole pipeline. Deriving molecular representations directly from SMILES is feasible, yet a performance gap remains between existing non-pretrained SMILES-based models and graph-based models on large-scale benchmarks, while pretrained models are resource-demanding to train. To address this issue, we propose ST-KD, an end-to-end SMILES Transformer for molecular representation learning boosted by Knowledge Distillation. To enable knowledge transfer from graph Transformers to ST-KD, we redesign the attention layers and introduce a pre-transformation step that tokenizes the SMILES strings and injects structure-based positional embeddings. Without expensive pretraining, ST-KD achieves competitive results on the standard molecular benchmarks PCQM4M-LSC and QM9, with 3-14× faster inference than existing graph models.
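The pre-transformation step begins with tokenizing raw SMILES. As a minimal sketch of how such a tokenizer can work, the snippet below uses the regular expression popularized by Schwaller et al.'s Molecular Transformer; the abstract does not specify ST-KD's actual tokenizer or its structure-based positional embeddings, so every name here is illustrative.

```python
import re

# Regex that splits SMILES into atom/bond/ring tokens (Schwaller et al. style);
# ST-KD's own tokenizer may differ in detail.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\."
    r"|=|#|-|\+|\\|\/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into chemically meaningful tokens."""
    tokens = SMILES_TOKEN.findall(smiles)
    # Sanity check: tokenization must be lossless.
    assert "".join(tokens) == smiles, f"untokenizable characters in {smiles!r}"
    return tokens

print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin
# ['C', 'C', '(', '=', 'O', ')', 'O', 'c', '1', 'c', 'c', 'c', 'c', 'c', '1',
#  'C', '(', '=', 'O', ')', 'O']
```

The knowledge-transfer objective can likewise be sketched as a weighted sum of a representation-matching term against a frozen graph-Transformer teacher and the downstream task loss. The PyTorch sketch below assumes hypothetical `encode` and `head` methods, batch keys, and an MSE/MAE combination; the paper's exact objective may differ.

```python
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, batch, alpha: float = 0.5):
    """One illustrative training step for SMILES-Transformer distillation.

    `student.encode`, `teacher.encode`, `student.head`, and the batch keys
    are assumed names, not ST-KD's published interface.
    """
    token_ids = batch["token_ids"]   # tokenized SMILES, shape (B, L)
    graph = batch["graph"]           # molecular graphs for the teacher
    target = batch["target"]         # e.g. HOMO-LUMO gap (PCQM4M-LSC)

    with torch.no_grad():            # the graph-Transformer teacher is frozen
        t_repr = teacher.encode(graph)        # (B, d) teacher embedding

    s_repr = student.encode(token_ids)        # (B, d) student embedding
    pred = student.head(s_repr).squeeze(-1)   # (B,) property prediction

    loss_kd = F.mse_loss(s_repr, t_repr)      # match teacher representations
    loss_task = F.l1_loss(pred, target)       # MAE, the PCQM4M-LSC metric
    return alpha * loss_kd + (1.0 - alpha) * loss_task
```

At inference time only the student runs on raw SMILES, which is consistent with the reported speedup: the SMILES-to-graph conversion identified above as the bottleneck is skipped entirely.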


