MolXPT: Wrapping Molecules with Text for Generative Pre-training

05/18/2023
by   Zequn Liu, et al.

Generative pre-trained Transformers (GPT) have demonstrated great success in natural language processing, and related techniques have been adapted to molecular modeling. Considering that text is the most important record of scientific discovery, in this paper we propose MolXPT, a unified language model of text and molecules pre-trained on SMILES (a sequence representation of molecules) wrapped by text. Briefly, we detect the molecule names in each sequence and replace them with the corresponding SMILES. In this way, the SMILES can leverage information from the surrounding text, and vice versa. The wrapped sequences, together with text sequences from PubMed and SMILES sequences from PubChem, are all fed into a language model for pre-training. Experimental results demonstrate that MolXPT outperforms strong baselines for molecular property prediction on MoleculeNet, performs comparably to the best model on text-molecule translation while using less than half of its parameters, and enables zero-shot molecular generation without finetuning.
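The "wrapping" step described above can be sketched as a simple name-to-SMILES substitution. The dictionary, the `<som>`/`<eom>` delimiter tokens, and the regex-based name detection below are illustrative assumptions, not the paper's actual pipeline (which links entities against PubChem at scale):

```python
import re

# Hypothetical name-to-SMILES lookup; the paper resolves molecule names
# against PubChem, but a plain dictionary illustrates the idea.
NAME_TO_SMILES = {
    "aspirin": "CC(=O)Oc1ccccc1C(=O)O",
    "caffeine": "Cn1cnc2c1c(=O)n(C)c(=O)n2C",
}

def wrap_molecules(text, lookup=NAME_TO_SMILES):
    """Replace recognized molecule names with their SMILES strings,
    delimited by special tokens so the model can tell text from SMILES."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(name) for name in lookup) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(
        lambda m: f"<som> {lookup[m.group(1).lower()]} <eom>", text
    )

print(wrap_molecules("Aspirin inhibits COX enzymes."))
# -> <som> CC(=O)Oc1ccccc1C(=O)O <eom> inhibits COX enzymes.
```

Sequences wrapped this way are mixed with plain PubMed text and plain PubChem SMILES, so the model sees each modality both alone and in context.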


Related research

11/12/2019 — SMILES Transformer: Pre-trained Molecular Fingerprint for Low Data Drug Discovery
In drug-discovery-related tasks such as virtual screening, machine learn...

07/06/2022 — Pre-training Transformers for Molecular Property Prediction Using Reaction Prediction
Molecular property prediction is essential in chemistry, especially for ...

06/13/2023 — Automated 3D Pre-Training for Molecular Property Prediction
Molecular property prediction is an important problem in drug discovery ...

09/01/2023 — Geometry-aware Line Graph Transformer Pre-training for Molecular Property Prediction
Molecular property prediction with deep learning has gained much attenti...

06/21/2023 — Interactive Molecular Discovery with Natural Language
Natural language is expected to be a key medium for various human-machin...

06/17/2021 — Dual-view Molecule Pre-training
Inspired by its success in natural language processing and computer visi...

10/10/2021 — On Automatic Text Extractive Summarization Based on Graph and pre-trained Language Model Attention
Representing text as graph to solve the summarization task has been disc...
