Do Large Scale Molecular Language Representations Capture Important Structural Information?

06/17/2021
by   Jerret Ross, et al.

Predicting chemical properties from the structure of a molecule is of great importance in many applications, including drug discovery and material design. Machine-learning-based molecular property prediction holds the promise of accurate predictions at far lower computational cost than, for example, Density Functional Theory (DFT) calculations. Features extracted from molecular graphs using graph neural networks trained in a supervised manner have emerged as strong baselines for such tasks. However, the vast chemical space together with the limited availability of labels makes supervised learning challenging, calling for a general-purpose molecular representation. Recently, pre-trained transformer-based language models (PTLMs) trained on large unlabeled corpora have produced state-of-the-art results on many downstream natural language processing tasks. Inspired by this development, here we present molecular embeddings obtained by training an efficient transformer encoder model, referred to as MoLFormer. The model uses a linear attention mechanism and highly parallelized training on 1D SMILES sequences of 1.1 billion unlabeled molecules from the PubChem and ZINC datasets. Experiments show that the learned molecular representation performs competitively with existing graph-based and fingerprint-based supervised learning baselines on the challenging tasks of predicting properties of QM8 and QM9 molecules. Further task-specific fine-tuning of the MoLFormer representation improves performance on several of these property prediction benchmarks. These results provide encouraging evidence that large-scale molecular language models can capture sufficient structural information to accurately predict quantum chemical properties and beyond.
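Treating molecules as 1D SMILES sequences, as the abstract describes, requires splitting each string into chemically meaningful tokens before feeding it to a transformer encoder. A minimal sketch of such a tokenizer is shown below, using a common regex pattern over SMILES grammar elements; this is an illustrative reconstruction, not MoLFormer's actual tokenizer, and the pattern covers only a representative subset of SMILES.

```python
import re

# Regex over common SMILES elements: bracketed atoms like [C@H],
# two-letter organic-subset atoms (Br, Cl, Si, Se), stereo markers,
# ring-closure labels, single atoms, bonds, and branch parentheses.
# This is a hypothetical, simplified pattern for illustration.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|Si|Se|se|@@|@|%\d{2}"
    r"|[BCNOPSFIbcnops]|[=#\-\+\\/().:~*$]|\d)"
)

def tokenize_smiles(smiles: str) -> list:
    """Split a SMILES string into a list of tokens."""
    return SMILES_TOKEN_PATTERN.findall(smiles)

# Example: caffeine. Joining the tokens recovers the original string,
# so no characters are silently dropped by the pattern.
tokens = tokenize_smiles("CN1C=NC2=C1C(=O)N(C(=O)N2C)C")
print(tokens)
```

Each token would then be mapped to a vocabulary index and embedded, after which the transformer encoder (with linear attention, in MoLFormer's case) processes the sequence like any other language-model input.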


