SELFormer: Molecular Representation Learning via SELFIES Language Models

04/10/2023
by   Atakan Yüksel, et al.

Automated computational analysis of the vast chemical space is critical for numerous fields of research such as drug discovery and materials science. Representation learning techniques have recently been employed with the primary objective of generating compact and informative numerical expressions of complex data. One approach to efficiently learning molecular representations is to process string-based notations of chemicals via natural language processing (NLP) algorithms. The majority of the methods proposed so far utilize SMILES notations for this purpose; however, SMILES is associated with numerous problems related to validity and robustness, which may prevent the model from effectively uncovering the knowledge hidden in the data. In this study, we propose SELFormer, a transformer architecture-based chemical language model that utilizes a 100% valid, compact and expressive notation, SELFIES, as input, in order to learn flexible and high-quality molecular representations. SELFormer is pre-trained on two million drug-like compounds and fine-tuned for diverse molecular property prediction tasks. Our performance evaluation has revealed that SELFormer outperforms all competing methods, including graph learning-based approaches and SMILES-based chemical language models, on predicting aqueous solubility of molecules and adverse drug reactions. We also visualized the molecular representations learned by SELFormer via dimensionality reduction, which indicated that even the pre-trained model can discriminate molecules with differing structural properties. We shared SELFormer as a programmatic tool, together with its datasets and pre-trained models. Overall, our research demonstrates the benefit of using SELFIES notation in the context of chemical language modeling and opens up new possibilities for the design and discovery of novel drug candidates with desired features.
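To make the idea of feeding SELFIES strings to a language model concrete, here is a minimal sketch of how such a string could be split into symbols and mapped to integer ids. It is not SELFormer's tokenizer; the helper names `split_selfies` and `build_vocab` and the frequency-ordered id scheme are assumptions made for illustration only, and the two example strings (ethanol and benzene) are written by hand rather than taken from the paper's data.

```python
import re
from collections import Counter

# Each SELFIES symbol is one bracketed unit, e.g. "[C]" or "[=Branch1]".
SELFIES_SYMBOL = re.compile(r"\[[^\[\]]+\]")


def split_selfies(selfies):
    """Split a SELFIES string into its bracketed symbols (hypothetical helper)."""
    tokens = SELFIES_SYMBOL.findall(selfies)
    # Reject input that is not made up entirely of bracketed symbols.
    if "".join(tokens) != selfies:
        raise ValueError(f"not a well-formed SELFIES string: {selfies!r}")
    return tokens


def build_vocab(corpus):
    """Assign integer ids to symbols, most frequent first (assumed scheme)."""
    counts = Counter(tok for s in corpus for tok in split_selfies(s))
    return {tok: i for i, (tok, _) in enumerate(counts.most_common())}


# Hand-written examples: ethanol and benzene in SELFIES notation.
corpus = ["[C][C][O]", "[C][=C][C][=C][C][=C][Ring1][=Branch1]"]
vocab = build_vocab(corpus)
ethanol_ids = [vocab[tok] for tok in split_selfies("[C][C][O]")]
```

Because every symbol is a self-delimiting bracketed unit, the split needs no chemistry-aware parsing, which is one practical reason a SELFIES corpus is convenient input for NLP-style models.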

research
06/17/2021

Do Large Scale Molecular Language Representations Capture Important Structural Information?

Predicting chemical properties from the structure of a molecule is of gr...
research
06/21/2023

Interactive Molecular Discovery with Natural Language

Natural language is expected to be a key medium for various human-machin...
research
02/10/2020

Exploring Chemical Space using Natural Language Processing Methodologies for Drug Discovery

Text-based representations of chemicals and proteins can be thought of a...
research
08/13/2022

Cloud-Based Real-Time Molecular Screening Platform with MolFormer

With the prospect of automating a number of chemical tasks with high fid...
research
09/17/2023

Structure to Property: Chemical Element Embeddings and a Deep Learning Approach for Accurate Prediction of Chemical Properties

The application of machine learning (ML) techniques in computational che...
research
09/18/2021

MM-Deacon: Multimodal molecular domain embedding analysis via contrastive learning

Molecular representation learning plays an essential role in cheminforma...
research
03/21/2023

Difficulty in learning chirality for Transformer fed with SMILES

Recent years have seen development of descriptor generation based on rep...
