Group SELFIES: A Robust Fragment-Based Molecular String Representation

11/23/2022
by   Austin Cheng, et al.

We introduce Group SELFIES, a molecular string representation that uses group tokens to represent functional groups or entire substructures while maintaining chemical robustness guarantees. Molecular string representations, such as SMILES and SELFIES, serve as the basis for molecular generation and optimization in chemical language models, deep generative models, and evolutionary methods. While SMILES and SELFIES represent molecules at the atomic level, Group SELFIES builds on the chemical robustness guarantees of SELFIES by adding group tokens, which makes the representation more flexible. Moreover, the group tokens can exploit the inductive biases of molecular fragments that capture meaningful chemical motifs. Our experiments demonstrate both advantages: Group SELFIES improves distribution learning on common molecular datasets, and random sampling of Group SELFIES strings yields higher-quality molecules than random sampling of regular SELFIES strings. Our open-source implementation of Group SELFIES is available online, and we hope it will aid future research in molecular generation and optimization.
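To make the idea of a robustness guarantee with group tokens concrete, the following is a minimal toy sketch, not the Group SELFIES implementation. It assumes a drastically simplified setting: molecules are linear chains with no branches or rings, and the unit table, the token spelling, the `benzene` group with two single-bond attachment points, and the helper names `parse`, `decode_chain`, and `is_valid` are all hypothetical choices made for illustration. The point it shows is that a decoder which lowers each requested bond order to what the neighbouring units can still accept turns every token sequence, including random ones, into a valence-respecting structure.

```python
import random

# Toy unit table: name -> (total bonding capacity, max order of a single bond).
# Atoms use common valences. "benzene" is a hypothetical group token with two
# single-bond attachment points; it is an assumption for illustration only.
UNITS = {
    "C": (4, 3),
    "N": (3, 3),
    "O": (2, 2),
    "F": (1, 1),
    "benzene": (2, 1),
}
BOND_ORDER = {"": 1, "=": 2, "#": 3}

# Every bond prefix combined with every unit, e.g. "[C]", "[=O]", "[#benzene]".
ALPHABET = [f"[{prefix}{unit}]" for unit in UNITS for prefix in BOND_ORDER]


def parse(token):
    """Split a token such as '[=C]' into (requested bond order, unit name)."""
    body = token[1:-1]
    prefix = body[0] if body[0] in "=#" else ""
    return BOND_ORDER[prefix], body[len(prefix):]


def decode_chain(tokens):
    """Decode tokens into a linear chain of (unit, bond order to previous unit).

    Requested bond orders are lowered to what the previous and current units
    can still accept, and decoding stops once the chain end is saturated.
    """
    chain = []
    free, prev_max = 0, 0  # remaining capacity and per-bond cap of the previous unit
    for token in tokens:
        requested, unit = parse(token)
        capacity, unit_max = UNITS[unit]
        if chain:
            if free == 0:
                break  # nothing can attach to a saturated chain end
            order = min(requested, free, prev_max, unit_max)
        else:
            order = 0  # the first unit has no previous neighbour
        chain.append((unit, order))
        free, prev_max = capacity - order, unit_max
    return chain


def is_valid(chain):
    """Check that no unit exceeds its capacity or its per-bond cap."""
    for i, (unit, order) in enumerate(chain):
        capacity, unit_max = UNITS[unit]
        next_order = chain[i + 1][1] if i + 1 < len(chain) else 0
        if order + next_order > capacity or max(order, next_order) > unit_max:
            return False
    return True


# Randomly sample token sequences and check that each one decodes validly.
rng = random.Random(0)
samples = [[rng.choice(ALPHABET) for _ in range(rng.randint(1, 10))]
           for _ in range(1000)]
print(all(is_valid(decode_chain(sample)) for sample in samples))
```

In this toy, a group token occupies one position in the string yet stands for a whole fragment, so a random sequence can contain an intact ring without the decoder having to assemble it atom by atom. That is a rough analogue of the motivation given above; the real representation also handles branches, rings, and richer attachment rules that this sketch omits.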


