Language models can generate molecules, materials, and protein binding sites directly in three dimensions as XYZ, CIF, and PDB files

05/09/2023
by Daniel Flam-Shepherd, et al.

Language models are powerful tools for molecular design. Currently, the dominant paradigm is to parse molecular graphs into linear string representations that can easily be trained on. This approach has been very successful; however, it is limited to chemical structures that can be completely represented by a graph, such as organic molecules, while materials and biomolecular structures like protein binding sites require a more complete representation that includes the relative positioning of their atoms in space. In this work, we show how language models, without any architecture modifications and trained using next-token prediction, can generate novel and valid structures in three dimensions from various substantially different distributions of chemical structures. In particular, we demonstrate that language models trained on sequences derived directly from chemical file formats like XYZ files, Crystallographic Information Files (CIFs), or Protein Data Bank files (PDBs) can directly generate molecules, crystals, and protein binding sites in three dimensions. Furthermore, despite being trained on chemical file sequences, language models still achieve performance comparable to state-of-the-art models that use graph and graph-derived string representations, as well as other domain-specific 3D generative models. In doing so, we demonstrate that it is not necessary to use simplified molecular representations to train chemical language models, and that they are powerful generative models capable of directly exploring chemical space in three dimensions for very different structures.
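To make the idea of "sequences derived directly from chemical file formats" concrete, the following is a minimal sketch of how an XYZ file might be flattened into a token sequence suitable for next-token prediction, and turned back into a file. The abstract does not specify the tokenization, so everything here is an assumption for illustration: the function names `xyz_to_tokens` and `tokens_to_xyz`, the special tokens `<bos>`, `<nl>`, and `<eos>`, and the choice to round coordinates to two decimals are all hypothetical.

```python
# Illustrative sketch only: the token scheme below is an assumption,
# not the tokenization used in the paper.

WATER_XYZ = """3
water
O 0.000 0.000 0.117
H 0.000 0.757 -0.471
H 0.000 -0.757 -0.471
"""


def xyz_to_tokens(xyz_text, decimals=2):
    """Flatten an XYZ file into tokens: element symbol, then x, y, z per atom."""
    lines = xyz_text.strip().splitlines()
    n_atoms = int(lines[0].strip())
    # The second line of an XYZ file is a free-form comment; skip it.
    atom_lines = lines[2:2 + n_atoms]
    tokens = ["<bos>"]
    for line in atom_lines:
        symbol, x, y, z = line.split()[:4]
        tokens.append(symbol)
        # Rounding coordinates keeps the (hypothetical) vocabulary small.
        for coord in (x, y, z):
            tokens.append(f"{float(coord):.{decimals}f}")
        tokens.append("<nl>")  # marks the end of one atom record
    tokens.append("<eos>")
    return tokens


def tokens_to_xyz(tokens, comment="generated"):
    """Rebuild XYZ text from a token sequence produced by xyz_to_tokens."""
    atoms, current = [], []
    for tok in tokens:
        if tok in ("<bos>", "<eos>"):
            continue
        if tok == "<nl>":
            if current:
                atoms.append(" ".join(current))
            current = []
        else:
            current.append(tok)
    return "\n".join([str(len(atoms)), comment] + atoms)


tokens = xyz_to_tokens(WATER_XYZ)
# Next-token prediction pairs: each prefix predicts the token that follows it.
pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
print(tokens[:6])
print(tokens_to_xyz(tokens))
```

Under this scheme a model sees only a flat sequence of symbols and numbers, so no architectural change is needed to handle 3D positions. CIF and PDB files could in principle be handled the same way, with extra tokens for lattice parameters or residue fields, though that extension is likewise an assumption here.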


