SMILES-X: autonomous molecular compounds characterization for small datasets without descriptors

06/20/2019
by Guillaume Lambard, et al.

In materials science and related fields, small datasets (≪1000 samples) are common. When conventional machine learning algorithms are used to infer physicochemical properties, our ability to characterize compounds therefore depends heavily on our theoretical and empirical knowledge, as well as on our aptitude to develop efficient human-engineered descriptors. Additionally, deep learning techniques are often neglected in this setting because of the common belief that they require large amounts of data. In this article, we tackle the scarcity of data on molecular compounds paired with their physicochemical properties, and the difficulty of developing novel task-specific descriptors, by proposing the SMILES-X. The SMILES-X is an autonomous pipeline for the characterization of molecular compounds based on an {Embed-Encode-Attend-Predict} neural architecture that processes textual data, a data-specific Bayesian optimization of its hyper-parameters, and an augmentation of small datasets that arises naturally from the non-canonical SMILES format. The SMILES-X achieves new state-of-the-art results in the inference of aqueous solubility (RMSE_test ≃ 0.57 ± 0.07 mol/L), hydration free energy (RMSE_test ≃ 0.81 ± 0.22 kcal/mol, ∼24.5% better than molecular dynamics simulations), and octanol/water distribution coefficient (RMSE_test ≃ 0.59 ± 0.02 for LogD at pH 7.4) of molecular compounds. The SMILES-X is intended to become an important asset in the toolkit of materials scientists and chemists for autonomously characterizing molecular compounds, and for improving task-specific knowledge through the interpretations of the outcomes proposed here. The source code for the SMILES-X is available at https://github.com/GLambard/SMILES-X.
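The results above are quoted as RMSE_test ≃ mean ± spread, together with a relative improvement over a baseline. The sketch below shows one common way such figures can be computed. It assumes that the spread is the sample standard deviation over several test splits, which the abstract does not state. The function names `rmse`, `summarize`, and `relative_improvement` are hypothetical and are not taken from the SMILES-X code base, and the numbers are toy values, not results from the paper.

```python
import math
import statistics


def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length, non-empty sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def summarize(rmse_per_split):
    """Mean and sample standard deviation of the RMSE over several splits.

    Assumption: the reported "±" is a standard deviation over splits.
    """
    return statistics.mean(rmse_per_split), statistics.stdev(rmse_per_split)


def relative_improvement(rmse_model, rmse_baseline):
    """Fractional RMSE reduction of a model with respect to a baseline."""
    return (rmse_baseline - rmse_model) / rmse_baseline


# Toy (true, predicted) pairs for two test splits; illustration only.
toy_splits = [
    ([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]),
    ([0.0, 1.0, 2.0], [1.0, 1.0, 2.0]),
]
per_split = [rmse(t, p) for t, p in toy_splits]
mean_rmse, std_rmse = summarize(per_split)
print(f"RMSE_test = {mean_rmse:.2f} ± {std_rmse:.2f}")
```

Under this reading, a statement such as "∼24.5% better" corresponds to `relative_improvement(rmse_model, rmse_baseline)` being about 0.245.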


