Molecular fingerprints, i.e. feature vectors describing atomistic neighborhood configurations, is an important abstraction and a key ingredient for data-driven modeling of potential energy surface and interatomic force. In this paper, we present the Density-Encoded Canonically Aligned Fingerprint (DECAF) fingerprint algorithm, which is robust and efficient, for fitting per-atom scalar and vector quantities. The fingerprint is essentially a continuous density field formed through the superimposition of smoothing kernels centered on the atoms. Rotational invariance of the fingerprint is achieved by aligning, for each fingerprint instance, the neighboring atoms onto a local canonical coordinate frame computed from a kernel minisum optimization procedure. We show that this approach is superior over PCA-based methods especially when the atomistic neighborhood is sparse and/or contains symmetry. We propose that the `distance' between the density fields be measured using a volume integral of their pointwise difference. This can be efficiently computed using optimal quadrature rules, which only require discrete sampling at a small number of grid points. We also experiment on the choice of weight functions for constructing the density fields, and characterize their performance for fitting interatomic potentials. The applicability of the fingerprint is demonstrated through a set of benchmark problems.
09/26/2017 ∙ by Yu-Hang Tang, et al. ∙ 0 ∙ share
Data-driven prediction of molecular properties presents unique challenges to the design of machine learning methods concerning data structure/dimensionality, symmetry adaption, and confidence management. In this paper, we present a kernel-based pipeline that can learn and predict the atomization energy of molecules with high accuracy. The framework employs Gaussian process regression to perform predictions based on the similarity between molecules, which is computed using the marginalized graph kernel. We discuss why the graph kernel, paired with a graph representation of the molecules, is particularly useful for predicting extensive properties. We demonstrate that using an active learning procedure, the proposed method can achieve a mean absolute error less than 1.0 kcal/mol on the QM7 data set using as few as 1200 training samples and 1 hour of training time. This is a demonstration, in contrast to common believes, that regression models based on kernel methods can be simultaneously accurate and fast predictors.
10/16/2018 ∙ by Yu-Hang Tang, et al. ∙ 0 ∙ share
Yu-Hang Tangis this you? claim profile