1 Introduction
Solid catalysts form the backbone of the chemical industry and the hydrocarbon-based energy sector. The ability to convert solar energy to fuels and chemicals calls for new catalysts, as does the development of a sustainable chemical industry based on biomass and other non-fossil building blocks.^{1}
In efforts to accelerate the development of new materials, the chemical sciences have recently begun to adopt advanced machine learning methods, motivated by their success in other industries over the last 20 years.^{2, 3, 4, 5, 6}
Modern machine learning (ML) methods have the capacity to fit complex functions in high-dimensional feature spaces while selecting the most relevant descriptors automatically. Artificial neural networks^{7, 8, 4} (NN) and kernel ridge regression^{9} (KRR) have become particularly popular.^{2, 3, 10, 11, 12, 13} We will discuss applications of Gaussian Processes^{14} (GP) within this field^{15, 16, 17} and study the ability of a few different machine learning models for the specific application of catalysis informatics and surface science. We aim to establish systematic workflows for creating the best model given atomistic data relevant for surface science and catalysis. Furthermore, we aim to allow for an increasing level of autonomy in data-based empirical models in order to fully utilize their predictive power and build models that may not be obvious based on current chemical intuition.
We view machine learning model building for surface science as a three-stage process: 1) featurization of atomic structures, 2) predictive model training and systematic descriptor selection, and 3) deployment to advanced learning algorithms such as active learning. In the context of advanced learning algorithms, we discuss the importance of leveraging the uncertainty estimates within predictions from data-based models. Gaussian processes (GP) will be the primary regression model in this work due to the size of the data sets and due to the high performance of GPs in estimating uncertainties, which is a highly desirable functionality to leverage in active learning.
For the purpose of future machine learning model building, we present the CatLearn repository version 1.0.0,^{18} freely available under GNU Public License 3.0. Atomic structures are handled by importing the Atomic Simulation Environment (ASE).^{19} Furthermore, CatLearn has built-in compatibility with the CatKit package for graph-based atomic structure enumeration for surface science and catalysis.^{20}
The CatLearn code is set up in a modular fashion to be easily extensible. There are a number of modules that can build an optimal regression model in an automated manner. In the following, we will present the functionality and contents of CatLearn, and we will present an example of machine learning model building in the spirit of going beyond user-assumed intuition.
1.1 Regression
Regressors, by definition, predict continuous variables, in the present context typically properties which would require an electronic structure calculation to obtain. Current examples are adsorbate energies, formation energies, or potential energies.
As a first iteration in regression model building, it is often recommended to try the simplest and computationally lightest model, such as a multivariate linear model. In this work we apply LASSO^{21} or Ridge^{22} regularized linear regressors as benchmarks.^{23} Since they are very fast compared to kernel regression methods, we also test LASSO as a descriptor selector for more flexible models, including Gaussian Processes.
A Gaussian Process Regressor (GP) has been implemented in CatLearn with specific extra functionality allowing for derivative observations, such as forces, in the case where the target is the potential energy. The GP is presented in detail in the following.
1.1.1 Gaussian Processes Regression
The GP is written as in Equ. 1.

(1) f(x) ∼ GP(m(x), k(x, x′))

where the mean function m(x) is typically taken to be zero. The kernel trick is used to translate the input space into feature space with the covariance function k(x, x′). The kernel is applied to determine relationships between the descriptor vectors for candidates x and x′.

Name | Kernel
Linear | k(x, x′) = σ_{v}^{2} (x − c)·(x′ − c)
Squared Exponential | k(x, x′) = σ^{2} exp(−‖x − x′‖^{2} / (2ℓ^{2}))
For the linear kernel, a constant offset c is defined, determining the x-coordinate that all lines in the posterior must go through. There is also an additional scaling factor σ_{v}, giving the prior on the height of the function at 0. The squared exponential kernel is commonly utilized as the mapping function for continuous features, where ‖x − x′‖ is the Euclidean L^{2}-norm and ℓ defines the kernel length scale. The length scale can be defined in a single dimension for the entire n-dimensional feature space or for each of the n dimensions of the feature space. A scaling factor σ is applied, typically set at approximately the standard deviation of the descriptor. It is beneficial to optimize the scaling factor and a length scale in each feature dimension of the fingerprint vector, allowing for the anisotropic form of the kernel. This also adds automatic relevance determination capability to the model, since less relevant features are deactivated in the high length scale limit.
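As an illustration, the anisotropic (ARD) squared exponential kernel can be written in a few lines of NumPy. This is a minimal sketch, not the CatLearn implementation; the function name and signature are our own.

```python
import numpy as np

def squared_exponential(X1, X2, length_scales, scale=1.0):
    """Anisotropic (ARD) squared exponential kernel.

    length_scales holds one value per feature dimension; a large
    length scale effectively deactivates that feature.
    """
    # Scale each dimension by its length scale, then take pairwise
    # squared Euclidean distances between all rows of X1 and X2.
    A = X1 / length_scales
    B = X2 / length_scales
    sq_dist = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    return scale**2 * np.exp(-0.5 * sq_dist)
```

With all length scales equal, this reduces to the isotropic kernel in the table above.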
Kernels can furthermore be combined as in Equ. 2.

(2) k(x, x′) = k_{1}(x, x′) + k_{2}(x, x′)
In the following, we refer to X as the matrix accumulation of all the feature vectors for candidates in a training or test (denoted X and X_{*}) dataset. K(X, X) is the covariance, or Gram, matrix for the noisy target values y. The conditional distribution is given by the mean and covariance as in Equ. 3 and 4.

(3) μ_{*} = K(X_{*}, X) [K(X, X) + σ_{n}^{2}I]^{−1} y

(4) Σ_{*} = K(X_{*}, X_{*}) − K(X_{*}, X) [K(X, X) + σ_{n}^{2}I]^{−1} K(X, X_{*})

(5) σ^{2}(x_{*}) = σ_{n}^{2} + k(x_{*}, x_{*}) − k_{*}^{T} [K(X, X) + σ_{n}^{2}I]^{−1} k_{*}

The covariance vectors between new data points and the training dataset are given by k_{*} = K(x_{*}, X), while k_{**} = k(x_{*}, x_{*}). The predicted uncertainty is then given by Equ. 5. The first term applies the predicted noise to the uncertainty, with σ_{n}^{2} being the optimized regularization strength for the training data.
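The predictive equations above can be sketched in a few lines of NumPy. This is a minimal illustration, not the CatLearn implementation; a Cholesky factorization stands in for the explicit matrix inverse for numerical stability.

```python
import numpy as np

def gp_predict(K, K_star, k_star_star, y, noise):
    """Posterior mean and variance of a zero-mean GP (Equs. 3-5).

    K: n x n training covariance; K_star: m x n covariance between
    test and training points; k_star_star: m-vector of prior
    variances k(x*, x*); noise: regularization strength sigma_n^2.
    """
    Ky = K + noise * np.eye(len(y))
    L = np.linalg.cholesky(Ky)              # stable inversion of Ky
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = K_star @ alpha                   # Equ. 3
    v = np.linalg.solve(L, K_star.T)
    var = noise + k_star_star - np.sum(v**2, axis=0)  # Equs. 4-5
    return mean, var
```

At a training point with near-zero noise, the posterior mean reproduces the training target and the variance collapses towards zero, as expected.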
Optimization of all hyperparameters is performed through maximizing the log marginal likelihood, as in Equ. 6.

(6) log p(y|X, θ) = −(1/2) y^{T} K_{y}^{−1} y − (1/2) log|K_{y}| − (n/2) log 2π

where K_{y} = K(X, X) + σ_{n}^{2}I.
To take advantage of a gradient-based optimizer, the partial derivative with respect to each hyperparameter θ_{j} can be calculated as in Equ. 7.

(7) ∂ log p(y|X, θ)/∂θ_{j} = (1/2) tr[(αα^{T} − K_{y}^{−1}) ∂K_{y}/∂θ_{j}]

where α = K_{y}^{−1} y.^{14} In the following parts of this paper, we will present several applications of the Gaussian Process.
2 Active Learning
In the context of machine learning, active learning is used to efficiently explore large spaces of unlabeled data. The algorithm relies on a regression or classification method to cheaply predict the target label of unseen data. The active learning algorithm can acquire a target label for an unseen data point by evaluating the target function, which is often expensive. A decision about which data point to label is made by a specialized decision-making function, also referred to as the acquisition function, which accepts the predictions from the regression model as input. The algorithm acquires data with a high probability of satisfying a testing criterion. Essentially, active learning algorithms turn search problems into efficient optimization problems, thereby tremendously reducing the computational cost by labeling as few examples as possible.
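The structure of such a loop can be sketched as follows. This is a deliberately simplified, self-contained illustration: a 1-nearest-neighbour surrogate and a distance-based acquisition score stand in for the GP and its uncertainty estimate, and all names are our own.

```python
import numpy as np

def active_learning(candidates, target_fn, n_queries=5):
    """Illustrative active-learning skeleton (not CatLearn's API).

    A cheap surrogate scores unseen candidates and an acquisition
    function picks which expensive evaluation to run next.  Here the
    acquisition score is the distance to the closest labelled point
    (pure exploration).
    """
    labelled_x = [candidates[0]]
    labelled_y = [target_fn(candidates[0])]   # expensive evaluation
    for _ in range(n_queries - 1):
        # Acquisition: uncertainty proxy = distance to nearest label.
        scores = [min(abs(c - x) for x in labelled_x) for c in candidates]
        best = int(np.argmax(scores))
        labelled_x.append(candidates[best])
        labelled_y.append(target_fn(candidates[best]))
    return labelled_x, labelled_y
```

A real implementation would replace the surrogate with a trained GP and the score with an acquisition function built on the predicted mean and variance.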
The CatLearn code includes several active learning algorithms to efficiently collect data of atomic structures related to catalysis, targeting efficient exploration of nanoparticle structures,^{25} transition state search,^{26} and atomic structure optimization;^{27} the latter will be introduced in the following.
2.1 Atomic structure optimization using active learning
Here, we present MLMin, an active learning optimizer to perform energy minimization of atomic structures.
Similarly to GPMin,^{27} the predictive capabilities of our algorithm rely on training and testing a Gaussian Process (GP), making it a Bayesian optimizer.
MLMin is built to interact with ASE^{19} in a seamless manner, preserving all the properties of the ASE inputs and outputs included in the Atoms objects and trajectory files (e.g. constraints and the initial information of the systems).
Our GP model is built with the Cartesian coordinates of the atoms (x = [x_{1}, …, x_{N}]), their corresponding energies (y = [e_{1}, …, e_{N}]), and the first derivative observations of the energy with respect to the positions (∂y/∂x = [∂e_{1}/∂x_{1}, …, ∂e_{N}/∂x_{N}]). The predicted function is a priori defined as the Gaussian process seen in equation 1. For the purpose of using the GP for atomic structure optimization, we chose a constant prior of the form m(x) = c and the squared exponential kernel (SE).
When incorporating first derivative observations into the GP, the covariance matrix takes the block form

K̃(X, X) = [ K(X, X)         ∂K(X, X)/∂X′
             ∂K(X, X)/∂X     ∂^{2}K(X, X)/∂X∂X′ ]
The mean of the process is obtained as

(8) μ(x_{*}) = m(x_{*}) + K(x_{*}, X) K̃^{−1} (ỹ − m(X))

and the variance for new data (x_{*}) from the training data (X) is obtained as

(9) σ^{2}(x_{*}) = K(x_{*}, x_{*}) − K(x_{*}, X) K̃^{−1} K(X, x_{*})
where K̃ = K + σ_{n}^{2}I, with σ_{n}^{2} being the regularization parameter, and ỹ is a vector that combines the energies and first derivative observations as ỹ = [y, ∂y/∂x]. Our algorithm utilizes the variance to prevent the evaluation of structures that are far from the previous datapoints in the training set. For this purpose, we allow the user to set a maximum step criterion based on the uncertainty of the GP. The geometry optimization in the predicted landscape is stopped if the maximum uncertainty threshold is reached (a default threshold is set in the code). This early stopping criterion helps to avoid the evaluation of unrealistic structures, for instance, when atoms of the model are too close to each other.
Note that fitting a GP with first derivative observations requires building and inverting an N(D+1) × N(D+1) positive definite matrix, where D is the number of dimensions and N is the number of training datapoints. In order to alleviate the growth of the covariance matrix with the increasing number of datapoints and dimensions, we sparsify the covariance matrix by excluding the coordinates and forces of the atoms that are not relaxed. This operation has no impact on the accuracy of the GP predictions, since it eliminates rows and columns with zero values and therefore no correlation.
We perform a constrained optimization of the hyperparameters each time a point is added to the training set, fixing the σ parameter and optimizing the length scale (ℓ) in a common dimension. The regularization term (σ_{n}^{2}) that is added to the diagonal of the covariance matrix is separately optimized for the elements involving the kernel function and the first derivative terms of the kernel. The boundaries set for the hyperparameter optimization are:
(10) 
(11) 
(12) 
We found these parameters to perform well for atomistic systems; however, CatLearn’s regression module allows the user to decide the boundaries of the hyperparameters.
A comparison between a gradient-based minimizer and our MLMin algorithm is shown in Fig. 2. The behaviour of these two methods is illustrated using a series of snapshots taken at different stages of the optimization of a single particle in a two-dimensional potential (Müller-Brown).^{28} The optimization is converged when the maximum force on the particle is below 0.01 eV/Å. In this example we used the BFGS algorithm (as implemented in ASE)^{19} to represent the behaviour of a gradient-based minimizer (Figs. 2a-c). This algorithm relies on calculating the atomic forces at every iteration, which is usually problematic when evaluating the objective function is expensive (e.g. when performing ab initio calculations). The crosses in Figs. 2a-c show the positions where a function call has been made in the Müller-Brown potential while minimizing the energy of the particle. On the one hand, ASE-BFGS required a total of 22 function evaluations (Fig. 2c) to satisfy the convergence criterion. The same initial guess was used to test the performance of the FIRE algorithm,^{29} which required 53 function calls to reach the convergence threshold. On the other hand, the machine learning accelerated algorithm (MLMin) converged to the same minimum performing only 7 function evaluations (Figs. 2d-g). In order to achieve this acceleration, MLMin operates as follows: (1) A single-point calculation is performed at the position of the initial guess (cross in Fig. 2d) and a GP is trained with the Cartesian coordinates of the structure and their corresponding force and energy values. The resulting predicted potential after training the GP with only one training point is shown in Fig. 2d. (2) A local optimizer (such as BFGS or FIRE) is used to optimize the structure in the GP predicted potential (see red circles in Fig. 2d). (3) An acquisition function selects the tested point with the highest score.
By default, the highest score point for MLMin is the tested structure with the lowest predicted energy (marked with a white circle in Fig. 2d). (4) The structure with the highest score is evaluated and its position, forces and energy values are appended to the training dataset. (5) The GP model is retrained with the new dataset, obtaining a more accurate predicted potential. (6) The algorithm performs steps (2) to (5) until the forces on the evaluated structure satisfy the convergence criterion. Figs. 2d-g show the evolution of the predicted GP potential after (d) one, (e) two, (f) five and (g) seven function evaluations. Note that the model does not accurately describe the PES when the GP is trained with only a few datapoints (Figs. 2d-e), but the accuracy of the predictions substantially improves after 5 function evaluations. The GP trained with these 5 datapoints yields a predicted PES that closely resembles the actual PES (compare Fig. 2c and Fig. 2f).
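The six steps above can be sketched as a loop. The following toy one-dimensional version substitutes a quadratic fit for the GP surrogate and a closed-form minimizer for the local optimizer, so the function names and simplifications here are illustrative only, not the MLMin code.

```python
import numpy as np

def mlmin_like(energy, force, x0, fmax=0.01, max_calls=20):
    """Toy 1-D sketch of the MLMin loop (steps 1-6).

    A surrogate (here a quadratic fit instead of a GP) is trained on
    all evaluated points, minimised cheaply, and the predicted
    minimum is then evaluated with the expensive calculator.
    """
    xs, es = [x0], [energy(x0)]                  # (1) first single point
    for _ in range(max_calls):
        if abs(force(xs[-1])) < fmax:            # (6) convergence check
            return xs[-1], len(xs)
        if len(xs) < 3:                          # need 3 points for a quadratic
            xs.append(xs[-1] + 0.1)
        else:
            a, b, _ = np.polyfit(xs, es, 2)      # (5) retrain surrogate
            xs.append(-b / (2 * a))              # (2-4) minimise + acquire
        es.append(energy(xs[-1]))                # expensive evaluation
    return xs[-1], len(xs)
```

On a quadratic test potential the loop locates the minimum in four energy evaluations, since the surrogate becomes exact once three points are available.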
We performed a DFT benchmark to test the performance of our machine learning algorithm when optimizing atomic structures using the periodic planewave DFT code VASP.^{30}
The performance of our algorithm (MLMin) is tested on six different atomic systems including gas phase, bulk, surfaces, interfaces and nanostructure calculations.
In particular, the optimization of a pentene molecule in the gas phase, a Cu(110) bulk, a Cu(100) slab, the adsorption of C on Cu(100), the adsorption of CO on Au(111), and a AgCu nanoparticle.
Our benchmark reports the number of function evaluations (force calls) required to optimize each system.
The MLMin results are contrasted to the ones obtained from BFGS,^{31} FIRE^{29} and GPMin^{27} algorithms (as implemented in ASE).^{19}
The convergence threshold for this benchmark is reached when the forces of the relaxed atoms are below 0.01 eV/Å.
To obtain a statistical analysis of the performance of the different algorithms, we initialize the optimizations with 10 different initial geometries for each system. To do so, the ground-state geometry of each system is randomly perturbed with noise of 0.10 standard deviation amplitude to generate the initial guesses. We assume 10 samples is enough to give a fair comparison of the different algorithms.
The results of the benchmark calculations in Figure 3 show that the MLMin algorithm is superior in performance to the gradient-based algorithms. The MLMin algorithm presents the best "worst performance" values for each system studied here (compare upper whiskers of each boxplot in Figure 3). In terms of stability, the MLMin and BFGS algorithms were able to converge all the structures in fewer than 200 force calls, which is the maximum number of force calls imposed in this benchmark. In contrast, FIRE and GPMin presented issues converging the pentane system, where only 9 out of 10 structures were converged (see number of white circles in Figure 3). In this case, FIRE reached the maximum number of steps, whilst the GPMin algorithm failed to predict a lower energy than the last step and stopped.
The major success of the ML-assisted algorithms relies on building models that can locally resemble the actual PES using just a few observations. However, the predictions might not be accurate at the first stages of the surrogate model, i.e. when only a few geometries with their corresponding energies and forces are used to train the model. In particular, this can be more pronounced when dealing with the optimization of structures with many degrees of freedom, e.g. the optimization of the pentane molecule (Figure 3), which contains 18 relaxed atoms and thus 54 degrees of freedom. This can drive the surrogate towards evaluating unrealistic structures, which can cause the force calculator to fail, making the algorithm unable to gather more training data. To the best of our knowledge, we are the first to utilize both predictions and uncertainty estimates from the GP model to drive the optimization process towards convergence in fewer function evaluations than other Bayesian and non-Bayesian local atomic structure optimizers.

3 General Descriptors for Atomistic Data
Applications of machine learning for screening in catalysis or for global structure exploration require featurization in order to create numeric representations from more general sets of atomic structures. Regression or classification models only accept a vector, called a fingerprint, as the representation of each observation or training example. Each element in a fingerprint represents a feature. The sets of features that are used as descriptors in machine learning algorithms have to be designed by domain experts, and they are an essential part of creating useful models. Atomic structure optimizers, such as MLNEB and MLMin, reshape the atomic coordinate array into fingerprint vectors where the coordinates of each atom are features, e.g. x = [x_{1}, y_{1}, z_{1}, …, x_{N}, y_{N}, z_{N}].
For general atomistic datasets, a wide variety of feature sets exist in the scientific literature for representing molecules, solids and adsorbate/slab structures.^{32} We have implemented some of the most relevant sets in CatLearn, either explicitly or through wrappers that depend on external repositories.^{33} In the following we briefly present some of the feature sets that can be returned from the CatLearn code.
Machine learning data sets are typically represented as an N by M array, i.e. a list of N fingerprints, each containing M features.^{34} CatLearn is therefore built to transform a list of ASE Atoms objects into the full data array, as shown in the Python snippet below. A list of functions defining the features must also be passed.
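The pattern can be illustrated with a self-contained sketch; `return_features` and the toy dictionary "structures" below are stand-ins for CatLearn's feature generator and ASE Atoms objects, not the actual API.

```python
import numpy as np

def return_features(structures, feature_functions):
    """Stack per-structure feature lists into the N x M data array.

    Each feature function maps one structure to a list of numbers;
    the rows are the concatenated outputs for each structure.
    """
    rows = []
    for atoms in structures:
        row = []
        for func in feature_functions:
            row.extend(func(atoms))
        rows.append(row)
    return np.array(rows)

# Toy "structures" standing in for ASE Atoms objects.
structures = [{"Z": [29, 29, 8]}, {"Z": [79, 6]}]
feature_functions = [
    lambda a: [len(a["Z"])],                 # number of atoms
    lambda a: [sum(a["Z"]) / len(a["Z"])],   # mean atomic number
]
X = return_features(structures, feature_functions)
```

The resulting array has one row per structure and one column per feature, matching the N by M layout described above.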
The user can easily extend the feature set with builtin or userdefined features by appending corresponding functions to feature_functions, (See CatLearn^{18} tutorial 2).
The full list and documentation for currently implemented feature functions will be kept updated at https://catlearn.readthedocs.io/en/latest/catlearn.fingerprint.html#. In the following we present an overview of the features that are currently implemented.
3.0.1 Chemical formula parsing
String parsing and chemical formula parsing features are implemented, allowing for the production of features where chemical formulas have been recorded for atomic structures or relevant subsets thereof, such as a binding site or an adsorbate. Chemical formulas of atomic models are almost always recorded for atomic-scale modeling data sets, and they are relevant for legacy data on adsorption energies.^{35}
The chemical formula string featurizers fold the compositions with periodic table properties, adapted from the Mendeleev Python library.^{36} The features are thus averages of elemental properties, weighted by the composition, or they are sums, minimums, maximums, or medians of elemental properties from the composition. As an example, a summed property fingerprint is defined in equation 13.
(13) F_{P} = Σ_{i} P(Z_{i})

where Z_{i} is the atomic number of atom i, and P is a property such as the electronegativity of atom i. The number of outer electrons is one example of an atomic property feature, which has previously been shown to be useful for prediction of the d-band center of bimetals^{37} and for prediction of adsorption energies across bimetallic surfaces, keeping the adsorbate and site geometry constant.^{38}
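Equation 13 reduces to a short function. The property table below is a tiny hand-made excerpt of Pauling electronegativities for illustration, not the Mendeleev library used by CatLearn.

```python
# Pauling electronegativities for a few elements, keyed by atomic
# number (illustrative excerpt only).
ELECTRONEGATIVITY = {1: 2.20, 6: 2.55, 8: 3.44, 29: 1.90}

def summed_property(atomic_numbers, table):
    """Equation 13: sum an elemental property over the composition."""
    return sum(table[z] for z in atomic_numbers)

def mean_property(atomic_numbers, table):
    """Composition-weighted average of the same property."""
    return summed_property(atomic_numbers, table) / len(atomic_numbers)
```

For water ([1, 1, 8]), the summed electronegativity is 2.20 + 2.20 + 3.44 = 7.84.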
Categorical features, such as an atom belonging to a block in the periodic table, are encoded by dummy variables, e.g. for a transition metal the [s, p, d, f] indicator vector is

[0, 0, 1, 0]

where a feature value of 1 designates the category, and all other elements in the vector are 0, corresponding to the other possible categories.
An advantage of chemical formula based features is that they can be obtained without performing expensive electronic structure calculations, and making predictions based on such data is thus cheap.
As we will see below, a more detailed level of information about atomic structures can be described using connectivity information, such as neighborlists, and can therefore be represented using graphbased features. Such features are highly valuable, because they represent atomistic systems as graphs that are distinct from each other.
3.0.2 Graphbased features
For screening or global structure searches, a set of atomistic graphs can form a discrete and finite search space.^{20} Graph features are those based on the graph representation of an atomic structure. Such feature sets avoid reliance on the corresponding explicit 3D structures, which are often unknown until performing a structure optimization using full electronic structure calculations.
One of the implemented fingerprints comes from the autocorrelation functions,^{39, 40} which are a form of property convolution over the graph. The autocorrelation function is defined in equation 14.

(14) P_{d} = Σ_{i} Σ_{j} p_{i} p_{j} δ(d_{ij}, d)

where a convolution of the atomic property p is returned for all pairs of atoms (i, j) whose connectivity distance d_{ij} is d. In this way, a number of descriptor parameters equal to the number of neighbor shells, up to a user-defined maximum, is returned.
These types of feature sets provide very general representations of the atomic system, applicable for e.g. formation energies of molecules, nanoparticles, or bulk systems.
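Equation 14 can be implemented directly from an adjacency matrix; the sketch below (our own minimal version, not CatLearn's) computes connectivity distances by breadth-first search and then accumulates the property products shell by shell.

```python
import numpy as np
from collections import deque

def autocorrelation(adjacency, props, d_max):
    """Graph autocorrelation (equation 14): sum of p_i * p_j over all
    ordered pairs whose shortest-path distance equals d, for
    d = 0 .. d_max."""
    n = len(props)
    # All-pairs shortest-path distances by BFS on the unweighted graph.
    dist = np.full((n, n), -1, dtype=int)
    for s in range(n):
        dist[s, s] = 0
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in range(n):
                if adjacency[u][v] and dist[s, v] < 0:
                    dist[s, v] = dist[s, u] + 1
                    queue.append(v)
    return [
        sum(props[i] * props[j]
            for i in range(n) for j in range(n) if dist[i, j] == d)
        for d in range(d_max + 1)
    ]
```

For a three-atom chain with properties [1, 2, 3], the zeroth shell is the sum of squares (14), the first shell sums both orientations of each bond, and the second shell picks up only the end-to-end pair.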
Yet, such electronic properties of atomic structures, which we typically want to predict, are functions of their 3D representation. To predict those properties, a mapping between graph and 3D structure must therefore be made. A graph can be made into a 3D structure unambiguously if the structure can be optimized to a local minimum unique to the graph.
For bulk systems, Voronoi tessellation has become a popular method for identifying neighbors.^{33} In structures that contain vacuum, such as molecules and slabs, a naive implementation of Voronoi tessellation does not always work, and one usually has to rely on cutoff radii, i.e. model the atoms as hard spheres that have to overlap for a connection to exist.
3.0.3 Atomic subsets
We include a set of features characterizing adsorbate systems, which is more specific to the local environment around an adsorbate and the adsorption site, but which is automatically derived from the connectivity matrix of the atomic structure. In many cases, only a particular subset of the atomic structure is relevant for the property. Adsorption energies are an example of a property that typically depends just on a local environment around the adsorption site and on the adsorbate geometry.^{41, 42} We have implemented feature sets based on the following subsets of slab/adsorbate structures:

A) Atom(s) in the adsorbate.
B) Surface atoms (atoms not in group A).
C) Surface atom(s) bonded to adsorbate atom(s) (the binding site).
D) Slab atoms neighbouring the site atoms (first neighbor shell).
E) Surface termination atoms (outermost atomic layer).
F) Subsurface atoms (atoms which are part of group B, but not group E).
G) Adsorbate atom(s) bonded to surface atoms.
The subsets are also visualized in Fig. 4.
The elemental compositions of the groups are parsed using chemical formula parsing, as explained above. An example of these feature functions is the average properties of subset C, which corresponds to work by Li et al., who took the average electronegativity of the site atoms as a descriptor for reactivity without explicitly calculated electronic structure information.^{12}
An additional set of features derived from properties of atomic subsets are the nominal geometric features, which rely on tabulated atomic radii to define average atomic or electronic density in plane and in volume. From these, one can also derive the nominal strain, which we specifically implemented as Equation 15.

(15) strain = (r̄_{E} − r̄_{F}) / r̄_{F}

where the bar denotes averages across atoms in the subset, r_{F} is the atomic radius of atoms in the bulk (subset F) and r_{E} is the atomic radius of atoms in the slab termination (subset E).
Some features can also be derived from counting atoms in the subsets. As an example, the count of group C defines the type of adsorption site, where 1 corresponds to top, 2 corresponds to bridge, and so on. Counting the elements in group D gives the coordination number, cn, of the binding site, which identifies the type of facet and has previously been shown to scale linearly with the reactivity of binding sites on metals.^{43}
Summing properties over the edges forms another type of fingerprint. In this category we have implemented the squared difference in Pauling electronegativity, summed over the edges of several atomic subsets, as defined in equation 16.

(16) F_{χ} = Σ_{(i,j) ∈ edges} (χ_{i} − χ_{j})^{2}

where χ_{i} is the electronegativity of the atom with index i. F_{χ} can be interpreted as a measure of formation energy according to the original definition.^{44}
Another type of feature set can be obtained by bagging a subset. We define bagging by equation 17.

(17) b_{k} = Σ_{i} δ(p_{i}, k)

where b is a vector with element index k, i is the index of each atom in the subset and δ is the delta function comparing a discrete property p_{i} with the index k. The term bag originates from a bag of words, which is used in spam classification.^{45, 32} As an example, if p_{i} is the atomic number, Z_{i}, of atom i, then a bag of elements fingerprint vector results, where the occurrence of each element in the subset is counted.
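The bag-of-elements case can be sketched as follows; `element_index`, which fixes the column ordering, is our own illustrative parameter.

```python
from collections import Counter

def bag_of_elements(atomic_numbers, element_index):
    """Equation 17 with p_i = Z_i: count the occurrences of each
    element, in a fixed column order given by element_index."""
    counts = Counter(atomic_numbers)
    return [counts.get(z, 0) for z in element_index]
```

For a Cu2OH3 composition with columns ordered [H, O, Cu], the bag is [3, 1, 2].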
3.0.4 Bag of edges
Subsets of edges in the atomistic graphs can further be derived from atomic subsets. The currently implemented subsets of edges are:

- Connections within the adsorbate atoms (A-A).
- Connections between adsorbate atoms and site atoms.
- First shell connections in the surface.
- All connections made in formation of the adsorbate: (A-A) and (A-E).
Bagging C-D yields the structure-sensitive scaling descriptors,^{46} which we here refer to as a bag of coordination numbers, since they are a vector of the unique counts of coordination numbers of atoms neighboring the adsorption site, i.e. in equation 17, p_{i} = cn_{i}.
Averaging the coordination numbers weighted with respect to the bag of coordination numbers element yields the generalized coordination number,^{47, 43} as written in equation 18.

(18) c̄n = (1 / cn_{max}) Σ_{j} cn_{j}

where the sum runs over the neighbors j of the site and cn_{max} is the maximum allowed coordination number, e.g. for single atoms in FCC, cn_{max} = 12.
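Equation 18 reduces to a one-line function; the example below reproduces the known value of 7.5 for a top site on an FCC(111) surface, whose atom has six in-plane neighbors (cn = 9) and three subsurface neighbors (cn = 12).

```python
def generalized_cn(neighbor_cns, cn_max=12):
    """Equation 18: generalized coordination number of a site, given
    the conventional coordination numbers of its neighbours.
    cn_max = 12 for atoms in an FCC bulk."""
    return sum(neighbor_cns) / cn_max
```

Unlike the plain coordination number, this weights each neighbor by how coordinated it is itself, which distinguishes sites that have the same neighbor count but different local environments.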
3.1 Feature Engineering
Once a set of features has been generated, it is possible to use feature engineering to expand the feature set, encoding certain relationships between features explicitly. This can be achieved through combinatorial operations on any or all of the feature sets, e.g. by taking products of all features with one another. This is done to explicitly expand the feature space into regions that may not typically be considered; a number of functions are employed to combinatorially increase the feature space under consideration. Pairwise feature engineering is performed with the expressions in Equ. 19.
(19) x_{ij} = x_{i} · x_{j},  x_{ij} = x_{i} / x_{j}  (i ≠ j)

Pairwise operations are applied only when i ≠ j, so as not to simply scale the features, as such scaling would be reversed when the feature space is rescaled. There are many ways in which the feature space could be expanded. However, with the resulting combinatorial explosion in the number of features used to train a model, it is typically important that feature engineering be employed along with feature elimination or dimensionality reduction methods. Inclusion of all features could be detrimental in two significant ways. Firstly, with the potential to have very many features compared to the number of data points, there is a risk of overfitting. Secondly, as the number of features increases, the cost of training the model also increases. The routines utilized for optimizing the feature space are discussed in the Preprocessing section.
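One of the pairwise operations, the product of feature columns, can be sketched as follows (a minimal illustration with our own function name, not CatLearn's engineering module):

```python
import numpy as np
from itertools import combinations

def pairwise_products(X):
    """Expand an N x M feature array with all pairwise products
    x_i * x_j for i < j, appended as new columns."""
    pairs = list(combinations(range(X.shape[1]), 2))
    new_cols = [X[:, i] * X[:, j] for i, j in pairs]
    return np.column_stack([X] + new_cols)
```

Starting from M columns this yields M + M(M−1)/2 columns, which illustrates how quickly the feature count grows and why elimination or dimensionality reduction is needed afterwards.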
4 Preprocessing
4.1 Data cleaning
The regression models only accept real numbers as input. Meanwhile, data sets generated by our fingerprinters may contain missing values or columns with zero variance and thus zero information. Descriptors which have zero variance always need to be removed. Removal is also a simple course of action for features containing infinite or missing values, but our clean finite function furthermore has an optional parameter to impute up to a fraction of values in a column, as an alternative to removing it.
Highly skewed features may also cause issues in some learning algorithms, and a function is therefore implemented to remove columns with skew higher than a user defined threshold.
4.2 Data scaling
Features are naturally on a variety of different numerical scales, but they need to be put on the same scale to be treated equally by learning algorithms.
Several methods of rescaling are implemented in CatLearn. Features with identical mean and variance can be obtained by standardization, as defined in Equation 20.

(20) x′_{j} = (x_{j} − μ_{j}) / σ_{j}

where x_{j} is the original feature column with index j, μ_{j} is the mean of that feature, σ_{j} is the standard deviation of that feature and x′_{j} is the resulting rescaled feature. Note that standardization does not work well for highly skewed features or features with extreme values, because the resulting standardized feature may contain extremely large magnitude minimum or maximum values. Figure 5 shows the distributions of standardized features.
It is possible to perform feature scaling in one of two modes, either locally or globally. When scaling locally, the training data is scaled, and the mean and standard deviation generated from that training data are then applied to the test data. When scaling globally, the training and testing data are concatenated and the scaling metrics are calculated on the entire data set. When standardizing the data, it is better to scale globally; however, this is not possible when the entire test space is not known prior to scaling. Other methods have been implemented, such as normalization, min-max and unit-length scaling, which have not been used in the following studies but have various advantages and disadvantages.^{48}
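The "local" scaling mode described above can be sketched as follows (a minimal version of Equation 20 with our own function name):

```python
import numpy as np

def standardize(train, test=None):
    """Equation 20, 'local' mode: statistics come from the training
    data only and are then applied unchanged to the test data."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std[std == 0] = 1.0            # guard against zero-variance columns
    scaled_train = (train - mean) / std
    scaled_test = None if test is None else (test - mean) / std
    return scaled_train, scaled_test
```

Reusing the training statistics on the test set is what prevents information from the test targets leaking into the model through the scaling step.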
5 Feature Selection
It is possible to generate a feature set based on an intuitive understanding of the underlying properties of the target. However, there are likely features that are not as prominently linked to the target property but could still be beneficial to include. Flexible regression models will often be able to locate relationships between features that user intuition of the chemical system may not suggest. It is, therefore, necessary to consider how best to allow for all beneficial features to be included in the feature space. With the automatic feature generators and possibly additional user defined features, it is easy for a user to come up with a very large number of potentially descriptive features, allowing for consideration of all available information about an atomistic data set.
There are three main problems with the approach of just including everything imaginable in the feature set. 1) As the number of features increases, the computational cost of training the model will also grow. 2) If features are correlated with each other, many local solutions to optimizing parameters or hyperparameters may exist, leading to a probability of training a suboptimal model. This issue may in some cases be alleviated by a principal component analysis (PCA) transform, which is further discussed below. 3) It is also possible that features are introduced that better describe noise than the actual target, leading to an overfit.
When training a model with limited numbers of observations and very complex underlying physics, such as in the case of atomistic data, it is more difficult to ascertain an optimal (or even just reasonable) feature set. The feature optimization routines described below can potentially aid with this problem.
5.0.1 Principal component analysis
Instead of simply eliminating features that are unlikely to add any appreciable knowledge to the model, it is also possible to utilize feature extraction methods for dimensionality reduction. We utilize principal component analysis (PCA) and partial least squares (PLS) for feature extraction. PCA identifies the most meaningful basis in which to modify the feature space with the aim of reducing noise and revealing hidden structure within the data. A selection of the largest principal components are taken as descriptors and given to the model. This procedure reduces the dimensionality of the problem but retains information from all of the original features.
PCA (and feature extraction in general) is advantageous when the total number of features must be reduced but feature elimination methods cannot identify further descriptors to remove completely. Specifically, with PCA, the principal components under consideration will be independent of one another, ensuring that each descriptor adds new information to the model. However, this comes at the expense of model interpretation: once PCA has been performed on the features, it is no longer possible to extract important physical properties from the model.
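A minimal sketch of PCA-based feature extraction (here using scikit-learn rather than any specific CatLearn routine; the synthetic data and the variance threshold are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))  # 100 structures, 20 raw features
# Make the last 10 features near-duplicates of the first 10.
X[:, 10:] = X[:, :10] + 0.01 * rng.normal(size=(100, 10))

# Standardize first: PCA is sensitive to feature scale.
X_std = StandardScaler().fit_transform(X)

# Keep as many principal components as needed to explain 99% of the variance.
pca = PCA(n_components=0.99)
X_reduced = pca.fit_transform(X_std)
print(X_reduced.shape)
```

Because the correlated duplicates collapse onto shared components, the reduced set is roughly half the original dimensionality, while each retained descriptor is independent of the others.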
5.0.2 Elimination Analysis
To gain an understanding of the impact of eliminating one or more features, assessment may be performed using Sammon’s stress. This error is a measure of how eliminating features from the data set changes the Euclidean distance between data points. This measure is defined as in Equation 21.
(21)  $S = \frac{1}{\sum_{i<j} d_{ij}^{*}} \sum_{i<j} \frac{\left(d_{ij}^{*} - d_{ij}\right)^{2}}{d_{ij}^{*}}$
The distance between data points $i$ and $j$ in the original feature space is given by $d_{ij}^{*}$, while $d_{ij}$ gives the distance in the reduced set. A small stress means that the reduced feature set maintains the distances between data points well. An elimination method based on Sammon's stress is therefore not suitable for removing features that add noise, but it may be suitable for removing redundant features.
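As a sketch, Sammon's stress (Equation 21) can be computed directly from the pairwise Euclidean distances; the function and the synthetic data here are illustrative, not the exact CatLearn routine:

```python
import numpy as np
from scipy.spatial.distance import pdist

def sammons_stress(X_full, X_reduced):
    """Sammon's stress between an original and a reduced feature space.

    Small values mean the reduced feature set preserves pairwise
    Euclidean distances between data points well.
    """
    d_full = pdist(X_full)    # distances d*_ij in the original space
    d_red = pdist(X_reduced)  # distances d_ij in the reduced space
    mask = d_full > 0         # guard against duplicate data points
    return np.sum((d_full[mask] - d_red[mask]) ** 2 / d_full[mask]) \
        / np.sum(d_full[mask])

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 10))

# Dropping a near-constant feature barely changes the distances ...
X_noise = np.hstack([X, 1e-6 * rng.normal(size=(50, 1))])
print(sammons_stress(X_noise, X))

# ... while dropping half of the informative features does.
print(sammons_stress(X, X[:, :5]))
```

As the example shows, the stress flags the removal of informative features but stays near zero for redundant ones, consistent with the discussion above.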
5.0.3 Correlation Elimination
Sure independence screening ^{49} (SIS) is one method by which to efficiently reduce the largest feature spaces, since it offers linear time complexity with respect to the number of features. Correlation-based elimination routines can be too simplistic, as they rely on linear or rank correlation, which misses the intricacies of the nonlinear relationships adopted by nonlinear models such as the GP. Furthermore, they do not capture the usefulness of combinations of features, e.g. if the sum of several features is linearly correlated with the target. However, since expensive regression models can be avoided, these are at times the only feasible elimination methods. The SIS score is calculated as in Equation 22.
(22)  $\omega_{j} = \frac{\mathrm{cov}\left(\mathbf{x}_{j}, \mathbf{y}\right)}{\sigma_{\mathbf{x}_{j}}\,\sigma_{\mathbf{y}}}$
The resulting $\omega_{j}$ accounts for the Pearson correlation coefficient between feature $j$ and the targets. These coefficients are sorted in order of decreasing magnitude, and the model is derived from the descriptors with the largest correlation. Robust rank correlation screening ^{50} (RRCS) has also been utilized, with either Kendall or Spearman's rank correlation coefficient calculated to provide the resulting correlation between features and targets. The Kendall correlation is calculated as in Equation 23.
(23)  $\tau = \frac{n_{c} - n_{d}}{\frac{1}{2}\,n\left(n-1\right)}$
The number of concordant pairs, $n_{c}$, counts pairs of observations $(i, j)$ for which $x_{i} > x_{j}$ when $y_{i} > y_{j}$, or $x_{i} < x_{j}$ when $y_{i} < y_{j}$. The number of discordant pairs, $n_{d}$, counts pairs for which the two orderings are reversed. Spearman's rank correlation is the Pearson correlation between the ranked variables.
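A minimal sketch of the correlation screening described above, supporting the Pearson, Kendall and Spearman coefficients (the function name and synthetic data are illustrative):

```python
import numpy as np
from scipy.stats import kendalltau, spearmanr

def screen(X, y, k, method="pearson"):
    """Rank features by |correlation| with the target and keep the top k."""
    scores = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        if method == "pearson":
            scores[j] = np.corrcoef(X[:, j], y)[0, 1]
        elif method == "kendall":
            scores[j] = kendalltau(X[:, j], y)[0]
        else:
            scores[j] = spearmanr(X[:, j], y)[0]
    return np.argsort(-np.abs(scores))[:k]

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 50))
# Only features 3 and 7 carry signal.
y = 3.0 * X[:, 7] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=200)

print(sorted(screen(X, y, k=2).tolist()))  # recovers features 3 and 7
```

Note that such marginal screening would miss a feature that is only useful in combination with others, exactly the limitation discussed above.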
From the ordered coefficients, it is possible to reduce the dimensions of the feature matrix to the number of data points. However, an iterative method can reduce the size of the feature space in a more robust manner. The general principle for iterative screening is that correlation between features should be accounted for. If features $x_{1}$, $x_{2}$ and $x_{3}$ correlate well with the target values, but $x_{1}$ and $x_{2}$ correlate with one another, it would likely be more beneficial to include only features $x_{1}$ and $x_{3}$, as $x_{2}$ would provide little additional information that is not already included in $x_{1}$. The residual correlation is calculated in an iterative manner based on the accepted features, as in Equation 24.
(24)  $\omega_{j}^{\mathrm{res}} = \mathrm{corr}\left(\mathbf{x}_{j},\, \mathbf{y} - \hat{\mathbf{y}}_{A}\right)$, where $\hat{\mathbf{y}}_{A}$ is the least-squares fit of the targets on the set $A$ of already-accepted features.
The above screening methods rely on a backward elimination procedure. Features that are expected to provide the least information to the model are removed. However, once the number of features has been reduced to the number of data points using either (iterative) SIS or RRCS, or if there were fewer features than data points initially, it is possible to use more expensive elimination methods.
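One plausible implementation of the iterative residual screening, in which candidate features are scored against the residual of a least-squares fit on the already-accepted features (a sketch under that reading of Equation 24, not the exact CatLearn routine):

```python
import numpy as np

def iterative_sis(X, y, k):
    """Greedy forward screening on residual correlation.

    At each step the target is replaced by its residual after a
    least-squares fit on the accepted features, so that redundant
    (inter-correlated) candidates score poorly.
    """
    accepted = []
    residual = y - y.mean()
    for _ in range(k):
        scores = np.abs([np.corrcoef(X[:, j], residual)[0, 1]
                         for j in range(X.shape[1])])
        scores[accepted] = -np.inf  # never re-select a feature
        accepted.append(int(np.argmax(scores)))
        A = X[:, accepted]
        coef, *_ = np.linalg.lstsq(A, y - y.mean(), rcond=None)
        residual = y - y.mean() - A @ coef
    return accepted

rng = np.random.default_rng(3)
x1 = rng.normal(size=300)
x2 = x1 + 0.01 * rng.normal(size=300)  # nearly a copy of x1
x3 = rng.normal(size=300)
X = np.column_stack([x1, x2, x3])
y = 2.0 * x1 + x3

# Plain correlation ranking would score the redundant pair (x1, x2)
# equally highly; the iterative variant keeps only one of them plus x3.
print(sorted(iterative_sis(X, y, k=2)))
```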
5.0.4 Linear Coefficient Elimination
LASSO^{21, 23} is a regularization method which trains a linear regressor using the cost function $L$ in Equation 25.
(25)  $L = \sum_{i}\left(y_{i} - \boldsymbol{\omega}^{\top}\mathbf{x}_{i}\right)^{2} + \lambda \sum_{j}\left|\omega_{j}\right|$
where $y_{i}$ are the training targets collected in the vector $\mathbf{y}$, $\omega_{j}$ are the coefficients contained in the coefficient vector $\boldsymbol{\omega}$, $\mathbf{x}_{i}$ are the training descriptor vectors, and $\lambda$ is the regularization strength.
This cost function can penalize coefficient magnitudes all the way to zero, unlike ridge regression, where the coefficients are squared in the penalization term and therefore shrink without vanishing. Features are considered eliminated from the model when their corresponding coefficients are exactly zero. LASSO regularization is applicable only to models that are linear in their parameters, including polynomial regression; it cannot be applied to highly flexible models such as deep nets or kernel regression, and using it for feature selection may cause a loss of nonlinear correlations from the data.
LASSO offers an advantage over correlation screening methods such as SIS, since LASSO optimizes the model with respect to all coefficients simultaneously, thus theoretically finding a globally optimal subset of features given the data.
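A sketch of LASSO-based descriptor selection with a cross-validated regularization strength, using scikit-learn on synthetic data (only features 0 and 5 carry signal here):

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 30))
y = 2.0 * X[:, 0] - 1.5 * X[:, 5] + 0.1 * rng.normal(size=200)

# Cross-validated choice of the regularization strength lambda
# (called alpha in scikit-learn).
lasso = LassoCV(cv=3).fit(X, y)

# Features with exactly-zero coefficients are considered eliminated.
selected = np.flatnonzero(lasso.coef_)
print(selected)
```

The informative features survive with non-zero coefficients, while most of the pure-noise features are driven exactly to zero by the $\ell_1$ penalty.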
5.0.5 Greedy Elimination
Greedy feature elimination is largely dependent on the regression routine it is coupled with. It is more costly than the correlation-based methods, but it can potentially account for nonlinear relationships.
With greedy selection, the feature set is iterated through, leaving out one feature vector at a time. The feature whose removal results in either the greatest improvement in prediction accuracy, or the least degradation in accuracy, is eliminated. This can be coupled with cross-validation to improve robustness and help prevent overfitting. However, it is potentially necessary to train a very large number of models. The greedy strategy is employed using ridge regression and a GP to optimize feature importance.
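A sketch of backward greedy elimination with ridge regression and 3-fold cross-validation (illustrative; any regressor with a score, including a GP, could be substituted):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

def greedy_eliminate(X, y, n_keep, model=None):
    """Repeatedly drop the feature whose removal hurts the
    cross-validated score the least (or improves it the most)."""
    if model is None:
        model = Ridge(alpha=1.0)
    keep = list(range(X.shape[1]))
    while len(keep) > n_keep:
        best_score, best_drop = -np.inf, None
        for j in keep:
            trial = [f for f in keep if f != j]
            score = cross_val_score(model, X[:, trial], y, cv=3).mean()
            if score > best_score:
                best_score, best_drop = score, j
        keep.remove(best_drop)  # one model per candidate, per round
    return keep

rng = np.random.default_rng(5)
X = rng.normal(size=(150, 8))
y = X[:, 1] + 0.5 * X[:, 6] + 0.05 * rng.normal(size=150)

print(sorted(greedy_eliminate(X, y, n_keep=2)))  # keeps features 1 and 6
```

The nested loop makes the cost explicit: eliminating down to $k$ features from $p$ requires on the order of $p^2/2$ model fits, which is why the cheaper screening methods above remain useful for large feature spaces.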
5.0.6 Sensitivity Elimination
Sensitivity elimination can achieve a good balance between accuracy and cost. A model is trained, and the sensitivity of the model to a perturbation of each feature is then measured. The feature whose perturbation results in either an improvement in accuracy or the smallest increase in error is likely one that can be eliminated.
There are several methods for perturbing the selected feature: assigning random values to each data point, making the feature invariant (constant), or shuffling the order of the data points for the selected feature.
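A sketch of the shuffle-perturbation variant: a single model is trained, then each feature is permuted in turn and the increase in error recorded (synthetic data; a ridge model stands in for the GP):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(6)
X = rng.normal(size=(300, 10))
y = 4.0 * X[:, 2] + 0.1 * rng.normal(size=300)  # only feature 2 matters

model = Ridge(alpha=1e-3).fit(X, y)
base_mae = mean_absolute_error(y, model.predict(X))

# Shuffle one feature at a time; features whose perturbation barely
# increases the error are candidates for elimination.
sensitivity = np.empty(10)
for j in range(10):
    X_pert = X.copy()
    X_pert[:, j] = rng.permutation(X_pert[:, j])
    sensitivity[j] = mean_absolute_error(y, model.predict(X_pert)) - base_mae

most_sensitive = int(np.argmax(sensitivity))
least_sensitive = int(np.argmin(sensitivity))
print(most_sensitive)  # the dominant feature
```

Only one model fit is needed per elimination round, which is the source of the favorable cost compared to greedy elimination.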
5.0.7 Genetic Algorithm
A genetic algorithm (GA) for feature elimination has also been implemented within CatLearn. This allows a global search to be performed with the aim of finding the optimal set of features. It should typically be more robust than the greedy elimination algorithm, since features can re-enter the subset after they have been eliminated. Further, it is possible to perform multi-variable optimization, locating a Pareto-optimal set of solutions; this can optimize, e.g., the cost function and the training time simultaneously.
Feature elimination GAs may be implemented with a variety of trial or permutation steps and constraints, leading to more or less explorative and more or less costly algorithms, depending on the data.
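A toy bit-mask GA for feature subset search, with truncation selection, uniform crossover and bit-flip mutation (a minimal sketch; CatLearn's implementation will differ in its trial steps and constraints):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
X = rng.normal(size=(120, 12))
y = X[:, 0] - X[:, 4] + 0.05 * rng.normal(size=120)  # features 0 and 4 matter

def fitness(mask):
    """Cross-validated score of a ridge model on the masked feature set."""
    if not mask.any():
        return -np.inf
    return cross_val_score(Ridge(), X[:, mask], y, cv=3).mean()

pop = rng.random((16, 12)) < 0.5          # random initial bit masks
for generation in range(20):
    scores = np.array([fitness(ind) for ind in pop])
    parents = pop[np.argsort(scores)[-8:]]  # truncation: keep best half
    children = []
    for _ in range(8):
        a, b = parents[rng.integers(8, size=2)]
        child = np.where(rng.random(12) < 0.5, a, b)  # uniform crossover
        flip = rng.random(12) < 0.1                   # bit-flip mutation:
        children.append(child ^ flip)                 # features can re-enter
    pop = np.vstack([parents, children])

best = pop[np.argmax([fitness(ind) for ind in pop])]
print(np.flatnonzero(best))
```

The mutation step is what lets eliminated features re-enter the subset, the property highlighted above as an advantage over greedy elimination.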
6 Benchmarking
In the following studies we compare the prediction scores of regularized linear models with those of a GP relying on several different descriptor selection pipelines, in order to obtain the most accurate model for one data set. The data represent atomic structures of adsorbates on bimetallic FCC (111) and HCP (0001) facets. The fingerprints rely on connectivity information only and include the following feature sets generated by CatLearn: averages and sums of elemental properties over the adsorbate atoms; averages, medians, minima and maxima of elemental properties over subsets of the surface atoms; and, additionally, the generalized coordination number, a bag of atoms, a bag of edges in the adsorbate, and the sum of pairwise Pauling electronegativity differences of edges in various subsets of the adsorbate and site atoms, as presented previously in the section on graph-based fingerprints. CatLearn provides built-in functions to return these fingerprints.
The target feature is the adsorbate energy with reference to an empty slab and gas phase references, as specified by Equation 26.
(26)  $\Delta E = E_{\mathrm{slab}+n\mathrm{H}+m\mathrm{C}+k\mathrm{O}} - E_{\mathrm{slab}} - m\,E_{\mathrm{CH_4}} - k\,E_{\mathrm{H_2O}} - \tfrac{1}{2}\left(n - 4m - 2k\right)E_{\mathrm{H_2}}$
where $E_{\mathrm{slab}+n\mathrm{H}+m\mathrm{C}+k\mathrm{O}}$ is the energy of the slab with an adsorbate containing $n$, $m$ and $k$ H, C and O atoms, respectively, $E_{\mathrm{slab}}$ is the energy of the empty slab, and $E_{\mathrm{H_2}}$, $E_{\mathrm{H_2O}}$ and $E_{\mathrm{CH_4}}$ are the energies of the gas-phase references H$_2$, H$_2$O and CH$_4$, respectively.
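Assuming the referencing scheme reconstructed in Equation 26 (CH$_4$ supplies carbon, H$_2$O supplies oxygen, and H$_2$ balances the remaining hydrogen), the target could be computed as follows; the function name and the energies are illustrative:

```python
def reference_energy(e_slab_ads, e_slab, n_h, n_c, n_o, e_h2, e_h2o, e_ch4):
    """Adsorption energy referenced to the empty slab and gas-phase
    H2, H2O and CH4, following the reconstruction of Equation 26.

    CH4 supplies carbon, H2O supplies oxygen, and H2 balances the
    remaining hydrogen: (n_h - 4*n_c - 2*n_o) / 2 molecules of H2.
    """
    n_h2 = 0.5 * (n_h - 4 * n_c - 2 * n_o)
    return (e_slab_ads - e_slab
            - n_c * e_ch4 - n_o * e_h2o - n_h2 * e_h2)

# Example: an OH adsorbate (1 H, 0 C, 1 O).
# n_h2 = 0.5 * (1 - 0 - 2) = -0.5, i.e. half an H2 is released.
dE = reference_energy(-250.0, -246.0, n_h=1, n_c=0, n_o=1,
                      e_h2=-6.0, e_h2o=-14.0, e_ch4=-24.0)
print(dE)  # -4.0 + 14.0 - 3.0 = 7.0
```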
Linear correlations between individual features and the target will be discussed first for the original, unfiltered features and compared with the coefficients from the regularized linear models.
Subsequently, test scores for linear and GP regression models using each of the feature sets selected by the various algorithms will be presented. Finally, correlation analyses will be discussed for the best feature subset and compared to the full set.
6.1 Linear Model Benchmark
Here we present the results from regularized (LASSO and ridge) linear regression models, which include automatic relevance determination for the descriptors. The regularization strengths of both the ridge and LASSO models were determined by 3-fold cross-validation.
The linear models serve as a benchmark for more flexible regression models, while the nonzero LASSO coefficients may also be a fast method for revealing relevant descriptors for other nonlinear models.
Observe first Figure 6, which shows the absolute Pearson correlation, $|\rho|$, between each of the features and the target. No single feature is highly correlated with the target, demonstrating the need for a multidimensional or a nonlinear model for this data.
It is worth noting that if we constrain ourselves to a linear model with one feature, we would choose the feature with the maximum $|\rho|$, which is the sum of the number of outer electrons over the adsorbate atoms. Summing over the number of outer electrons is intuitive, since the target is a formation energy (see Equation 26), and the sum of electrons thus contributes a linear signal.
Comparing ridge- and LASSO-regularized models, we observe that all ridge coefficients are non-zero, while LASSO promotes sparse feature sets and thus sets some coefficients to zero, resulting in a set of 222 descriptors. Features with a non-zero coefficient are highlighted in Figure 6. It is noteworthy that the individual feature correlations seem to have little relevance for the size of their corresponding coefficients in the optimal regularized multivariate model. This in itself suggests that useful information is expected to be lost by the fast correlation-based feature elimination algorithm, SIS (see the previous section on feature selection).
6.2 Results
Figure 7 shows a comparison of the final generalization scores across the different feature selection algorithms tested. We observe very similar scores across the pipelines with a GP regression model, whereas the linear models turn out to be biased.
Filtering out features makes the model sparse, and thereby faster to optimize, but did not improve the predictions.
The GP with an automatic relevance determination (ARD) kernel obtained the best generalization score, with improvements of 50 meV on the MAE and 60 meV on the RMSE. It is, however, quite sensitive to hitting local minima in the log marginal likelihood surface when many inter-correlated features are present in the data, as seen in Fig. 8 (left). In practice this increases the cost dramatically, because numerous runs with different starting guesses, or the use of minima hopping, become necessary.
6.3 Correlation Analysis
Correlation analyses are presented in Figure 8 for the original unfiltered feature set and for the feature subset chosen by the sensitivity elimination algorithm (See the section on descriptor selection methods).
The sensitivity elimination algorithm has selected a feature set with fewer intercorrelated features, as well as obtaining a decent prediction score (See Figure 7). A few high pairwise correlations still remain in the descriptor set.
6.4 Feature expansions
Hypothetically, new correlations in the data that are beyond the intuition of the user may be revealed by expanding the original features, as presented in the section Feature Engineering.
Feature expansions were performed from the original 295 features using elementwise and combinatorial transforms, resulting in a data set with 44548 features. A GP is too expensive, and regularized linear models are expected to be ineffective, for such a large feature set given the training set of around 2000 observations. Instead, SIS was applied as a fast pre-screening filter to select a subset of size equivalent to the number of training observations.
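The expansion-plus-prescreening pipeline can be sketched as follows (the specific transforms used here, squares and pairwise products, are illustrative; SIS keeps the top-$k$ features by absolute Pearson correlation):

```python
import numpy as np
from itertools import combinations

def expand(X):
    """Combinatorial expansion: originals, squares, and pairwise products."""
    cols = [X, X ** 2]
    cols += [X[:, [i]] * X[:, [j]]
             for i, j in combinations(range(X.shape[1]), 2)]
    return np.hstack(cols)

def sis_prescreen(X, y, k):
    """Keep the k features with the largest |Pearson correlation| to y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc)
    denom[denom == 0] = np.inf            # constant columns score zero
    scores = np.abs(Xc.T @ yc) / denom
    return np.argsort(-scores)[:k]

rng = np.random.default_rng(8)
X = rng.normal(size=(200, 10))
y = X[:, 0] * X[:, 1] + 0.05 * rng.normal(size=200)  # purely pairwise signal

X_big = expand(X)                  # 10 -> 10 + 10 + 45 = 65 features
keep = sis_prescreen(X_big, y, k=5)
print(keep[0])                     # the x0*x1 product column ranks first
```

In this toy case the expansion succeeds because the hidden product feature becomes linearly correlated with the target; the benchmark below shows that this does not automatically carry over to the real data set.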
The SIS pre-screening was done with linear correlation (Pearson), monotonic-function rank correlation (Spearman), and ordinal association rank correlation (Kendall). Thereafter, the data were fitted by a LASSO-regularized linear model and a ridge regression linear model, where the regularization strengths, $\lambda$, were determined by 3-fold cross-validation. A GP with an anisotropic squared exponential kernel was also tested.
The resulting benchmark tests are shown in Figure 9. First of all, the combinatorial feature expansions followed by correlation elimination fail to surpass the linear models and the Gaussian process models on the original feature set, for which the scores are shown in Figure 7. Furthermore, we now observe that the GPR scores have more scatter than before: the RMSE is now worse than that of the linear models, while the MAE has degraded from around 0.14 eV to 0.18 eV.
The SIS pre-screening loses a lot of important information, which results in degraded accuracy of the subsequent models. This is clear from observing that some active features in the ridge and LASSO models on the original data were sometimes individually uncorrelated with the target, which would cause those features to be eliminated by SIS. In summary, we cannot conclude that the combinatorial feature expansion approach followed by SIS pre-screening gives any benefit to predictive accuracy.
7 Conclusion
Machine learning model building for surface science and catalysis is a rapidly advancing field. The ability of probabilistic models such as Gaussian processes to estimate uncertainty accurately makes them an ideal choice for ad-hoc fitting and active learning on the potential energy surface. This has in 2018 materialized in highly efficient atomic structure optimizers, which are available in the presently presented code package.
For intermediate-sized general atomistic data sets of a few thousand adsorbate/surface structures represented by their graphs, Gaussian process regressors also perform well compared to linear models. A variety of automatic descriptor-selection pipelines have been implemented and compared, showing the GP with automatic relevance determination of features to be a highly competitive solution. GP regressors do not scale easily to very large data sets, due to the cubic scaling of GP training with the number of observations. Deep neural networks or other flexible and scalable models are therefore likely to take over for larger data sets.
Domain expertise and development of new descriptors continue to be a driver of optimal and efficient model building, while the beyond userassumed approach remains a frontier in atomistic data science.
This work was supported by the U.S. Department of Energy, Chemical Sciences, Geosciences, and Biosciences (CSGB) Division of the Office of Basic Energy Sciences, via Grant DEAC0276SF00515 to the SUNCAT Center for Interface Science and Catalysis.
References
 Nørskov et al. 2009 Nørskov, J. K.; Bligaard, T.; Rossmeisl, J.; Christensen, C. H. Towards the computational design of solid catalysts. Nature chemistry 2009, 1, 37–46.
 Rupp et al. 2012 Rupp, M.; Tkatchenko, A.; Müller, K.R.; von Lilienfeld, O. A. Fast and Accurate Modeling of Molecular Atomization Energies with Machine Learning. Phys. Rev. Lett. 2012, 108, 058301.
 Khorshidi and Peterson 2016 Khorshidi, A.; Peterson, A. A. Amp: A modular approach to machine learning in atomistic simulations. Computer Physics Communications 2016, 207, 310 – 324.

Schütt et al. 2017 Schütt, K. T.; Arbabzadah, F.; Chmiela, S.; Müller, K. R.; Tkatchenko, A. Quantum-chemical insights from deep tensor neural networks. Nat. Commun. 2017, 8, 13890.
 Medford et al. 2018 Medford, A. J.; Kunz, M. R.; Ewing, S. M.; Borders, T.; Fushimi, R. Extracting knowledge from data through catalysis informatics. ACS Catalysis 2018, 8, 7403–7429.
 Ramsundar et al. 2019 Ramsundar, B.; Eastman, P.; Leswing, K.; Walters, P.; Pande, V. Deep Learning for the Life Sciences; O’Reilly Media, 2019; https://www.amazon.com/DeepLearningLifeSciencesMicroscopy/dp/1492039837.
 Patterson 1998 Patterson, D. W. Artificial Neural Networks: Theory and Applications, 1st ed.; Prentice Hall PTR: Upper Saddle River, NJ, USA, 1998.
 Ma et al. 2015 Ma, X.; Li, Z.; Achenie, L. E.; Xin, H. Machine-learning-augmented chemisorption model for CO2 electroreduction catalyst screening. The Journal of Physical Chemistry Letters 2015, 6, 3528–3533.

Cristianini and ShaweTaylor 2000
Cristianini, N.; ShaweTaylor, J.
An Introduction to Support Vector Machines and Other Kernelbased Learning Methods
; Cambridge University Press, 2000; p 189.  Chiriki and Bulusu 2016 Chiriki, S.; Bulusu, S. S. Modeling of {DFT} quality neural network potential for sodium clusters: Application to melting of sodium clusters (Na20 to Na40). Chemical Physics Letters 2016, 652, 130 – 135.
 Kim et al. 2016 Kim, C.; Pilania, G.; Ramprasad, R. Machine Learning Assisted Predictions of Intrinsic Dielectric Breakdown Strength of ABX3 Perovskites. The Journal of Physical Chemistry C 2016, 120, 14575–14580.
 Li et al. 2017 Li, Z.; Ma, X.; Xin, H. Feature engineering of machinelearning chemisorption models for catalyst design. Catalysis Today 2017, 280, Part 2, 232 – 238.
 Peterson et al. 2017 Peterson, A. A.; Christensen, R.; Khorshidi, A. Addressing uncertainty in atomistic machine learning. Physical Chemistry Chemical Physics 2017, 19, 10978–10985.
 Rasmussen and Williams 2006 Rasmussen, C. E.; Williams, C. K. I. Gaussian processes for machine learning; MIT Press, 2006; p 248.
 Ulissi et al. 2016 Ulissi, Z. W.; Singh, A. R.; Tsai, C.; Nørskov, J. K. Automated Discovery and Construction of Surface Phase Diagrams Using Machine Learning. The Journal of Physical Chemistry Letters 2016, 7, 3931–3935.
 Ulissi et al. 2017 Ulissi, Z. W.; Medford, A. J.; Bligaard, T.; Nørskov, J. K. To address surface reaction network complexity using scaling relations machine learning and DFT calculations. Nat. Commun. 2017, 8, 14621.
 Ueno et al. 2016 Ueno, T.; Rhone, T. D.; Hou, Z.; Mizoguchi, T.; Tsuda, K. COMBO: an efficient Bayesian optimization library for materials science. Materials discovery 2016, 4, 18–21.
 CatLearn. https://github.com/SUNCATCenter/CatLearn.
 Larsen et al. 2017 Larsen, A. et al. The Atomic Simulation Environment: A Python library for working with atoms. Journal of Physics: Condensed Matter 2017,
 Boes et al. 2019 Boes, J. R.; Mamun, O.; Winther, K.; Bligaard, T. Graph theory approach to highthroughput surface adsorption structure generation. The Journal of Physical Chemistry A 2019,
 Tibshirani 1996 Tibshirani, R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological) 1996, 267–288.
 Hoerl and Kennard 1970 Hoerl, A. E.; Kennard, R. W. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 1970, 12, 55–67.
 Hesterberg et al. 2008 Hesterberg, T.; Choi, N. H.; Meier, L.; Fraley, C. Least angle and ℓ1 penalized regression: A review. Statistics Surveys 2008, 2, 61–93.
 Girard et al. 2003 Girard, A.; Rasmussen, C. E.; Candela, J. Q.; MurraySmith, R. Gaussian Process Priors with Uncertain Inputs  Application to MultipleStep Ahead Time Series Forecasting. Advances in Neural Information Processing Systems 15. 2003; pp 529–536.
 Lysgaard et al. 2018 Lysgaard, S.; Jennings, P. C.; Hummelshøj, J. S.; Bligaard, T.; Vegge, T. Machine Learning Accelerated Genetic Algorithms for Computational Materials Search. 2018,
 Torres et al. 2018 Torres, J. A. G.; Jennings, P. C.; Hansen, M. H.; Boes, J. R.; Bligaard, T. Low-Scaling Algorithm for Nudged Elastic Band Calculations Using a Surrogate Machine Learning Model. arXiv preprint arXiv:1811.08022 2018,
 del Río et al. 2018 del Río, E. G.; Mortensen, J. J.; Jacobsen, K. W. A local Bayesian optimizer for atomic structures. arXiv preprint arXiv:1808.08588 2018,
 Müller and Brown 1979 Müller, K.; Brown, L. D. Location of saddle points and minimum energy paths by a constrained simplex optimization procedure. Theoretica chimica acta 1979, 53, 75–93.
 Bitzek et al. 2006 Bitzek, E.; Koskinen, P.; Gähler, F.; Moseler, M.; Gumbsch, P. Structural relaxation made simple. Physical review letters 2006, 97, 170201.
 Hafner 2008 Hafner, J. Ab-initio simulations of materials using VASP: Density-functional theory and beyond. Journal of Computational Chemistry 2008, 29, 2044–2078.
 Press et al. 1989 Press, W. H.; Flannery, B. P.; Teukolsky, S. A.; Vetterling, W. T. Numerical recipes; Cambridge university press Cambridge, 1989.
 Hansen et al. 2015 Hansen, K.; Biegler, F.; Ramakrishnan, R.; Pronobis, W.; Von Lilienfeld, O. A.; Müller, K.R.; Tkatchenko, A. Machine learning predictions of molecular properties: Accurate manybody potentials and nonlocality in chemical space. The journal of physical chemistry letters 2015, 6, 2326–2331.
 Ward et al. 2017 Ward, L.; Liu, R.; Krishna, A.; Hegde, V. I.; Agrawal, A.; Choudhary, A.; Wolverton, C. Including crystal structure attributes in machine learning models of formation energies via Voronoi tessellations. Physical Review B 2017, 96, 024104.
 Pedregosa et al. 2011 Pedregosa, F. et al. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 2011, 12, 2825–2830.
 Hummelshøj et al. 2012 Hummelshøj, J. S.; AbildPedersen, F.; Studt, F.; Bligaard, T.; Nørskov, J. K. CatApp: a web application for surface chemistry and heterogeneous catalysis. Angewandte Chemie 2012, 124, 278–280.
 Mentel 2014– Mentel, L. Mendeleev – A Python resource for properties of chemical elements, ions and isotopes, ver. 0.3.6. https://bitbucket.org/lukaszmentel/mendeleev, 2014–.
 Takigawa et al. 2016 Takigawa, I.; Shimizu, K.-i.; Tsuda, K.; Takakusagi, S. Machine-learning prediction of the d-band center for metals and bimetals. RSC Advances 2016, 6, 52587–52595.
 CalleVallejo et al. 2013 CalleVallejo, F.; Inoglu, N. G.; Su, H.Y.; Martínez, J. I.; Man, I. C.; Koper, M. T.; Kitchin, J. R.; Rossmeisl, J. Number of outer electrons as descriptor for adsorption processes on transition metals and their oxides. Chemical Science 2013, 4, 1245–1249.
 Virshup et al. 2013 Virshup, A. M.; Contreras-García, J.; Wipf, P.; Yang, W.; Beratan, D. N. Stochastic Voyages Into Uncharted Chemical Space Produce a Representative Library of All Possible Drug-Like Compounds. Journal of the American Chemical Society 2013, 135, 7296–7303.
 Janet and Kulik 2017 Janet, J. P.; Kulik, H. J. Resolving Transition Metal Chemical Space: Feature Selection for Machine Learning and Structure-Property Relationships. The Journal of Physical Chemistry A 2017, 121, 8939–8954.
 Hammer and Nørskov 2000 Hammer, B.; Nørskov, J. K. Advances in catalysis; Elsevier, 2000; Vol. 45; pp 71–129.
 Abild-Pedersen et al. 2007 Abild-Pedersen, F.; Greeley, J.; Studt, F.; Rossmeisl, J.; Munter, T.; Moses, P. G.; Skulason, E.; Bligaard, T.; Nørskov, J. K. Scaling properties of adsorption energies for hydrogen-containing molecules on transition-metal surfaces. Physical Review Letters 2007, 99, 016105.
 CalleVallejo et al. 2015 CalleVallejo, F.; Loffreda, D.; Koper, M. T.; Sautet, P. Introducing structural sensitivity into adsorption–energy scaling relations by means of coordination numbers. Nature chemistry 2015, 7, 403.
 Pauling 1932 Pauling, L. The nature of the chemical bond. IV. The energy of single bonds and the relative electronegativity of atoms. Journal of the American Chemical Society 1932, 54, 3570–3582.
 Forman 2003 Forman, G. An extensive empirical study of feature selection metrics for text classification. Journal of machine learning research 2003, 3, 1289–1305.
 Roling and Abild-Pedersen 2018 Roling, L. T.; Abild-Pedersen, F. Structure-Sensitive Scaling Relations: Adsorption Energies from Surface Site Stability. ChemCatChem 2018, 10, 1643–1650.
 CalleVallejo et al. 2014 CalleVallejo, F.; Martínez, J. I.; GarcíaLastra, J. M.; Sautet, P.; Loffreda, D. Fast prediction of adsorption properties for platinum nanocatalysts with generalized coordination numbers. Angewandte Chemie International Edition 2014, 53, 8316–8319.

Johor Bahru et al. 2013
Johor Bahru, U.; Darul Ta, J.; Bin Mohamad, I.; Usman, D. Standardization and Its Effects on KMeans Clustering Algorithm.
Res. J. Appl. Sci. Eng. Technol. 2013, 6, 3299–3303.  Fan and Lv 2008 Fan, J.; Lv, J. Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 2008, 70, 849–911.
 Li et al. 2012 Li, G.; Peng, H.; Zhang, J.; Zhu, L. Robust rank correlation based screening. Ann. Statist. 2012, 40, 1846–1877.