Structure-Property Maps with Kernel Principal Covariates Regression

02/12/2020 · Benjamin A. Helfrecht et al. (EPFL)

Data analysis based on linear methods, which look for correlations between the features describing samples in a data set, or between features and properties associated with the samples, constitutes the simplest, most robust, and most transparent approach to the automatic processing of large amounts of data for building supervised or unsupervised machine learning models. Principal covariates regression (PCovR) is an under-appreciated method that interpolates between principal component analysis and linear regression and can be used to conveniently reveal structure-property relations in terms of simple-to-interpret, low-dimensional maps. Here we provide a pedagogic overview of these data analysis schemes, including the use of the kernel trick to introduce an element of non-linearity while maintaining most of the convenience and simplicity of linear approaches. We then introduce a kernelized version of PCovR and a sparsified extension, followed by a feature-selection scheme based on the CUR matrix decomposition, modified to incorporate the same hybrid loss that underlies PCovR. We demonstrate the performance of these approaches in revealing and predicting structure-property relations in chemistry and materials science.


I Introduction

Over the past decade, there has been a tremendous increase in the use of data-driven and machine learning (ML) methods in materials science, ranging from the prediction of materials properties Faber et al. (2016, 2017); Hansen et al. (2013); Rupp et al. (2012), to the construction of interatomic potentials Deringer and Csányi (2017); Dragoni et al. (2018); Maillet, Denoual, and Csányi (2018); Szlachta, Bartók, and Csányi (2014) and searches for new candidate materials for a particular application Simon et al. (2015); Sendek et al. (2017); Kahle, Marcolongo, and Marzari (2020); Kirklin, Meredig, and Wolverton (2013). Broadly speaking, these methods can be divided into two categories: those that are focused on predicting the properties of new materials (supervised learning), and those that are focused on finding or recognizing patterns, particularly in atomic structures (unsupervised learning). While supervised methods are useful for predicting properties of materials with diverse atomic configurations, they are not as well suited for classifying structural diversity. Conversely, unsupervised methods are useful for finding structural patterns, but often cannot be used to directly predict materials properties. Moreover, it can be difficult to validate motifs identified by an unsupervised learning algorithm, since the results of the clustering depend on the choice of structural representation and can therefore be biased by preconceived expectations of which features should be most relevant Ceriotti (2019).

Methods that combine the predictive power of supervised ML and the pattern-recognition capabilities of unsupervised ML stand to be very useful in materials informatics, making it possible to increase data efficiency and to reveal structure-property relations more clearly. A number of statistical methods have been developed for augmenting regression models to incorporate information about the structure of the input data, including principal component regression Jolliffe (1982), partial least squares regression Wold, Sjöström, and Eriksson (2001), clusterwise regression Späth (1979), continuum regression Stone and Brooks (1990), and principal covariates regression (PCovR) de Jong and Kiers (1992); Vervloet et al. (2013, 2015, 2016). Among these, PCovR is particularly appealing, because it transparently combines linear regression (LR; a supervised learning method) with principal component analysis (PCA; an unsupervised learning method). The method has found previous applications in climate science Fischer (2014), macroeconomics Heij, Groenen, and van Dijk (2007), social science Vervloet et al. (2015), and bioinformatics Van Deun, Crompvoets, and Ceulemans (2018); Taylor et al. (2019), but has yet to be widely adopted. A handful of extensions have been developed for PCovR, including a combination with clusterwise regression Wilderjans et al. (2017) and regularized models Fischer (2014); Van Deun, Crompvoets, and Ceulemans (2018).

In this paper, we propose a kernel-based variation on the original PCovR method, which we call Kernel Principal Covariates Regression (KPCovR), with the aim of making it even more versatile for statistics and machine learning applications. We also extend an algorithm based on the CUR matrix decomposition Mahoney and Drineas (2009) that selects the most important features Imbalzano et al. (2018) to incorporate a PCovR-style leverage score; we call this algorithm PCov-CUR, and we demonstrate its superior performance in selecting features for supervised learning. We begin by summarizing the required background concepts and constituent methods used in the construction of linear PCovR, in addition to the kernel trick, which can be used to incorporate an element of non-linearity into otherwise linear methods. We then introduce KPCovR and PCov-CUR.

II Background Methods

We start by giving a concise but complete overview of established linear methods for dimensionality reduction and regression, as well as their kernelized counterparts. This is done to set a common notation and serve as a pedagogic introduction to the problem, complemented by the Jupyter notebooks provided in the SI. More experienced readers can skip this section and proceed to Section III, where we introduce kernelized PCovR methods. Throughout this discussion, we demonstrate the methods on 2500 environments from the CSD-1000r dataset Ceriotti et al. (2019), which contains the NMR chemical shieldings of nuclei in a collection of 1000 organic crystals and their 129,580 atomic environments. For clarity of exposition, and to use an easily interpretable example, we classify and predict simultaneously the chemical shieldings of all nuclei, even though in actual applications one would usually deal with one element at a time. As the input features, we use the SOAP power spectrum vectors, which discretize a three-body correlation function including information on each atom, its relationships with neighboring atoms, and the relationships between those neighboring atoms Bartók, Kondor, and Csányi (2013); Willatt, Musil, and Ceriotti (2019).

II.1 Statement of the Problem and Notation

We assume that the input data has been processed in such a way that the nature of each sample (e.g. composition and structure of a molecule) is encoded as a row of a feature matrix $\mathbf{X}$. Each sample is a vector of length $n_{\mathrm{features}}$, so that $\mathbf{X}$ has the shape $(n_{\mathrm{samples}} \times n_{\mathrm{features}})$. Similarly, the properties associated with each datum are stored in a property matrix $\mathbf{Y}$, which has the shape $(n_{\mathrm{samples}} \times n_{\mathrm{properties}})$. We denote the data in latent space (i.e., a low-dimensional approximation of $\mathbf{X}$) as $\mathbf{T}$. We denote each projection matrix from one space to another as $\mathbf{P}_{AB}$, where $A$ is the original space and $B$ is the projected space. As such, the projection matrix from the input space to the latent space is $\mathbf{P}_{XT}$ and vice versa: $\mathbf{P}_{TX}$ is the projection matrix from $\mathbf{T}$ to the input space. Note that in general the projection matrices are not assumed to be orthogonal nor full-rank.

To simplify notation, and to work with unitless quantities, we assume in our derivations that both $\mathbf{X}$ and $\mathbf{Y}$ are centered according to their respective column means and are normalized so that the variance of $\mathbf{X}$ and of each individual property in $\mathbf{Y}$ is one. A similar centering and scaling procedure is also applied when working with kernels Schölkopf, Smola, and Müller (1998). Centering is discussed in more detail in the Jupyter notebooks that are provided in the SI. To make notation less cumbersome, variable names are not defined uniquely across the entirety of the paper. We re-use variable names for common elements among the different subsections, for example using $\mathbf{T}$ to represent a low-dimensional latent space in all methods, but the precise definitions of the re-used variables differ between subsections and should not be confused with one another. We also use throughout a few additional conventions: (1) we write an approximation or truncation of a given matrix $\mathbf{A}$ as $\hat{\mathbf{A}}$; (2) we use $\tilde{\mathbf{A}}$ to signify an augmented version of $\mathbf{A}$, that is, $\tilde{\mathbf{A}}$ is defined differently from $\mathbf{A}$ but occupies the same conceptual niche; (3) we represent the eigendecomposition of a symmetric matrix $\mathbf{A}$ as $\mathbf{A} = \mathbf{U}_{\mathbf{A}} \mathbf{\Lambda}_{\mathbf{A}} \mathbf{U}_{\mathbf{A}}^T$, where $\mathbf{\Lambda}_{\mathbf{A}}$ is a diagonal matrix containing the eigenvalues, and $\mathbf{U}_{\mathbf{A}}$ the matrix having the corresponding eigenvectors as columns.

Figure 1: A schematic representation of the different linear operations that can be performed to model structure-property relations in terms of a matrix of features $\mathbf{X}$ that represents the input samples.

II.2 Linear Methods

We begin by discussing models of the form

$\mathbf{A} \approx \mathbf{X}\,\mathbf{P}_{XA},$ (1)

where $\mathbf{A}$ is a target quantity (e.g., a property $\mathbf{Y}$ that one wants to predict, or an alternative, lower-dimensional representation $\mathbf{T}$ of the feature matrix $\mathbf{X}$), and $\mathbf{P}_{XA}$ is a linear projection that maps the features to the target quantity Hastie, Tibshirani, and Friedman (2008); Bishop (2006). A graphical summary of the mappings that we consider in this paper is depicted in Figure 1.

Figure 2: Projection and Regression Models of CSD-1000r. Projections: (a) Principal Component Analysis (PCA) and Multidimensional Scaling (MDS), (c) Kernel PCA, and (e) Sparse Kernel PCA. In each projection, the property values are denoted by the marker color, and the marker symbol denotes the central atom of each environment, which corresponds to the cluster in the projection. Regressions: (b) Ridge Regression, (d) Kernel Ridge Regression (KRR), and (f) Sparse KRR, using an active set of 50 environments for the sparse methods. In each regression, the target is denoted by a dotted line, colors denote the absolute error of the predicted properties, and the inset reports the corresponding error measures.

II.2.1 Principal Component Analysis

In principal component analysis Pearson (1901); Hotelling (1933), the aim is to reduce the dimensionality of the feature matrix $\mathbf{X}$ by determining the orthogonal projection $\mathbf{T} = \mathbf{X}\mathbf{P}_{XT}$ which incurs minimal information loss. More formally, we wish to minimize the error of reconstructing $\mathbf{X}$ from the low-dimensional projection, $\ell_{\mathrm{PCA}} = \|\mathbf{X} - \mathbf{T}\mathbf{P}_{TX}\|^2$. The requirement that $\mathbf{P}_{XT}$ is orthonormal implies that $\mathbf{P}_{TX} = \mathbf{P}_{XT}^T$. Using the properties of the Frobenius norm, this loss function can be rewritten as

$\ell_{\mathrm{PCA}} = \operatorname{Tr}\!\left(\mathbf{X}^T\mathbf{X}\right) - \operatorname{Tr}\!\left(\mathbf{P}_{XT}^T\,\mathbf{X}^T\mathbf{X}\,\mathbf{P}_{XT}\right),$ (2)

which is minimized when the similarity

$s = \operatorname{Tr}\!\left(\mathbf{P}_{XT}^T\,\mathbf{X}^T\mathbf{X}\,\mathbf{P}_{XT}\right)$ (3)

is maximized. Given the orthogonality constraint on $\mathbf{P}_{XT}$, the similarity is maximized when $\mathbf{P}_{XT}$ corresponds to the eigenvectors of the covariance $\mathbf{C} = \mathbf{X}^T\mathbf{X}$ that are associated with the largest eigenvalues. We introduce the eigendecomposition $\mathbf{C} = \mathbf{U}_{\mathbf{C}}\mathbf{\Lambda}_{\mathbf{C}}\mathbf{U}_{\mathbf{C}}^T$, where $\mathbf{U}_{\mathbf{C}}$ is the matrix of the eigenvectors and $\mathbf{\Lambda}_{\mathbf{C}}$ the diagonal matrix of the eigenvalues, so that

$\mathbf{P}_{XT} = \hat{\mathbf{U}}_{\mathbf{C}}, \qquad \mathbf{T} = \mathbf{X}\hat{\mathbf{U}}_{\mathbf{C}},$ (4)

where we use the notation $\hat{\mathbf{U}}_{\mathbf{C}}$ to indicate the matrix containing only the top components. The outcomes of a two-dimensional PCA of the CSD-1000r dataset are shown in Fig. 2(a). The atomic environments split clearly according to the nature of the atom sitting at the centre of the environment, reflecting the prominence of this information in the SOAP features we use. Within each cluster, one can observe less clear-cut subdivisions and a rough correlation between position in latent space and value of the chemical shielding. However, given the fact that each nucleus has chemical shielding values that span very different intervals, the correlation between position in latent space and value of the shielding is rather poor.
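To make the construction concrete, the following is a minimal NumPy sketch of PCA as described above, diagonalizing the covariance and projecting onto its top eigenvectors [Eq. (4)]. The data, dimensions and variable names are synthetic and illustrative, not taken from the CSD-1000r example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))          # synthetic feature matrix
X -= X.mean(axis=0)                         # center the columns

C = X.T @ X                                 # covariance matrix
evals, evecs = np.linalg.eigh(C)            # eigenvalues in ascending order
order = np.argsort(evals)[::-1][:2]         # indices of the two largest eigenvalues
P_XT = evecs[:, order]                      # projection matrix P_XT, Eq. (4)
T = X @ P_XT                                # latent-space coordinates
X_rec = T @ P_XT.T                          # reconstruction from the latent space
print("PCA loss:", np.sum((X - X_rec) ** 2))
```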

II.2.2 Multidimensional Scaling

A reduction in the dimension of the feature space can also be achieved with a different logic that underlies several methods grouped under the label of multidimensional scaling (MDS) Torgerson (1952). In MDS, the projected feature space is chosen to preserve the pairwise distances of the original space, defining the loss

$\ell_{\mathrm{MDS}} = \sum_{ij}\left(\left|\mathbf{x}_i - \mathbf{x}_j\right| - \left|\mathbf{t}_i - \mathbf{t}_j\right|\right)^2,$ (5)

where $\mathbf{x}_i$ and $\mathbf{t}_i$ refer to the full and projected feature vector of the $i$-th sample. In general, Eq. (5) requires an iterative optimization. When the distance between features is the Euclidean distance, as in classical MDS, the link between the metric and the scalar product suggests minimizing the alternative loss

$\ell = \left\|\mathbf{X}\mathbf{X}^T - \mathbf{T}\mathbf{T}^T\right\|^2.$ (6)

Note that the solutions of Eqs. (5) and (6) concur only if one can find a solution that zeroes the loss. If the eigenvalue decomposition of the Gram matrix $\mathbf{K} = \mathbf{X}\mathbf{X}^T$ reads $\mathbf{K} = \mathbf{U}_{\mathbf{K}}\mathbf{\Lambda}_{\mathbf{K}}\mathbf{U}_{\mathbf{K}}^T$, Eq. (6) is minimized when $\mathbf{T}$ is given by the singular value decomposition of $\mathbf{X}$, that is by taking

$\mathbf{T} = \hat{\mathbf{U}}_{\mathbf{K}}\,\hat{\mathbf{\Lambda}}_{\mathbf{K}}^{1/2},$ (7)

restricted to the largest eigenvectors. However, $\mathbf{K}$ and $\mathbf{C}$ have the same (non-zero) eigenvalues, and the (normalized) eigenvectors are linked by $\mathbf{U}_{\mathbf{K}} = \mathbf{X}\mathbf{U}_{\mathbf{C}}\mathbf{\Lambda}_{\mathbf{C}}^{-1/2}$. Hence, one sees that $\mathbf{T} = \mathbf{X}\hat{\mathbf{U}}_{\mathbf{C}}$, which is consistent with Eq. (4). Thus classical MDS yields the same result as PCA in Fig. 2(a).
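The equivalence between classical MDS and PCA can be checked numerically with a short sketch on synthetic data (variable names are illustrative): the Gram-matrix construction of Eq. (7) reproduces the PCA scores up to the sign of each column.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
X -= X.mean(axis=0)

# classical MDS: T = U_K Lambda_K^(1/2), Eq. (7)
K = X @ X.T
kvals, kvecs = np.linalg.eigh(K)
top = np.argsort(kvals)[::-1][:2]
T_mds = kvecs[:, top] * np.sqrt(kvals[top])

# PCA: T = X U_C, Eq. (4)
C = X.T @ X
cvals, cvecs = np.linalg.eigh(C)
T_pca = X @ cvecs[:, np.argsort(cvals)[::-1][:2]]

# identical up to the sign of each column (difference should be ~0)
print(np.max(np.abs(np.abs(T_mds) - np.abs(T_pca))))
```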

II.2.3 Linear Regression

In linear regression, one aims to determine a set of weights $\mathbf{P}_{XY}$ that minimizes the error between the true properties $\mathbf{Y}$ and the properties predicted via $\hat{\mathbf{Y}} = \mathbf{X}\mathbf{P}_{XY}$. In the following, we consider the case of an $\ell_2$-regularized regression with regularization parameter $\lambda$, i.e., ridge regression Hastie, Tibshirani, and Friedman (2008). The loss to be minimized is

$\ell_{\mathrm{LR}} = \left\|\mathbf{Y} - \mathbf{X}\mathbf{P}_{XY}\right\|^2 + \lambda\left\|\mathbf{P}_{XY}\right\|^2.$ (8)

Minimizing the loss with respect to $\mathbf{P}_{XY}$ yields the solution $\mathbf{P}_{XY} = \left(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I}\right)^{-1}\mathbf{X}^T\mathbf{Y}$. If one chooses to perform the regression using the low-dimensional latent space $\mathbf{T}$ and approximate $\mathbf{Y}$ with $\hat{\mathbf{Y}} = \mathbf{T}\mathbf{P}_{TY}$, then $\mathbf{P}_{TY} = \left(\mathbf{T}^T\mathbf{T} + \lambda\mathbf{I}\right)^{-1}\mathbf{T}^T\mathbf{Y}$.

The ridge regression of the CSD-1000r dataset is shown in Fig. 2(b). Considering the small training set size, and the difficulty of simultaneously fitting the shieldings of different elements, the model achieves a very good accuracy, with an RMSE below 1 ppm.
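A minimal sketch of the ridge-regression solution of Eq. (8) on synthetic data; the regularization value and array sizes are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
Y = X @ rng.standard_normal((20, 1)) + 0.01 * rng.standard_normal((100, 1))
lam = 1e-3

# P_XY = (X^T X + lambda I)^(-1) X^T Y
P_XY = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y)
Y_pred = X @ P_XY
print("RMSE:", np.sqrt(np.mean((Y - Y_pred) ** 2)))
```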

II.2.4 CUR Decomposition

PCA makes it possible to identify the directions of maximal variance of a dataset. Finding a projection of a new point along those coordinates, however, requires computing all the features in the initial description of the problem. CUR decomposition Mahoney and Drineas (2009); Imbalzano et al. (2018), instead, attempts to find a low-rank approximation of the feature matrix that can be computed without having to evaluate every feature, or without using every point in the data set (usually the matrices that enter the decomposition are labelled C, U, and R, with reference to columns and rows of the initial matrix; here we use $\mathbf{X}_{\mathrm{c}}$, $\mathbf{X}_{\mathrm{u}}$, $\mathbf{X}_{\mathrm{r}}$ to avoid naming conflicts with other matrices used in the rest of the paper),

$\mathbf{X} \approx \hat{\mathbf{X}} = \mathbf{X}_{\mathrm{c}}\,\mathbf{X}_{\mathrm{u}}\,\mathbf{X}_{\mathrm{r}},$ (9)

where $\mathbf{X}_{\mathrm{c}}$ is a matrix composed of a subset of the columns of $\mathbf{X}$, $\mathbf{X}_{\mathrm{r}}$ a matrix composed of some of its rows, and $\mathbf{X}_{\mathrm{u}}$ is a rectangular matrix that yields the best approximation of $\mathbf{X}$ for given $\mathbf{X}_{\mathrm{c}}$ and $\mathbf{X}_{\mathrm{r}}$.

Many strategies have been proposed to choose the best columns and rows Mahoney and Drineas (2009). We summarize a variation on the theme that was introduced by Imbalzano et al. (2018), which is comparatively time-consuming, but deterministic and efficient in reducing the number of features needed to approximate $\mathbf{X}$. A schematic of this method is given in Fig. 3.

Figure 3: CUR Column Selection. (1) The importance score $\pi$ is computed for each column; the column corresponding to the largest $\pi$ is selected. (2) The matrix is orthogonalized with respect to the selected column, $\pi$ is recomputed, and the column corresponding to the largest $\pi$ is selected. (3) Step (2) is repeated until the desired number of columns is obtained.

First, one performs a singular value decomposition Golub and Reinsch (1970); Klema and Laub (1980) of $\mathbf{X} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^T$. The columns of $\mathbf{V}$ are the right principal components of $\mathbf{X}$, and coincide with the eigenvectors of the covariance $\mathbf{C} = \mathbf{X}^T\mathbf{X}$. The columns of $\mathbf{U}$ are the left principal components, and are equal to the eigenvectors of the Gram matrix $\mathbf{K} = \mathbf{X}\mathbf{X}^T$. $\mathbf{\Sigma}$ is a rectangular-diagonal matrix with the singular values, which equal the square roots of the non-zero eigenvalues of either $\mathbf{C}$ or $\mathbf{K}$. Next, we select the top $k$ right principal vectors (typically a small $k$ gives the best performance) and compute the leverage scores $\pi_j = \sum_{i \le k} V_{ji}^2$, corresponding to the relative importance of each column. Third, we select the feature for which $\pi_j$ is highest. Finally, we remove the corresponding column $\mathbf{x}_{\mathrm{c}}$ from $\mathbf{X}$, and orthogonalize each remaining column with respect to the column that has just been removed,

$\mathbf{x}_j \leftarrow \mathbf{x}_j - \mathbf{x}_{\mathrm{c}}\,\frac{\mathbf{x}_{\mathrm{c}}^T\mathbf{x}_j}{\mathbf{x}_{\mathrm{c}}^T\mathbf{x}_{\mathrm{c}}}.$ (10)

These four steps are iterated until a sufficient number of features has been selected. The matrix $\mathbf{X}_{\mathrm{c}}$ is then the subset of columns of $\mathbf{X}$ that have been chosen.

If CUR is used to simply select features, the row matrix $\mathbf{X}_{\mathrm{r}}$ is just the full $\mathbf{X}$, and one can determine $\mathbf{X}_{\mathrm{u}}$ by computing the pseudoinverse of $\mathbf{X}_{\mathrm{c}}$,

$\mathbf{X}_{\mathrm{u}} = \mathbf{X}_{\mathrm{c}}^{+}, \qquad \hat{\mathbf{X}} = \mathbf{X}_{\mathrm{c}}\,\mathbf{X}_{\mathrm{c}}^{+}\,\mathbf{X}.$ (11)

The approximate $\hat{\mathbf{X}}$ is a $(n_{\mathrm{samples}} \times n_{\mathrm{features}})$ matrix, but is low-rank. It is possible to obtain a reduced feature matrix $\tilde{\mathbf{X}}$ such that $\tilde{\mathbf{X}}\tilde{\mathbf{X}}^T = \hat{\mathbf{X}}\hat{\mathbf{X}}^T$, which can then be used in all circumstances where scalar products or distances in a reduced feature space are necessary. Writing explicitly the expression for $\hat{\mathbf{X}}\hat{\mathbf{X}}^T$, one sees that

$\tilde{\mathbf{X}} = \mathbf{X}_{\mathrm{c}}\left(\mathbf{X}_{\mathrm{c}}^{+}\,\mathbf{X}\mathbf{X}^T\left(\mathbf{X}_{\mathrm{c}}^{+}\right)^T\right)^{1/2}.$ (12)

The matrix $\left(\mathbf{X}_{\mathrm{c}}^{+}\,\mathbf{X}\mathbf{X}^T\left(\mathbf{X}_{\mathrm{c}}^{+}\right)^T\right)^{1/2}$ can be computed once and re-used every time one needs to compute $\tilde{\mathbf{X}}$ for new samples. Finally, note that CUR could also be used to select samples rather than features, by simply working on the left singular vectors, or equivalently by taking $\mathbf{X}^T$ as the "feature" matrix.
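The deterministic column selection described above can be sketched as follows: leverage scores from the top right singular vectors, greedy selection, and orthogonalization of the remaining columns as in Eq. (10). This is an illustrative NumPy implementation on synthetic data, not the authors' reference code.

```python
import numpy as np

def cur_select_columns(X, n_select, k=1):
    """Greedy CUR column selection with deflation (illustrative sketch)."""
    X = X.copy()
    selected = []
    for _ in range(n_select):
        _, _, Vt = np.linalg.svd(X, full_matrices=False)
        pi = np.sum(Vt[:k] ** 2, axis=0)           # leverage scores
        pi[selected] = -np.inf                     # never re-select a column
        j = int(np.argmax(pi))
        selected.append(j)
        xc = X[:, j:j + 1]
        X = X - xc @ (xc.T @ X) / (xc.T @ xc)      # orthogonalize remaining columns, Eq. (10)
    return selected

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
cols = cur_select_columns(X, n_select=5)
Xc = X[:, cols]
X_hat = Xc @ np.linalg.pinv(Xc) @ X                # low-rank approximation, Eqs. (9) and (11)
print(cols, np.linalg.norm(X - X_hat) / np.linalg.norm(X))
```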

II.3 Principal Covariates Regression

Principal covariates regression (PCovR) de Jong and Kiers (1992) is a combination of PCA and LR that attempts to find a low-dimensional projection $\mathbf{T}$ of the feature vectors that simultaneously minimizes the information loss and the error in predicting the target properties using only the latent-space vectors.

PCovR is formulated as the optimization of a loss that combines the individual PCA and LR losses, with a tuning parameter $\alpha$ that determines the relative weight given to the two tasks,

$\ell = \alpha\left\|\mathbf{X} - \mathbf{T}\mathbf{P}_{TX}\right\|^2 + (1-\alpha)\left\|\mathbf{Y} - \mathbf{T}\mathbf{P}_{TY}\right\|^2.$ (13)

The derivation we report here follows closely that in the original article de Jong and Kiers (1992).

II.3.1 Sample-space PCovR

It is easier to minimize Eq. (13) by looking for a projection in an auxiliary space for which we enforce orthonormality, $\tilde{\mathbf{T}}^T\tilde{\mathbf{T}} = \mathbf{I}$. Here, $\tilde{\mathbf{T}}$ represents the whitened projection in latent space.

This allows us to write $\mathbf{P}_{\tilde{T}X} = \tilde{\mathbf{T}}^T\mathbf{X}$ and $\mathbf{P}_{\tilde{T}Y} = \tilde{\mathbf{T}}^T\mathbf{Y}$. Substituting these expressions, we can express the loss as

$\ell = \alpha\left\|\mathbf{X} - \tilde{\mathbf{T}}\tilde{\mathbf{T}}^T\mathbf{X}\right\|^2 + (1-\alpha)\left\|\mathbf{Y} - \tilde{\mathbf{T}}\tilde{\mathbf{T}}^T\mathbf{Y}\right\|^2.$ (14)

This loss is minimized by maximising the associated similarity

$s = \alpha\operatorname{Tr}\!\left(\tilde{\mathbf{T}}^T\mathbf{X}\mathbf{X}^T\tilde{\mathbf{T}}\right) + (1-\alpha)\operatorname{Tr}\!\left(\tilde{\mathbf{T}}^T\mathbf{Y}\mathbf{Y}^T\tilde{\mathbf{T}}\right)$ (15)
$\approx \alpha\operatorname{Tr}\!\left(\tilde{\mathbf{T}}^T\mathbf{X}\mathbf{X}^T\tilde{\mathbf{T}}\right) + (1-\alpha)\operatorname{Tr}\!\left(\tilde{\mathbf{T}}^T\hat{\mathbf{Y}}\hat{\mathbf{Y}}^T\tilde{\mathbf{T}}\right),$ (16)

where we have substituted $\mathbf{Y}$ with the regression approximation $\hat{\mathbf{Y}} = \mathbf{X}\mathbf{P}_{XY}$, which is the only part that can be approximated in a subspace of the feature space. If we define the modified Gram matrix

$\tilde{\mathbf{K}} = \alpha\,\mathbf{X}\mathbf{X}^T + (1-\alpha)\,\hat{\mathbf{Y}}\hat{\mathbf{Y}}^T,$ (17)

we can further write the similarity as

$s = \operatorname{Tr}\!\left(\tilde{\mathbf{T}}^T\tilde{\mathbf{K}}\tilde{\mathbf{T}}\right).$ (18)

The latent-space projections that maximize the similarity correspond to the principal eigenvectors of the matrix $\tilde{\mathbf{K}}$, $\tilde{\mathbf{T}} = \hat{\mathbf{U}}_{\tilde{\mathbf{K}}}$. By analogy with multidimensional scaling, and to ensure that in the limit of $\alpha \to 1$ we obtain the same latent space as in MDS, one can define the projections as $\mathbf{T} = \hat{\mathbf{U}}_{\tilde{\mathbf{K}}}\hat{\mathbf{\Lambda}}_{\tilde{\mathbf{K}}}^{1/2}$. Note the similarity to Eq. (7). The projection from feature space to the latent space is then given by

$\mathbf{P}_{XT} = \left(\alpha\,\mathbf{X}^T + (1-\alpha)\,\mathbf{P}_{XY}\hat{\mathbf{Y}}^T\right)\hat{\mathbf{U}}_{\tilde{\mathbf{K}}}\,\hat{\mathbf{\Lambda}}_{\tilde{\mathbf{K}}}^{-1/2}.$ (19)

The projection matrix from the latent space to the properties can be computed from the LR solution

$\mathbf{P}_{TY} = \left(\mathbf{T}^T\mathbf{T}\right)^{-1}\mathbf{T}^T\mathbf{Y} = \hat{\mathbf{\Lambda}}_{\tilde{\mathbf{K}}}^{-1/2}\,\hat{\mathbf{U}}_{\tilde{\mathbf{K}}}^T\,\mathbf{Y}.$ (20)
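A compact NumPy sketch of the sample-space construction, following Eqs. (17)-(20): build the modified Gram matrix from the data and the regression approximation, diagonalize it, and fit a linear model in the resulting latent space. All data, the value of alpha, and the regularization are synthetic/illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
Y = X @ rng.standard_normal((20, 1)) + 0.1 * rng.standard_normal((100, 1))
X -= X.mean(0); Y -= Y.mean(0)
X /= np.linalg.norm(X); Y /= np.linalg.norm(Y)      # crude normalization for the sketch

alpha, lam, n_latent = 0.5, 1e-8, 2
P_XY = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y)
Y_hat = X @ P_XY                                    # regression approximation of Y

K_tilde = alpha * X @ X.T + (1 - alpha) * Y_hat @ Y_hat.T   # Eq. (17)
evals, evecs = np.linalg.eigh(K_tilde)
top = np.argsort(evals)[::-1][:n_latent]
T = evecs[:, top] * np.sqrt(evals[top])             # latent projections
P_TY = np.linalg.pinv(T) @ Y                        # Eq. (20)
print("latent-space RMSE:", np.sqrt(np.mean((Y - T @ P_TY) ** 2)))
```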

II.3.2 Feature-space PCovR

Rather than determining the optimal PCovR projections by diagonalizing the equivalent of a Gram matrix, one can tackle the problem in a way that more closely resembles PCA by instead diagonalizing a modified covariance matrix. One must first note that $\left(\mathbf{X}\mathbf{C}^{-1/2}\right)^T\left(\mathbf{X}\mathbf{C}^{-1/2}\right) = \mathbf{I}$, from which we see that $\mathbf{X}\mathbf{C}^{-1/2}$ is orthogonal. Writing $\tilde{\mathbf{T}} = \mathbf{X}\mathbf{C}^{-1/2}\hat{\mathbf{U}}$ with an orthonormal $\hat{\mathbf{U}}$, we can thus rewrite the similarity function from Eq. (18) as

$s = \operatorname{Tr}\!\left(\hat{\mathbf{U}}^T\,\mathbf{C}^{-1/2}\mathbf{X}^T\,\tilde{\mathbf{K}}\,\mathbf{X}\mathbf{C}^{-1/2}\,\hat{\mathbf{U}}\right) = \operatorname{Tr}\!\left(\hat{\mathbf{U}}^T\,\tilde{\mathbf{C}}\,\hat{\mathbf{U}}\right),$ (21)

introducing

$\tilde{\mathbf{C}} = \alpha\,\mathbf{C} + (1-\alpha)\,\mathbf{C}^{-1/2}\,\mathbf{X}^T\hat{\mathbf{Y}}\hat{\mathbf{Y}}^T\mathbf{X}\,\mathbf{C}^{-1/2}.$ (22)

The similarity is maximized when the orthogonal matrix $\hat{\mathbf{U}}$ matches the principal eigenvectors of $\tilde{\mathbf{C}}$, i.e. $\hat{\mathbf{U}} = \hat{\mathbf{U}}_{\tilde{\mathbf{C}}}$. Note that the direct feature-space analogue of $\tilde{\mathbf{K}}$, $\alpha\,\mathbf{C} + (1-\alpha)\,\mathbf{P}_{XY}\hat{\mathbf{Y}}^T\mathbf{X}$, is in general not a symmetric matrix, and so it is not possible to obtain from it an orthonormal set of eigenvectors; the symmetrized form in Eq. (22) avoids this problem. Consistently with the case of sample-space PCovR, we define

$\mathbf{P}_{XT} = \mathbf{C}^{-1/2}\,\hat{\mathbf{U}}_{\tilde{\mathbf{C}}}\,\hat{\mathbf{\Lambda}}_{\tilde{\mathbf{C}}}^{1/2},$ (23)

which minimizes the PCovR loss in Eq. (13), reduces to PCA for $\alpha = 1$ and, if the dimension of the latent space is at least as large as the number of target properties in $\mathbf{Y}$, reduces to LR for $\alpha = 0$.
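The feature-space route can be sketched in the same way, diagonalizing the modified covariance of Eq. (22) instead of the modified Gram matrix; with the same synthetic data the resulting latent space coincides with the sample-space one up to signs. This is a sketch under the formulas written above, not reference code.

```python
import numpy as np

def inv_sqrt(M, tol=1e-12):
    """Inverse square root of a symmetric PSD matrix, dropping null directions."""
    vals, vecs = np.linalg.eigh(M)
    vals = np.where(vals > tol, vals, np.inf)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
Y = X @ rng.standard_normal((20, 1)) + 0.1 * rng.standard_normal((100, 1))
X -= X.mean(0); Y -= Y.mean(0)
X /= np.linalg.norm(X); Y /= np.linalg.norm(Y)

alpha, lam, n_latent = 0.5, 1e-8, 2
C = X.T @ X
P_XY = np.linalg.solve(C + lam * np.eye(C.shape[0]), X.T @ Y)
Y_hat = X @ P_XY
C_isqrt = inv_sqrt(C)
C_tilde = alpha * C + (1 - alpha) * C_isqrt @ X.T @ Y_hat @ Y_hat.T @ X @ C_isqrt  # Eq. (22)

vals, vecs = np.linalg.eigh(C_tilde)
top = np.argsort(vals)[::-1][:n_latent]
P_XT = (C_isqrt @ vecs[:, top]) * np.sqrt(vals[top])   # Eq. (23)
T = X @ P_XT                                           # feature-space PCovR latent space
print(T[:3])
```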

Figure 4 demonstrates the behavior of PCovR when applied to the analysis of the CSD-1000r dataset. For $\alpha = 0$, we recover the accuracy of pure LR in predicting the values of the chemical shielding, but obtain a latent space that misses completely the structure of the dataset. The first principal component reflects the LR weight vector, and the second carries no information at all. For $\alpha = 1$, we recover the PCA projection, which separates clearly the environments based on the nature of the central atom. A linear model built in the two-dimensional latent space, however, performs rather poorly, because there is no linear correlation between the position in latent space and the shielding values. Intermediate values of $\alpha$ yield a projection that achieves the best of both worlds. The regression error is close to that of pure LR, but the error in the reconstruction of the input data from the latent space is now only marginally increased compared to pure PCA. There still exists a recognizable clustering of the environments according to the central atom species, but the O cluster, which exhibits the largest variance in the values of the shielding, is spread out diagonally so as to achieve maximal correlation between the position in latent space and the value of the target properties.

Figure 4: Principal Covariates Regression of CSD-1000r. Combining ridge regression (far left) and PCA (far right) with mixing parameter $\alpha$, PCovR can minimize the total loss of Eq. (13).

II.4 Kernel Methods

While linear methods have the beauty of simplicity, they rely on the knowledge of a sufficient number of informative features that reflect the relation between inputs and properties. Kernel methods make it possible to introduce a non-linear relation between samples in the form of a positive-definite kernel function (e.g. the Gaussian kernel $k(\mathbf{x}, \mathbf{x}') = \exp\!\left(-\left|\mathbf{x}-\mathbf{x}'\right|^2/2\sigma^2\right)$, or the linear kernel $k(\mathbf{x}, \mathbf{x}') = \mathbf{x}\cdot\mathbf{x}'$), and use it to define a higher-dimensional space in which data points serve effectively as an adaptive basis Cuturi (2009). Doing so can help uncover nonlinear relationships between the samples, resulting ultimately in a more effective determination of a low-dimensional latent space and in increased regression performance.

Mercer's theorem Mercer and Forsyth (1909) guarantees that given a positive-definite kernel there is a linear operator $\phi$ that maps input features into a (possibly infinite-dimensional) reproducing kernel Hilbert space (RKHS) Cuturi (2009) whose scalar product generates the kernel, i.e. $k(\mathbf{x}, \mathbf{x}') = \phi(\mathbf{x})\cdot\phi(\mathbf{x}')$. Note that $\phi$ is not necessarily known explicitly, but as we will see it can be approximated effectively for a given dataset, and we will use the notation $\mathbf{\Phi}$ to indicate the feature matrix that contains the (approximate) values of the kernel features for all of the sample points. We indicate with $\mathbf{K}$ the matrix that contains as entries the values of the kernel function between every pair of samples. In the case of a linear kernel, this is simply the Gram matrix computed for the input features, while for a non-linear kernel its entries can be computed by evaluating the kernel between pairs of samples, $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$. Some subtleties connected to the centering of kernels are discussed in the Jupyter notebooks provided in the SI.

II.4.1 Kernel Principal Component Analysis

Kernel principal component analysis Schölkopf, Smola, and Müller (1998) is analogous to classical MDS, to which it corresponds exactly when a linear kernel is used. To construct a KPCA decomposition, one computes the eigendecomposition of the kernel matrix, $\mathbf{K} = \mathbf{U}_{\mathbf{K}}\mathbf{\Lambda}_{\mathbf{K}}\mathbf{U}_{\mathbf{K}}^T$, and defines the projections as the principal components $\mathbf{T} = \hat{\mathbf{U}}_{\mathbf{K}}\hat{\mathbf{\Lambda}}_{\mathbf{K}}^{1/2}$. Note that, if one retains all the eigenvectors, $\mathbf{T}$ corresponds to an exact representation of the kernel features for the given dataset, as $\mathbf{T}\mathbf{T}^T = \mathbf{K}$. The projections can also be computed as $\mathbf{T} = \mathbf{K}\,\mathbf{U}_{\mathbf{K}}\mathbf{\Lambda}_{\mathbf{K}}^{-1/2}$, and this second expression can be used to project new data (using in place of $\mathbf{K}$ the matrix containing the values of the kernel between new and reference points) in the approximate RKHS defined by the original samples. As shown in Fig. 2c, for this dataset there is little qualitative difference between what we obtained with plain PCA and the KPCA projection. This is because the SOAP features exhibit very clear correlations with the nature of the central atom of each environment, which is already well represented by a linear model. The three peripheral clusters appear more elongated, and a few O-centred environments are projected very close to those centred around a N atom.
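A minimal KPCA sketch with an RBF kernel on synthetic data: center the kernel, diagonalize it, and scale the top eigenvectors as described above. The kernel choice and width are arbitrary illustrations.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))

K = rbf_kernel(X, X)
n = K.shape[0]
H = np.eye(n) - np.ones((n, n)) / n
Kc = H @ K @ H                                  # centered kernel

vals, vecs = np.linalg.eigh(Kc)
top = np.argsort(vals)[::-1][:2]
T = vecs[:, top] * np.sqrt(vals[top])           # KPCA projection, T = U_K Lambda_K^(1/2)
print(T[:3])
```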

II.4.2 Kernel Ridge Regression

Kernel ridge regression Girosi, Jones, and Poggio (1995); Smola and Schölkopf (2000) is analogous to ridge regression, except that the kernel feature-space vectors $\mathbf{\Phi}$ are substituted for the original input data $\mathbf{X}$, giving the loss

$\ell = \left\|\mathbf{Y} - \mathbf{\Phi}\mathbf{P}_{\Phi Y}\right\|^2 + \lambda\left\|\mathbf{P}_{\Phi Y}\right\|^2,$ (24)

so that the optimal weights are

$\mathbf{P}_{\Phi Y} = \left(\mathbf{\Phi}^T\mathbf{\Phi} + \lambda\mathbf{I}\right)^{-1}\mathbf{\Phi}^T\mathbf{Y}.$ (25)

Predicted properties can then be evaluated with $\hat{\mathbf{Y}} = \mathbf{\Phi}\mathbf{P}_{\Phi Y}$. One can avoid computing explicitly the RKHS features by redefining the weights as $\mathbf{W} = \left(\mathbf{K} + \lambda\mathbf{I}\right)^{-1}\mathbf{Y}$, so that $\mathbf{P}_{\Phi Y} = \mathbf{\Phi}^T\mathbf{W}$ Murphy (2012). We can then write the predicted properties as

$\hat{\mathbf{Y}} = \mathbf{K}\mathbf{W} = \mathbf{K}\left(\mathbf{K} + \lambda\mathbf{I}\right)^{-1}\mathbf{Y}.$ (26)

As shown in Fig. 2d, the greater flexibility afforded by a kernel model reduces the error by almost 50%.
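A short KRR sketch following Eqs. (25)-(26), using an RBF kernel on synthetic data; the kernel width and regularization values are illustrative.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
Y = np.sin(X[:, :1]) + 0.01 * rng.standard_normal((100, 1))

K = rbf_kernel(X, X)
lam = 1e-6
W = np.linalg.solve(K + lam * np.eye(len(K)), Y)    # W = (K + lambda I)^(-1) Y
Y_pred = K @ W                                      # Eq. (26)

X_new = rng.standard_normal((10, 5))
Y_new = rbf_kernel(X_new, X) @ W                    # prediction for new samples
print("train RMSE:", np.sqrt(np.mean((Y - Y_pred) ** 2)))
```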

II.5 Sparse Kernel Methods

Since the size of kernel matrices grows quadratically with the number of samples, one wants to avoid computing (and inverting) the whole kernel matrix for large datasets. Instead, we can formulate a low-rank approximation to the kernel matrix through the Nyström approximation Williams and Seeger (2001), using a subselection of the data points (the active set) to define an approximate RKHS. These representative points can be selected in a variety of ways; two straightforward methods that have been used successfully in atomistic modelling are farthest point sampling (FPS) Eldar et al. (1997) and a CUR matrix decomposition Mahoney and Drineas (2009); Bartók and Csányi (2015); Imbalzano et al. (2018).

Using the subscript $N$ to represent the full set of training data and $M$ to indicate the active set, one can explicitly construct the approximate feature matrix as $\mathbf{\Phi}_{NM} = \mathbf{K}_{NM}\,\mathbf{U}_{MM}\,\mathbf{\Lambda}_{MM}^{-1/2}$, where $\mathbf{U}_{MM}$ and $\mathbf{\Lambda}_{MM}$ are from the eigendecomposition of $\mathbf{K}_{MM}$. All sparse kernel methods can be derived as linear methods based on these features, although it is often possible to avoid explicitly computing $\mathbf{\Phi}_{NM}$. For instance, the approximate kernel matrix takes the form Williams and Seeger (2001)

$\hat{\mathbf{K}}_{NN} = \mathbf{\Phi}_{NM}\mathbf{\Phi}_{NM}^T = \mathbf{K}_{NM}\,\mathbf{K}_{MM}^{-1}\,\mathbf{K}_{MN}.$ (27)
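The Nyström construction can be sketched as follows, with a random toy active set and synthetic data (FPS or CUR would be used in practice): build the approximate RKHS features from the active-set kernel and verify that they reproduce the approximate kernel of Eq. (27).

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
active = rng.choice(len(X), size=20, replace=False)        # toy active set

K_MM = rbf_kernel(X[active], X[active])
K_NM = rbf_kernel(X, X[active])

vals, vecs = np.linalg.eigh(K_MM)
keep = vals > 1e-12
Phi = K_NM @ vecs[:, keep] @ np.diag(vals[keep] ** -0.5)   # Phi_NM = K_NM U_MM Lambda_MM^(-1/2)
K_approx = Phi @ Phi.T                                     # Eq. (27)
print(np.max(np.abs(K_approx - K_NM @ np.linalg.pinv(K_MM) @ K_NM.T)))
```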

II.5.1 Sparse Kernel Principal Component Analysis

Using the approximate (centered) feature matrix $\mathbf{\Phi}_{NM}$, we can define the covariance in the kernel feature space along with its eigendecomposition,

$\mathbf{C}_{MM} = \mathbf{\Phi}_{NM}^T\mathbf{\Phi}_{NM} = \mathbf{U}_{\mathbf{C}}\mathbf{\Lambda}_{\mathbf{C}}\mathbf{U}_{\mathbf{C}}^T,$ (28)

and subsequently compute the projections analogously to standard KPCA,

$\mathbf{T} = \mathbf{\Phi}_{NM}\,\hat{\mathbf{U}}_{\mathbf{C}},$ (29)

which effectively determine the directions of maximum variance of the samples in the active RKHS.

Fig. 2e shows that with an active set size of just 50 samples, selected by FPS Imbalzano et al. (2018), one can obtain a KPCA latent projection that matches very well the qualitative features of the full KPCA construction.
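A sketch of sparse KPCA on synthetic data: ordinary PCA on the Nyström features of the previous sketch (a random active set is used here rather than FPS, purely for brevity).

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
active = rng.choice(len(X), size=50, replace=False)

vals, vecs = np.linalg.eigh(rbf_kernel(X[active], X[active]))
keep = vals > 1e-12
Phi = rbf_kernel(X, X[active]) @ vecs[:, keep] @ np.diag(vals[keep] ** -0.5)
Phi -= Phi.mean(axis=0)                          # center the approximate features

C = Phi.T @ Phi                                  # covariance in the active RKHS, Eq. (28)
cvals, cvecs = np.linalg.eigh(C)
top = np.argsort(cvals)[::-1][:2]
T = Phi @ cvecs[:, top]                          # sparse KPCA projection, Eq. (29)
print(T[:3])
```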

II.5.2 Sparse Kernel Ridge Regression

In sparse KRR we proceed as in standard KRR, but use the feature matrix $\mathbf{\Phi}_{NM}$ from the Nyström approximation. The corresponding regularized LR loss in the kernel feature space is

$\ell = \left\|\mathbf{Y} - \mathbf{\Phi}_{NM}\mathbf{P}_{\Phi Y}\right\|^2 + \lambda\left\|\mathbf{P}_{\Phi Y}\right\|^2,$ (30)

for which the solution is

$\mathbf{P}_{\Phi Y} = \left(\mathbf{\Phi}_{NM}^T\mathbf{\Phi}_{NM} + \lambda\mathbf{I}\right)^{-1}\mathbf{\Phi}_{NM}^T\mathbf{Y}.$ (31)

Alternatively, we can redefine the weights so that

$\mathbf{W}_{M} = \mathbf{U}_{MM}\,\mathbf{\Lambda}_{MM}^{-1/2}\,\mathbf{P}_{\Phi Y},$ (32)

from which we see that

$\hat{\mathbf{Y}} = \mathbf{\Phi}_{NM}\mathbf{P}_{\Phi Y} = \mathbf{K}_{NM}\mathbf{W}_{M}.$ (33)

By writing $\mathbf{\Phi}_{NM}$ out explicitly in terms of $\mathbf{K}_{NM}$ and $\mathbf{K}_{MM}$ we obtain Smola and Schölkopf (2000)

$\hat{\mathbf{Y}} = \mathbf{K}_{NM}\left(\mathbf{K}_{MN}\mathbf{K}_{NM} + \lambda\,\mathbf{K}_{MM}\right)^{-1}\mathbf{K}_{MN}\,\mathbf{Y}.$ (34)

As shown in Fig. 2f, an active set size of 50 is not sufficient to achieve an accurate regression model, and the error is larger than with a linear regression method. However, the error can be reduced systematically by increasing the size of the active set, finding the best balance between accuracy and cost.
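A sketch of sparse KRR as ridge regression on the Nyström features, which is equivalent to Eq. (34) when written in terms of kernel blocks. The active-set selection (random here), kernel width and regularization are illustrative choices.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
Y = np.sin(X[:, :1]) + 0.01 * rng.standard_normal((200, 1))

active = rng.choice(len(X), size=20, replace=False)
vals, vecs = np.linalg.eigh(rbf_kernel(X[active], X[active]))
keep = vals > 1e-12
proj = vecs[:, keep] @ np.diag(vals[keep] ** -0.5)         # maps K_NM to Phi_NM
Phi = rbf_kernel(X, X[active]) @ proj

lam = 1e-6
P_PhiY = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ Y)   # Eq. (31)
Y_pred = Phi @ P_PhiY

X_new = rng.standard_normal((10, 5))
Y_new = rbf_kernel(X_new, X[active]) @ proj @ P_PhiY       # out-of-sample prediction
print("train RMSE:", np.sqrt(np.mean((Y - Y_pred) ** 2)))
```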

III Extensions to Principal Covariates Regression

Having summarized existing linear and kernel methods for feature approximation and property prediction, we now introduce kernelized PCovR (KPCovR) as a way to combine the conceptual framework of PCovR, which balances the determination of principal coordinates against regression performance, with the non-linear features afforded by a kernel method.

III.1 Full kernel PCovR

Figure 5: Kernel Principal Covariates Regression of CSD-1000r. Combining kernel ridge regression (far left) and KPCA (far right) with mixing parameter $\alpha$.

We start by constructing the augmented kernel matrix $\tilde{\mathbf{K}}$ as a combination of KPCA and KRR. In particular, we substitute $\mathbf{K}$ for $\mathbf{X}\mathbf{X}^T$, and the KRR solution for $\mathbf{Y}$, $\hat{\mathbf{Y}} = \mathbf{K}\left(\mathbf{K}+\lambda\mathbf{I}\right)^{-1}\mathbf{Y}$, for the LR approximation of $\mathbf{Y}$, so that we have

$\tilde{\mathbf{K}} = \alpha\,\mathbf{K} + (1-\alpha)\,\hat{\mathbf{Y}}\hat{\mathbf{Y}}^T,$ (35)

where we normalize the kernel matrix in a way that is equivalent to normalizing $\mathbf{\Phi}$. Just as in PCovR, the unit-variance projections $\tilde{\mathbf{T}}$ are given by the top eigenvectors of $\tilde{\mathbf{K}}$, and the non-whitened projections by $\mathbf{T} = \hat{\mathbf{U}}_{\tilde{\mathbf{K}}}\hat{\mathbf{\Lambda}}_{\tilde{\mathbf{K}}}^{1/2}$, corresponding to the kernel features associated with the PCovR kernel [Eq. (35)].

Projecting a new set of structures in the kernel PCovR space entails computing the kernel features between the new samples and the samples that were originally used to determine the KPCovR features. Given that one may not want to compute those features explicitly, it is useful to define a projection $\mathbf{P}_{KT}$ acting directly on the kernel, such that $\mathbf{T} = \mathbf{K}\mathbf{P}_{KT}$:

$\mathbf{P}_{KT} = \left(\alpha\,\mathbf{I} + (1-\alpha)\,\mathbf{W}\hat{\mathbf{Y}}^T\right)\hat{\mathbf{U}}_{\tilde{\mathbf{K}}}\,\hat{\mathbf{\Lambda}}_{\tilde{\mathbf{K}}}^{-1/2}, \qquad \mathbf{W} = \left(\mathbf{K}+\lambda\mathbf{I}\right)^{-1}\mathbf{Y}.$ (36)

We also determine the matrix $\mathbf{P}_{TY}$ that enables predictions of properties from the latent space through LR, just as in the linear case [Eq. (20)].

As shown in Fig. 5, the method behaves similarly to linear PCovR, performing at its best in the low-$\alpha$ regime, and converging to a KPCA-like behavior for $\alpha \to 1$. For the optimal value of $\alpha$, the latent-space map (shown in Fig. 6b) combines a clear separation of structurally-distinct clusters with the possibility of predicting the target property based on a linear model in the latent space, with an accuracy that approaches that of KRR, outperforming linear PCovR in terms of regression performance.
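A sketch of the full KPCovR construction on synthetic data: build the PCovR kernel of Eq. (35) from a centered kernel and the KRR prediction, diagonalize it, and fit a linear model in the latent space. The kernel, its normalization convention, alpha and the regularization are all illustrative assumptions, not the authors' settings.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
Y = np.sin(X[:, :1]); Y = (Y - Y.mean(0)) / Y.std(0)

K = rbf_kernel(X, X)
n = len(K)
H = np.eye(n) - np.ones((n, n)) / n
K = H @ K @ H                                   # centered kernel
K /= np.trace(K) / n                            # one possible normalization convention

alpha, lam = 0.5, 1e-6
W = np.linalg.solve(K + lam * np.eye(n), Y)     # KRR weights
Y_hat = K @ W                                   # KRR approximation of Y
K_tilde = alpha * K + (1 - alpha) * Y_hat @ Y_hat.T     # Eq. (35)

vals, vecs = np.linalg.eigh(K_tilde)
top = np.argsort(vals)[::-1][:2]
T = vecs[:, top] * np.sqrt(vals[top])           # KPCovR latent space
P_TY = np.linalg.pinv(T) @ Y                    # latent-space property model
print("latent-space RMSE:", np.sqrt(np.mean((Y - T @ P_TY) ** 2)))
```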

III.2 Sparse Kernel PCovR

Our derivation of the sparse version of KPCovR can be obtained almost directly from that of feature-space PCovR, by taking explicitly the projection of the kernel on the active RKHS, $\mathbf{\Phi}_{NM} = \mathbf{K}_{NM}\,\mathbf{U}_{MM}\,\mathbf{\Lambda}_{MM}^{-1/2}$. One can then define the covariance of the active kernel features

$\mathbf{C}_{MM} = \mathbf{\Phi}_{NM}^T\mathbf{\Phi}_{NM}$ (37)
$= \mathbf{\Lambda}_{MM}^{-1/2}\,\mathbf{U}_{MM}^T\,\mathbf{K}_{MN}\mathbf{K}_{NM}\,\mathbf{U}_{MM}\,\mathbf{\Lambda}_{MM}^{-1/2},$ (38)

and use it in the definition of the modified KPCovR covariance

$\tilde{\mathbf{C}}_{MM} = \alpha\,\mathbf{C}_{MM} + (1-\alpha)\,\mathbf{C}_{MM}^{-1/2}\,\mathbf{\Phi}_{NM}^T\hat{\mathbf{Y}}\hat{\mathbf{Y}}^T\mathbf{\Phi}_{NM}\,\mathbf{C}_{MM}^{-1/2},$ (39)

where $\hat{\mathbf{Y}}$ is the sparse KRR approximation of the properties. With these definitions, the projection matrices onto the (sparse) KPCovR latent space and onto the latent-space-restricted properties, analogous to Eq. (23), read

$\mathbf{P}_{\Phi T} = \mathbf{C}_{MM}^{-1/2}\,\hat{\mathbf{U}}_{\tilde{\mathbf{C}}}\,\hat{\mathbf{\Lambda}}_{\tilde{\mathbf{C}}}^{1/2}, \qquad \mathbf{P}_{TY} = \hat{\mathbf{\Lambda}}_{\tilde{\mathbf{C}}}^{-1/2}\,\hat{\mathbf{U}}_{\tilde{\mathbf{C}}}^T\,\mathbf{C}_{MM}^{-1/2}\,\mathbf{\Phi}_{NM}^T\,\mathbf{Y}.$ (40)

Similar to what we observed for sparse KPCA and KRR, reducing the active set to 50 samples preserves the qualitative features of the latent-space map, but leads to a substantial loss of performance for regression (Fig. 6c). The use of a hybrid PCovR loss and a 2-D latent space, however, does not entail any substantial further loss of accuracy.
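A sketch of sparse KPCovR as feature-space PCovR applied to the Nyström features of the active RKHS, following Eqs. (37)-(40) as written above. Data, active-set choice (random), alpha, regularization and normalization are illustrative assumptions.

```python
import numpy as np

def inv_sqrt(M, tol=1e-12):
    vals, vecs = np.linalg.eigh(M)
    vals = np.where(vals > tol, vals, np.inf)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def rbf_kernel(A, B, gamma=0.5):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
Y = np.sin(X[:, :1]); Y = (Y - Y.mean(0)) / Y.std(0)

active = rng.choice(len(X), size=20, replace=False)
vals, vecs = np.linalg.eigh(rbf_kernel(X[active], X[active]))
keep = vals > 1e-12
Phi = rbf_kernel(X, X[active]) @ vecs[:, keep] @ np.diag(vals[keep] ** -0.5)
Phi -= Phi.mean(axis=0)
Phi /= np.linalg.norm(Phi)                       # crude normalization for the sketch

alpha, lam, n_latent = 0.5, 1e-8, 2
C = Phi.T @ Phi                                  # Eqs. (37)-(38)
W = np.linalg.solve(C + lam * np.eye(C.shape[0]), Phi.T @ Y)
Y_hat = Phi @ W                                  # sparse KRR approximation of Y
C_isqrt = inv_sqrt(C)
C_tilde = alpha * C + (1 - alpha) * C_isqrt @ Phi.T @ Y_hat @ Y_hat.T @ Phi @ C_isqrt  # Eq. (39)

cvals, cvecs = np.linalg.eigh(C_tilde)
top = np.argsort(cvals)[::-1][:n_latent]
P_PhiT = (C_isqrt @ cvecs[:, top]) * np.sqrt(cvals[top])   # Eq. (40)
T = Phi @ P_PhiT
P_TY = np.linalg.pinv(T) @ Y
print("latent-space RMSE:", np.sqrt(np.mean((Y - T @ P_TY) ** 2)))
```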

Figure 6: Projections and Regression from PCovR Models. Analogous to Fig. 2, we show the loss incurred from the projections and regression of the three PCovR models at a common value of $\alpha$: (a) Principal Covariates Regression (PCovR), (b) Kernel PCovR (KPCovR), and (c) Sparse KPCovR with an active set of 50 samples. The value of $\alpha$ was chosen to best compare the models, although the ideal $\alpha$ will change between models, albeit often with a range of suitable values, as in Fig. 4. Colors denote the absolute error of the predicted properties, and the inset reports the corresponding error measures.

III.3 PCov-CUR feature selection

The PCovR formulation is particularly appealing for selecting a small number of features to be used for machine learning, because it incorporates information on the ability of different features to predict the target properties. We propose to proceed as in Section II.2.4, but to compute the leverage scores from the eigenvectors of the PCovR-modified covariance matrix $\tilde{\mathbf{C}}$ (given by Eq. (22)) in place of the singular value decomposition. We found that the best results are obtained by computing leverage scores using a number of eigenvectors that is smaller than or equal to the number of properties in $\mathbf{Y}$.

The only additional change that needs to be incorporated in the procedure described in Sec. II.2.4 involves eliminating at each step the components of the property matrix that can be described by the features that have already been selected. Computing at each step the reduced feature matrix $\tilde{\mathbf{X}}$ as in Eq. (12), one should perform the update

$\mathbf{Y} \leftarrow \mathbf{Y} - \tilde{\mathbf{X}}\left(\tilde{\mathbf{X}}^{+}\mathbf{Y}\right),$ (41)

so that the next iteration of the CUR selection identifies the features that are best suited to describe the residual part of the properties. Note that in order to select samples with PCov-CUR, one should instead compute the leverage scores based on the eigenvectors of the modified Gram matrix $\tilde{\mathbf{K}}$. Fig. 7 compares different approaches for feature selection, in terms of the $R^2$ coefficient, the RMSE in predicting chemical shieldings, and the error in approximating the kernel/Gram matrix based on the selected features. PCov-CUR systematically outperforms the other methods by 20-50% in terms of the RMSE of the reduced-dimensional model, although it incurs a penalty in the accuracy with which it approximates the kernel, which is a lesser concern when the objective is building an accurate supervised model. Interestingly, the performance of the models is not a monotonic function of $\alpha$, and a loss weight that is heavily skewed towards the KPCA loss (large $\alpha$) systematically outperforms a selection based on plain leverage scores. In the limit of a large number of features, FPS, a much simpler and inexpensive scheme, is very competitive with the CUR-based schemes, although it performs rather poorly for fewer than about 100 features.
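The PCov-CUR procedure can be sketched by combining the greedy CUR loop with PCovR-style leverage scores and the residual update of Eq. (41). The following is an illustrative NumPy implementation on synthetic data, not the authors' reference code; the deflation of Y uses the selected columns of the original feature matrix as a simple stand-in for the reduced features of Eq. (12).

```python
import numpy as np

def inv_sqrt(M, tol=1e-12):
    vals, vecs = np.linalg.eigh(M)
    vals = np.where(vals > tol, vals, np.inf)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def pcov_cur_select(X, Y, n_select, alpha=0.5, k=1, lam=1e-6):
    """Greedy PCov-CUR feature selection (illustrative sketch)."""
    X_orig = X.copy()
    X, Y = X.copy(), Y.copy()
    selected = []
    for _ in range(n_select):
        C = X.T @ X
        P_XY = np.linalg.solve(C + lam * np.eye(C.shape[0]), X.T @ Y)
        Y_hat = X @ P_XY
        C_isqrt = inv_sqrt(C)
        C_tilde = alpha * C + (1 - alpha) * C_isqrt @ X.T @ Y_hat @ Y_hat.T @ X @ C_isqrt
        vals, vecs = np.linalg.eigh(C_tilde)
        U = vecs[:, np.argsort(vals)[::-1][:k]]
        pi = np.sum(U ** 2, axis=1)                  # PCovR leverage scores
        pi[selected] = -np.inf                       # do not re-select columns
        j = int(np.argmax(pi))
        selected.append(j)
        # remove from Y what the already-selected features can describe, cf. Eq. (41)
        Xc = X_orig[:, selected]
        Y = Y - Xc @ np.linalg.pinv(Xc) @ Y
        # orthogonalize the remaining columns with respect to the selected one, Eq. (10)
        xc = X[:, j:j + 1]
        X = X - xc @ (xc.T @ X) / (xc.T @ xc)
    return selected

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 30))
Y = X[:, :5] @ rng.standard_normal((5, 1))
print(pcov_cur_select(X, Y, n_select=5, alpha=0.5))
```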

Figure 7: Regression and Torgerson Losses for Different Feature Selection Methods for CSD-1000r. Methods shown are CUR (orange), farthest point sampling (black), random feature selection (blue), and PCov-CUR (purple). For PCov-CUR, the range of errors is shown in light purple, with separate lines drawn for different values of $\alpha$ at intervals of 0.05. Thicker, patterned lines are shown for two selected values of $\alpha$, the second of which minimizes the regression error for an intermediate number of features.

IV Conclusions

In this paper we provide a comprehensive overview of linear and kernel-based methods for supervised and unsupervised learning, and show an example of their application to elucidate and predict structure-property relations in chemical and materials problems. We also discuss a simple extension of principal component analysis and linear regression, PCovR de Jong and Kiers (1992), which has so far received far less attention than, in our opinion, it deserves. We derive extensions of PCovR that make it possible to use it in the context of kernel methods (KPCovR and sparse KPCovR), and discuss how the rationale behind it can be put to good use to improve the selection of the most relevant features for the construction of supervised learning models, yielding a PCov-CUR scheme that outperforms schemes based exclusively on an analysis of the correlations between structural features. We provide a set of Jupyter notebooks that give a pedagogic introduction to both the traditional and the novel methods we discuss, and allow exporting structure-property maps in a format that can be visualized with an interactive tool that we also developed as part of this work. This study highlights the promise of combining supervised and unsupervised schemes in the analysis of data generated by atomistic modeling, to obtain clearer insights to guide the design of molecules and materials with improved performance, and to build more effective models to directly predict atomic-scale behavior.

References