1 Introduction
As one of the typical classes of multicomponent alloys, high entropy alloys (HEAs), consisting of four or more principal elements, have been widely studied due to their exceptional mechanical properties [yeh2004nanostructured; senkov2011mechanical; zhang2014microstructures; li2016metastable]. The increased number of elements expands the possible combinations and the pool of potential candidates for discovering next-generation materials with enhanced properties [tsai2014high; miracle2017critical; gao2016high]. Since material properties are inherently linked to the actual state of chemical ordering, much effort has been devoted to analyzing the degree of chemical ordering and to identifying order-disorder phase transitions [widom2014hybrid; gao2016high; murty2019high; eisenbach2019first]. Due to the high time cost of experimental research, computational simulations, typically first-principles calculations, are playing an increasingly central role in the investigation of various properties of HEAs [widom2018modeling; gao2017computational; ikeda2019ab]. First-principles density functional theory (DFT) methods have been established as a powerful and reliable tool in computational materials science and have enabled critical advances in the discovery of materials properties and performance [ye2016high; ma2015ab]. Despite increasing numerical efficiency and growing computing power (parallel and GPU computing), it remains difficult to carry out DFT calculations for relatively large supercells (thousands of atoms) and intensive sampling (huge numbers of configurations) [huang2017construction]. To characterize the order-disorder phase transition, a straightforward approach is to combine the DFT method with Monte Carlo simulations. However, this "brute-force" method is so computationally intensive that it is often impractical, even for a simple example such as a 250-atom CuZn alloy [khan2016density]. Consequently, it is necessary to establish an approximate configurational energy model fitted to DFT data and feed this sufficiently accurate and efficient "surrogate" model into Monte Carlo simulations for modeling thermodynamics and order-disorder phase transitions.
A typical and effective strategy is the cluster expansion (CE) method [kikuchi1951theory; sanchez1984generalized; van2002automating], which uses a discrete sum representation of material properties, for example configurational energies, in terms of the lattice site configuration and effective cluster interactions (ECIs), such as site pairs, triplets, quadruplets and other higher-order interactions. The fundamental challenge in constructing a CE model is to determine the ECIs in an efficient and robust way through a fit to reference configurations. A commonly used fitting approach is to minimize the overall difference between the CE-fitted energies and the DFT-calculated energies over a set of input configurations. In practice, the CE has to be truncated, and many advanced methods have been proposed to aid the efficient and accurate construction of a truncated CE. These methods include compressive sensing [nelson2013compressive], Bayesian methods [mueller2009bayesian], cluster basis set selection [blum2005using; hart2005evolutionary; seko2009cluster], machine learning [natarajan2018machine] and regularization [chang2019clease; aangqvist2019icet]. Although the CE is an effective Hamiltonian and can in principle be combined with Monte Carlo simulation to account for order-disorder phase transitions and chemical short-range order, its application to multicomponent systems is still intractable due to the rapidly increasing combinatorial number of interatomic interactions among chemical elements [jiang2016efficient; seko2009cluster]. Therefore, fitting a CE for multicomponent alloys, e.g., HEAs, becomes extremely difficult [widom2018modeling].
The recent development of machine learning presents exciting opportunities and challenges to various scientific fields [jordan2015machine; lecun2015deep]. Benefiting from advanced learning algorithms and large databases built by high-throughput computations, machine learning has been widely applied to materials research and discovery [butler2018machine; mueller2016machine; sanchez2018inverse]. Examples of successful applications include discovering complex materials behavior [kim2016organized; raccuglia2016machine], accurate identification and prediction of phase transitions [carrasquilla2017machine; huang2019machine; kostiuchenko2019impact], accelerated materials design and prediction of material properties [fujimura2013accelerated; pilania2013accelerating; ward2016general], and the modeling of various physical quantities, for instance interatomic potentials [dragoni2018achieving; chmiela2017machine; deringer2017machine] and atomic forces [chmiela2017machine; li2015molecular]. Compared to these successes in other fields, few studies have applied machine learning to the modeling of the thermodynamics of HEAs. This is because of the inherent challenges originating from the extremely large configuration space associated with multicomponent alloys. In many cases, only a small dataset can be drawn from expensive DFT calculations due to a limited computational budget, which gives rise to the issue of uncertainty quantification in model inference and certified predictions [aldegunde2016quantifying; kristensen2014bayesian; zhang2018quantification]. The learned model also faces the additional challenges of capturing the underlying physics with important features and covering the overall configurational space in an accurate and robust manner [liu2019machine].
In this work, we develop an efficient and robust Bayesian framework for fitting an accurate, feature-selected effective Hamiltonian, which is then employed in subsequent Monte Carlo simulations for modeling thermodynamics and order-disorder phase transitions. Bayesian regularized regression is employed, on the one hand, to stabilize predictions from sparse data and, on the other hand, to effectively quantify the uncertainties associated with the EPI parameters. To investigate the impact of model complexity, we conduct physical feature selection using the Bayesian information criterion, which allows for an effective truncation of the coordination shells given a specific dataset. We demonstrate that the accuracy and reliability of predictions with feature selection are significantly higher than those with an arbitrary truncation, specifically when data are limited. The first section of this paper presents a brief overview of the classical cluster expansion and the proposed effective pair interaction model [liu2019machine]. We then propose a robust data-driven framework, consisting of Bayesian regularized regression and Bayesian feature selection, that effectively reduces the model complexity, which is very difficult for the conventional CE model in HEAs. In the second section, we apply the proposed algorithm to three refractory HEAs, i.e., NbMoTaW, NbMoTaWV and NbMoTaWTi. Ensemble random sampling performs well in predicting the configurational energy compared with an approach that uses only a single supercell. Because of the limited dataset size, we systematically present the uncertainty quantification and correlation analysis of the model parameters within the Bayesian framework. Moreover, the effect of physical feature selection is carefully investigated for these three HEAs. Finally, the conclusions of the current work are summarized.
2 Theory and algorithm
2.1 Cluster expansion method revisited
The cluster expansion (CE) method is a widely used approach for the thermodynamic simulation of binary alloys due to its versatility and simplicity. Specifically, a binary alloy of $N$ sites can be represented as a vector of occupation configurations, $\boldsymbol{\sigma} = (\sigma_1, \sigma_2, \ldots, \sigma_N)$, where the "spin" variable $\sigma_i$ takes a value of $+1$ or $-1$ depending on the occupant (atom X or Y) of site $i$. A property $E$ of this binary crystal that depends on $\boldsymbol{\sigma}$ can be formulated as a polynomial expansion in terms of the occupation configurations
$$E(\boldsymbol{\sigma}) = J_0 + \sum_{\alpha} J_\alpha \Phi_\alpha(\boldsymbol{\sigma}), \quad (1)$$
where the expansion coefficients $J_\alpha$ are called the effective cluster interactions (ECIs), which are independent of the configuration and often determined by the crystal structure and chemistry of the binary alloy, $J_0$ is a constant that represents the empty cluster, and $\Phi_\alpha$ is the cluster function of the site cluster $\alpha$, defined as the product of the site basis functions $\sigma_i$:
$$\Phi_\alpha(\boldsymbol{\sigma}) = \prod_{i \in \alpha} \sigma_i. \quad (2)$$
Note that the cluster functions in Eq. (2) form a complete orthonormal basis on the configuration space $\{-1,+1\}^N$. When all possible cluster functions are considered, Eq. (1) is an exact expression. A more practical approach, however, is to truncate the CE to a finite number of cluster functions, since clusters whose sites are far apart are usually negligible. Typically, the energy is primarily determined by short-range interactions, so it is natural to represent the energy as a summation of interactions whose strength diminishes with increasing range:
$$E(\boldsymbol{\sigma}) = J_0 + \sum_{i} J_i \sigma_i + \sum_{i<j} J_{ij} \sigma_i \sigma_j + \sum_{i<j<k} J_{ijk} \sigma_i \sigma_j \sigma_k + \cdots, \quad (3)$$
where $J_i$, $J_{ij}$ and $J_{ijk}$ represent the interaction strengths of the point, pair and triplet clusters, respectively, and can be determined from the DFT-calculated total energies of different configurations in a variety of supercells. One can then use the CE as an efficient Hamiltonian in Monte Carlo simulations to reveal order-disorder phase transitions.
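As a concrete illustration of a truncated CE of the form of Eq. (3), the following minimal Python sketch evaluates the energy of a one-dimensional periodic binary chain with point and pair terms; the function name and the ECI values are hypothetical, chosen only for illustration.

```python
import numpy as np

def ce_energy(spins, J0, J_point, J_pair):
    """Truncated cluster-expansion energy, in the spirit of Eq. (3),
    for a 1-D periodic binary chain.

    spins   : array of +1/-1 site variables
    J0      : empty-cluster ECI
    J_point : point-cluster ECI
    J_pair  : dict mapping pair separation d -> pair ECI J_d
    """
    spins = np.asarray(spins)
    E = J0 + J_point * spins.sum()
    for d, J in J_pair.items():
        # sum of sigma_i * sigma_{i-d} over the periodic chain (each bond once)
        E += J * np.sum(spins * np.roll(spins, d))
    return E

# Alternating (ordered) configuration with a hypothetical pair interaction
config = [1, -1, 1, -1]
E_ord = ce_energy(config, J0=0.0, J_point=0.1, J_pair={1: -0.2})
```

For the alternating chain, every nearest-neighbor product $\sigma_i \sigma_{i+1}$ equals $-1$, so the pair term contributes $(-0.2)\times(-4) = 0.8$ in these illustrative units.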
However, it is often challenging to fit the ECIs of a CE for multicomponent crystalline solids, because the number of $n$-body terms scales as $M^n$ for an alloy with $M$ species. The series needs many terms, and the number of terms grows rapidly with the diameter of the cluster. As additional terms are added, the series coefficients may converge poorly given a limited number of configurations.
2.2 Effective pair interaction model
Since the conventional CE method is difficult to apply to HEAs, we propose to use an Ising-like model with only effective pair interactions (EPIs), without considering higher-order interactions. Fig. 1 shows a prototype square lattice with effective pair interactions. In terms of the pair distance, we define a series of short-range pair interactions, for example the nearest-neighbor, the next-nearest-neighbor and so on. The orbit of a specific pair centered around site $i$ consists of all the equivalent pair interactions. The effective Hamiltonian at lattice site $i$ can therefore be expressed as:
$$H_i = \frac{1}{2} \sum_{m} \sum_{j \in \mathcal{S}_m(i)} V^{m}_{s_i s_j} + C, \quad (4)$$
where $V^{m}_{\alpha\beta}$ is the interatomic pair potential between elements $\alpha$ and $\beta$, $s_i$ denotes the element occupying site $i$, $m$ is the index of the coordination shell separating sites $i$ and $j$, $\mathcal{S}_m(i)$ is the set of sites in the $m$th shell around $i$, and $C$ is the concentration-dependent part, which can be discarded for a given composition. Summing the Hamiltonian over all atomic sites yields the total energy:
$$E(\boldsymbol{\sigma}) = \sum_{m} \sum_{\alpha \le \beta} V^{m}_{\alpha\beta} \, f^{m}_{\alpha\beta}, \quad (5)$$
where $f^{m}_{\alpha\beta}$ is the percentage of $\alpha$-$\beta$ bonds in the $m$th coordination shell. For an $M$-component alloy system, the total number of distinct chemical bonds in the $m$th shell is $M(M+1)/2$, but for a fixed chemical composition there are constraints from the concentration of each element, for example $\sum_{\beta} f^{m}_{\alpha\beta}$ is fixed for each element $\alpha$. As a result, the number of independent variables per shell in an $M$-component alloy system is
$$\frac{M(M-1)}{2}, \quad (6)$$
which equals the number of short-range order (SRO) parameters between distinct element pairs at the $m$th shell of an $M$-component alloy. The Warren-Cowley SRO parameter is defined as
$$\alpha^{m}_{ij} = 1 - \frac{P^{m}_{ij}}{c_j}, \quad (7)$$
where $c_j$ is the concentration of element $j$, and $P^{m}_{ij}$ is the probability of finding element $j$ in the $m$th neighbor shell of element $i$. $\alpha^{m}_{ij}$ is a critical parameter for characterizing different chemical configurations: $\alpha^{m}_{ij} < 0$ means a preference for forming $i$-$j$ bonds at the $m$th shell, $\alpha^{m}_{ij} > 0$ indicates the opposite, and $\alpha^{m}_{ij} = 0$ for each pair corresponds to a completely random system. In fact, there is an effective pair interaction (EPI) corresponding to each SRO parameter. Consequently, Eq. (5) can be further written as
$$E(\boldsymbol{\sigma}) = \sum_{m} \sum_{i < j} V^{m}_{ij} \, P^{m}_{ij}, \quad (8)$$
where $P^{m}_{ij}$ is closely related to the SRO parameter in Eq. (7). For example, for the four-component refractory HEA NbMoTaW, there are in total ten different bonds for each coordination shell but only six independent ones, namely Nb-Mo, Nb-Ta, Nb-W, Mo-Ta, Mo-W and Ta-W. Given a specific configuration of NbMoTaW, it is straightforward to calculate $P^{m}_{\text{NbMo}}$, $P^{m}_{\text{NbTa}}$, $P^{m}_{\text{NbW}}$, $P^{m}_{\text{MoTa}}$, $P^{m}_{\text{MoW}}$ and $P^{m}_{\text{TaW}}$ at the $m$th neighbor shell. The corresponding interatomic pair coefficients $V^{m}_{\text{NbMo}}$, $V^{m}_{\text{NbTa}}$, $V^{m}_{\text{NbW}}$, $V^{m}_{\text{MoTa}}$, $V^{m}_{\text{MoW}}$ and $V^{m}_{\text{TaW}}$ in Eq. (8) can then be determined by linear regression using the bond probabilities $P^{m}_{ij}$ as the features [liu2019machine]. The cost of building an EPI model comes primarily from generating the training dataset, usually by DFT, to which the EPIs are fitted. This gives rise to two critical questions: 1) how to conduct an accurate and robust prediction that minimizes the energy error for a given training dataset, and 2) how to determine the number of physical features when data are sparse. These correspond to cluster selection, in which the optimal set of clusters is selected for inclusion in the expansion in a robust way that minimizes the expected prediction error.

2.3 Bayesian regularized regression
To obtain an EPI model for a specific HEA, one must determine the interatomic pair potentials $V^{m}_{ij}$ in Eq. (8), which can be cast in matrix form,
$$\mathbf{E} = \mathbf{X}\mathbf{V} + \boldsymbol{\epsilon}, \quad (9)$$
where the components of $\boldsymbol{\epsilon}$ are independent and identically distributed (i.i.d.) variables that follow $\mathcal{N}(0, \sigma^2)$. $\mathbf{X}$ is a matrix containing the bond-probability quantities of the training data, where the element in row $k$ associated with the pair $(i,j)$ at the $m$th shell is defined as
$$X_{k,(m,ij)} = P^{m}_{ij}\!\left(\boldsymbol{\sigma}^{(k)}\right), \quad k = 1, \ldots, n, \quad (10)$$
where $n$ is the number of training data (configurations). Note that there are $M(M-1)/2$ column vectors for each shell, so with $m_{\max}$ shells $\mathbf{X}$ is an $n \times m_{\max}M(M-1)/2$ matrix. $\mathbf{E}$ is a column vector whose $k$th element is the physical quantity (for example, the total energy) of the $k$th configuration, and $\mathbf{V}$ is the column vector of EPI coefficients, which likewise has $m_{\max}M(M-1)/2$ elements. Determining the EPIs by solving the linear system in Eq. (9) is equivalent to finding the parameter vector $\mathbf{V}$ that minimizes the residual sum of squared errors (RSS) $\|\mathbf{E} - \mathbf{X}\mathbf{V}\|_2^2$
using the ordinary least squares (OLS) method. Typically, OLS has low bias but large variance, and an OLS solution performs well for an overdetermined system. Due to the large computational cost of DFT calculations, however, the linear system in Eq. (9) is often underdetermined and therefore leads to an ill-posed problem. Another drawback of OLS is its susceptibility to overfitting [chang2019clease], which means that the EPI values are over-tuned to reproduce the physical quantity in the training dataset while losing predictability for new configurations that were "unseen" before. Meanwhile, the near-sightedness of the physical interactions in a CE suggests that $\mathbf{V}$ is sparse [nelson2013compressive]. Regularization is an effective way to combat overfitting and achieve sparse solutions by adding a regularization term in the form of an $\ell_1$ or $\ell_2$ norm. With $\ell_1$ regularization, the optimal EPI values are obtained by
$$\hat{\mathbf{V}} = \arg\min_{\mathbf{V}} \left\{ \|\mathbf{E} - \mathbf{X}\mathbf{V}\|_2^2 + \lambda \|\mathbf{V}\|_1 \right\}, \quad (11)$$
where $\lambda$ is a penalty parameter that determines the amount of regularization. The primary benefit of $\ell_1$ regularization is its promotion of sparsity, achieved through feature selection with a subset of EPI values set to zero. In principle, however, this shrinkage is performed based only on feature correlations while ignoring the inherent physical interactions, so it sometimes incorrectly forces EPI parameters to zero and consequently leads to unstable predictions, especially in the case of a sparse training dataset. Instead, this study uses the $\ell_2$ penalty for both fitting and penalization of the EPI coefficients. The solution then becomes
$$\hat{\mathbf{V}} = \arg\min_{\mathbf{V}} \left\{ \|\mathbf{E} - \mathbf{X}\mathbf{V}\|_2^2 + \lambda \|\mathbf{V}\|_2^2 \right\}, \quad (12)$$
which is a popular technique for improving prediction accuracy by shrinking large regression coefficients to reduce overfitting. Unlike $\ell_1$ regularization, $\ell_2$ regularization tends to retain all physical interaction information by only shrinking the EPI coefficients rather than setting most of them to zero. This, however, raises the challenges of optimally determining the regularization parameter and physically identifying the important features. Moreover, it is critical to quantify the uncertainties associated with the predictions and the EPI coefficients, particularly given the lack of data.
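The $\ell_2$ (ridge) problem of Eq. (12) has the closed-form solution $\hat{\mathbf{V}} = (\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^\top\mathbf{E}$ via the normal equations. The sketch below demonstrates this on synthetic data standing in for the EPI features; the dimensions, noise level and coefficient values are assumptions chosen for illustration, not values from this work.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the EPI problem: n configurations, p bond features
n, p = 30, 12
X = rng.normal(size=(n, p))                  # synthetic bond-probability features
V_true = np.zeros(p)
V_true[:4] = [1.0, -0.5, 0.3, 0.2]           # short-range-dominated coefficients
E = X @ V_true + 0.01 * rng.normal(size=n)   # "DFT" energies with small noise

# Ridge solution of Eq. (12): V = (X^T X + lam I)^-1 X^T E
lam = 1e-2
V_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ E)
```

With a small penalty and low noise, the recovered coefficients closely match the generating ones; in the underdetermined regime ($n < p$), the same formula remains solvable where plain OLS is not.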
In this paper, we propose a Bayesian view of regression with feature selection to address these challenges. Bayesian regression treats the parameters $\mathbf{V}$ and $\sigma^2$ in Eq. (9) as random variables, so the likelihood function can be written as:
$$p(\mathbf{E} \mid \mathbf{X}, \mathbf{V}, \alpha) = \prod_{k=1}^{n} \mathcal{N}\!\left(E_k \mid \mathbf{x}_k \mathbf{V}, \alpha^{-1}\right), \quad (13)$$
where $\alpha = \sigma^{-2}$ is the noise precision. Bayesian regression can also incorporate regularization into the estimation procedure. Instead of identifying the optimal $\lambda$ in a hard sense, Bayesian regression treats the regularization parameter as a random variable that can be estimated from the training data. This is achieved by introducing a hierarchical model with hyperparameters. In the Bayesian setting, the target total energy $\mathbf{E}$ is assumed to be Gaussian distributed:
$$p(\mathbf{E} \mid \mathbf{X}, \mathbf{V}, \alpha) = \mathcal{N}\!\left(\mathbf{E} \mid \mathbf{X}\mathbf{V}, \alpha^{-1}\mathbf{I}\right), \quad (14)$$
and the prior over the EPI coefficients $\mathbf{V}$ is given by a Gaussian distribution
$$p(\mathbf{V} \mid \lambda) = \mathcal{N}\!\left(\mathbf{V} \mid \mathbf{0}, \lambda^{-1}\mathbf{I}\right). \quad (15)$$
Consequently, the $\ell_2$ regularization in Eq. (12) is equivalent to finding a maximum a posteriori (MAP) estimate [zhang2019efficient] given a Gaussian prior over $\mathbf{V}$ with precision $\lambda$
. Typically, a MAP estimate of the posterior distribution is obtained by a Markov chain Monte Carlo (MCMC) algorithm, which is often computationally intensive and difficult to converge for high-dimensional problems. In this work, we instead consider conjugate priors, for which the posterior distribution can be derived analytically and is thus more efficient. To this end, the priors over $\alpha$ and $\lambda$ are selected to be gamma distributions
$$p(\alpha) = \text{Gamma}(\alpha \mid a_1, a_2), \qquad p(\lambda) = \text{Gamma}(\lambda \mid b_1, b_2), \quad (16)$$
where $(a_1, a_2)$ and $(b_1, b_2)$ are the hyperparameters of the gamma priors over $\alpha$ and $\lambda$. We select these to yield non-informative priors. All three random variables $\mathbf{V}$, $\alpha$ and $\lambda$ are estimated jointly by maximizing the likelihood during the fit of the regression model. Note that Bayesian regularized regression is more robust to ill-posed problems.

2.4 Bayesian feature selection
To construct the effective Hamiltonian for HEAs, the EPI model identifies the coordination shells as the essential physical features. Each shell contains $M(M-1)/2$ independent sub-features determined by the number of element species, for instance 6 for the four-component NbMoTaW HEA and 10 for the five-component NbMoTaWV HEA. Liu and Zhang [liu2019machine] have shown that the first two shells, associated with the nearest-neighbor and next-nearest-neighbor pairs, have the most significant impact on the prediction accuracy, while long-range pair interactions with a larger shell index $m$ have only a weak influence. In practice, the truncation of the coordination shells, $m_{\max}$, needs to be carefully examined and determined for each specific material. In this work, a Bayesian model selection method is applied to identify a "better" model complexity among a finite set of candidate EPI models.
The Bayesian information criterion (BIC) is widely used for feature selection and measures the efficiency of a parameterized model in predicting the data. It is defined as
$$\text{BIC} = k \ln n - 2 \ln \hat{L}, \quad (17)$$
where $\hat{L}$ is the maximized value of the model likelihood function, i.e., $\hat{L} = p(\mathbf{E} \mid \hat{\boldsymbol{\theta}}, \mathcal{M})$, with $\hat{\boldsymbol{\theta}}$ the parameter values that maximize the likelihood, $n$ is the number of observed data $\mathbf{E}$, and $k$ is the number of parameters (features) of the model $\mathcal{M}$. The BIC is derived via Laplace's approximation of the model evidence $p(\mathbf{E} \mid \mathcal{M})$, which is defined by Bayesian inference [zhang2018effect] as
$$p(\mathbf{E} \mid \mathcal{M}) = \int p(\mathbf{E} \mid \boldsymbol{\theta}, \mathcal{M}) \, p(\boldsymbol{\theta} \mid \mathcal{M}) \, d\boldsymbol{\theta}. \quad (18)$$
The BIC can be specialized to linear regression under the assumption that the model errors are i.i.d. random variables following a Gaussian distribution $\mathcal{N}(0, \sigma^2)$. The likelihood of $\mathbf{E}$ can then be written as
$$L = \left(2\pi\sigma^2\right)^{-n/2} \exp\!\left(-\frac{\text{RSS}}{2\sigma^2}\right), \quad (19)$$
where $\text{RSS} = \|\mathbf{E} - \mathbf{X}\mathbf{V}\|_2^2$ is the residual sum of squares of $\mathbf{E}$. Taking the derivative of $\ln L$ with respect to $\sigma^2$ and setting it to zero yields the maximizing value $\hat{\sigma}^2 = \text{RSS}/n$, and the corresponding maximized log-likelihood is $\ln \hat{L} = -\tfrac{n}{2}\ln(\text{RSS}/n) + \text{const}$. As a result, the BIC in terms of the RSS is given by
$$\text{BIC} = n \ln\!\left(\frac{\text{RSS}}{n}\right) + k \ln n. \quad (20)$$
This can be used to identify an appropriate number of physical features, i.e., the number of coordination shells in the EPI model, according to the intrinsic complexity present in a particular dataset. When selecting from a set of candidate models with various numbers of features, the one with the lowest BIC value is preferred.
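Once the RSS of each candidate truncation is known, Eq. (20) is trivial to evaluate. A minimal sketch of comparing two shell truncations (the RSS values here are made up for illustration):

```python
import numpy as np

def bic(rss, n, k):
    """BIC for a Gaussian linear model, Eq. (20): n*ln(RSS/n) + k*ln(n)."""
    return n * np.log(rss / n) + k * np.log(n)

# Toy comparison: a larger model lowers the RSS slightly but pays a
# complexity penalty of k*ln(n), so the smaller model can still win
n = 100
bic_small = bic(rss=2.0, n=n, k=3)   # e.g. 3 coordination shells
bic_large = bic(rss=1.9, n=n, k=9)   # e.g. 9 shells, marginal RSS gain
best = min([(bic_small, 3), (bic_large, 9)])[1]
```

Here the marginal RSS improvement of the nine-feature model does not offset its penalty, so the three-feature model has the lower BIC and is preferred.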
2.5 Robust datadriven algorithm procedure
With the constituents outlined in the previous sections, the proposed methodology is summarized here and a flowchart is provided in Fig. 2.

Step 1: Data collection. An ensemble sampling strategy is used to combine DFT data calculated with different supercell sizes. This benefits from incorporating datasets with different degrees of long-range and short-range order.

Step 2: Feature identification. Given a set of random configurations of an $M$-component alloy, the first step is to determine the $M(M-1)/2$ independent pair interactions according to the EPI model. Then the bond probabilities $P^{m}_{ij}$, the physical features defined in Eq. (7), are carefully calculated for each coordination shell $m$.

Step 3: Feature selection. To achieve accurate predictions and reduce overfitting, feature selection using the BIC in Eq. (20) is employed to determine the truncated number of coordination shells for a specific HEA. Note that each shell also includes $M(M-1)/2$ sub-features, which are either fully retained or truncated as a whole with the $m$th shell.

Step 4: Bayesian regularized regression. Bayesian regression with $\ell_2$ regularization provides a robust prediction of the configurational energy for a given configuration $\boldsymbol{\sigma}$. Under the assumption of Gaussian distributions with conjugate priors, the uncertainties of the EPI parameters are efficiently quantified, which is particularly valuable given the limited DFT data available due to the prohibitive computational cost.

Step 5: Error evaluation and model update. The performance of the predicted model is assessed by k-fold cross validation (k = 5) with the root-mean-square error (RMSE) metric, defined as:
$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{k=1}^{n}\left(E_k^{\text{DFT}} - \hat{E}_k\right)^2}, \quad (21)$$
where $E_k^{\text{DFT}}$ is the true energy calculated by DFT and $\hat{E}_k$ is the energy predicted by the proposed methodology. If the RMSE is smaller than a specific threshold $\varepsilon$, for example $\varepsilon$ = 1 meV, the predictive accuracy is acceptable; otherwise additional DFT data are required, and the algorithm returns to Step 1 to further improve the performance.
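The cross-validated RMSE of Step 5 can be sketched as follows, here with an $\ell_2$-regularized fit standing in for the full Bayesian regression; the function name and regularization value are illustrative assumptions.

```python
import numpy as np

def kfold_rmse(X, E, lam=1e-2, k=5, seed=0):
    """k-fold cross-validated RMSE (Eq. 21) of an l2-regularized EPI fit."""
    n, p = X.shape
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    sq_errs = []
    for fold in folds:
        train = np.setdiff1d(idx, fold)
        # Ridge fit on the training folds only
        V = np.linalg.solve(X[train].T @ X[train] + lam * np.eye(p),
                            X[train].T @ E[train])
        # Squared prediction errors on the held-out fold
        sq_errs.append((X[fold] @ V - E[fold]) ** 2)
    return float(np.sqrt(np.concatenate(sq_errs).mean()))
```

Because every configuration is held out exactly once, the resulting RMSE estimates the error on "unseen" configurations rather than the training error.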

Step 6: Monte Carlo calculations of thermodynamics. If the prediction is sufficiently accurate and stable, the fitted model can be fed as an efficient surrogate into Monte Carlo simulations for modeling thermodynamics and order-disorder phase transitions. This step is not the focus of this paper; interested readers can find more discussion in the recent review literature [widom2018modeling; eisenbach2019first].
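As a toy illustration of the feature identification in Step 2, the sketch below computes pair-probability features on a one-dimensional periodic lattice; a production implementation would instead enumerate the neighbor shells of the actual bcc supercell, so the helper name and 1-D geometry are simplifying assumptions.

```python
import numpy as np
from itertools import combinations

def bond_probabilities(sites, shell_offsets):
    """Pair-probability features P_{ij}^m on a 1-D periodic lattice.

    sites         : array of integer element labels, one per lattice site
    shell_offsets : list of site separations, one per coordination shell
    Returns a flat feature vector, one entry per (shell, unordered element pair).
    """
    sites = np.asarray(sites)
    elements = np.unique(sites)
    pairs = list(combinations(elements, 2))   # independent cross-element bonds
    feats = []
    for d in shell_offsets:
        neigh = np.roll(sites, d)             # neighbor at separation d
        for a, b in pairs:
            # fraction of sites forming an a-b bond at this separation
            ab = np.mean((sites == a) & (neigh == b)) + \
                 np.mean((sites == b) & (neigh == a))
            feats.append(ab)
    return np.array(feats)
```

Stacking one such feature vector per configuration, with the corresponding DFT energies, yields the design matrix $\mathbf{X}$ and target vector $\mathbf{E}$ of Eq. (9).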
3 Results and discussion
3.1 Prediction of configurational energy
In this work, we systematically investigate three HEAs: NbMoTaW, NbMoTaWV and NbMoTaWTi. The locally self-consistent multiple scattering (LSMS) method [wang1995order] is used for the total energy calculations, with supercells of 16, 32, 64 and 128 atoms for NbMoTaW and 20, 40, 80 and 160 atoms for NbMoTaWV and NbMoTaWTi, respectively. Fig. 3 shows the bcc supercell lattices of NbMoTaW (128 atoms) and of NbMoTaWV and NbMoTaWTi (160 atoms). For each supercell size, 200 configurations are randomly drawn and the corresponding energies are calculated by DFT. The three smaller supercells, with a total of 600 data points, are selected as the training dataset for Bayesian regularized regression, and the 200 data points from the largest supercell are reserved for testing. Six coordination shells are used in the EPI model for this case. Fig. 4 compares the predicted energies with the DFT-calculated energies for the three HEAs, and the corresponding training and testing RMSEs are listed in Table 1. For all three HEAs, the testing RMSEs of roughly 0.6 meV show that the learned model is accurate and robust for a system described by a relatively large supercell size.
HEA  NbMoTaW  NbMoTaWV  NbMoTaWTi 

Training (meV)  0.335  0.710  1.400 
Testing (meV)  0.632  0.647  0.665 
Typically, as the supercell size increases, the configurations tend to transition from ordered to disordered states. Due to the periodic boundary conditions, configurations drawn from smaller supercells often include different types of long-range order, while samples obtained with larger supercells contain various degrees of short-range order. As a result, it is quite possible that a random system with a large supercell is not well represented by samples generated only from small supercells, which are commonly used because of their efficiency but may result in a loss of physical information at the thermodynamic limit. To build a robust data-driven approach, we adopt an ensemble random sampling strategy that combines data from different supercells. This simple yet efficient technique aims to obtain a training dataset with different degrees of order and disorder, so that the data are more representative.
The benefit of the ensemble sampling strategy can be seen in Fig. 5. All 200 data points from the relatively large supercell (128 atoms for NbMoTaW and 160 atoms for NbMoTaWV and NbMoTaWTi) are selected as the testing dataset. A total of 150 data points randomly drawn from each of the smaller supercells (16, 32 and 64 atoms for NbMoTaW; 20, 40 and 80 atoms for NbMoTaWV and NbMoTaWTi) are selected as the training dataset. The ensemble sampling strategy constructs an ensemble dataset by randomly drawing 50 data points, using Latin hypercube sampling [shields2016generalization], from each of the three supercells. For these four cases, 100 random trials are carried out to estimate the standard deviation (error bars in Fig. 5) of the testing RMSE. The mean and standard deviation of the RMSE are clearly largest when only the smallest supercell is used. This underscores the fact that the configuration space of a random system is not well covered by training data drawn only from small supercells. The performance of the 32 (40) and 64 (80) atom supercells is better than that of the 16 (20) atom supercell, showing that these training data capture more degrees of short-range and long-range order. The ensemble sampling strategy (red bars), with a minimal RMSE mean (about 1 meV) and standard deviation, beats the other three cases in terms of accuracy and stability. This robust strategy plays a substantial role in the data-driven modeling of the configurational energy, so that subsequent Monte Carlo simulations based on this efficient Hamiltonian can safely explore the whole configuration space.

3.2 Uncertainty in effective pair interaction bonds
The coordination shells, as the physical features, have a pivotal influence on the EPI model, and their impact can be analyzed through the EPI parameters shown in Fig. 6 to Fig. 8. For all three refractory HEAs, the first two shells, involving the nearest- and next-nearest-neighbor interactions, are dominant, while the 3rd to 6th shells play a less essential role. A comparison between NbMoTaW and the other two HEAs shows that the EPIs of NbMoTaW, shown in Fig. 6 (c), are relatively stable and short-ranged, given the small magnitudes associated with the longer-ranged shells, while NbMoTaWV and NbMoTaWTi, shown in Fig. 7 (c) and Fig. 8 (c), are more frustrated and longer-ranged, with certain contributions from up to the 6th shell.
Moreover, three different dataset sizes, 100, 400 and 800, are considered here to investigate the effect of the data on the EPI parameters (chemical bonds). The ensemble sampling strategy is again used, collecting 25, 100 and 200 data points from each of the four supercell sizes, i.e., 16, 32, 64 and 128 atoms for NbMoTaW and 20, 40, 80 and 160 atoms for NbMoTaWV and NbMoTaWTi. For a small dataset size, for example 100 data points, the trend of the EPI bonds is consistent in that the first two shells are dominant, but the values still show a discrepancy from those obtained with larger dataset sizes, for example 800 data points. In other words, the uncertainties associated with the EPI bonds are primarily caused by the lack of data. This is reflected in the variance-covariance matrices of the EPI bonds, shown in Fig. 6 to Fig. 8 (a). The variance in the 1st shell is the largest, followed by the 2nd shell and the 6th shell, which are larger than those of the other shells. The covariance values trending toward zero demonstrate that there is no strong correlation between shells, which agrees with the independence assumption.
As the dataset size increases from 100 to 400, the EPI bonds become more stable and almost the same as in the 800-data case. The corresponding variances are also significantly reduced, by more than one order of magnitude, as shown in Fig. 6 to Fig. 8 (b). Furthermore, the variance-covariance matrix presents an increasingly clear "pattern" as the dataset size grows, which leads to several observations: a) the pattern of the variance-covariance matrix is divided according to the identified physical features (coordination shells); b) the EPI bonds in the nearest-neighbor shell show a stronger positive correlation, while the 6th shell exhibits a relatively large uncertainty with negative correlations among its EPI bonds; c) the EPI bonds of the 1st and 4th shells are negatively correlated. Even though most of the EPI bonds are physically independent, as expected, the EPI bonds within a specific shell are not completely independent, particularly in the nearest-neighbor shell, and there are still slight positive or negative correlations between different shells.
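The variance-covariance matrices discussed above follow directly from the Gaussian model of Sec. 2.3, whose posterior covariance is $\boldsymbol{\Sigma} = (\alpha\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I})^{-1}$. The sketch below, using synthetic features and assumed precision values, reproduces the qualitative trend that more data shrink the coefficient variances:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 6                        # configurations and bond features (synthetic)
X = rng.normal(size=(n, p))
alpha, lam = 1.0 / 0.01**2, 1e-2     # assumed noise precision and weight precision

# Posterior covariance of the EPI coefficients: Sigma = (alpha X^T X + lam I)^-1
Sigma_200 = np.linalg.inv(alpha * X.T @ X + lam * np.eye(p))
Sigma_50 = np.linalg.inv(alpha * X[:50].T @ X[:50] + lam * np.eye(p))
```

Since the full dataset's $\mathbf{X}^\top\mathbf{X}$ dominates that of any subset in the positive-semidefinite sense, every diagonal entry of `Sigma_200` (the coefficient variances) is smaller than the corresponding entry of `Sigma_50`.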
3.3 Effect of physical feature selection
As discussed above, due to the lack of data, the uncertainties of the EPI bonds have been clearly quantified via Bayesian regularized regression; these uncertainties propagate to the prediction of the configurational energy and ultimately affect the thermodynamic quantities. This raises a critical issue: how to systematically reduce the uncertainty and improve the robustness and reliability of the predictions when only limited data are collected. To address this challenge, we introduce the Bayesian information criterion (BIC) to conduct physical feature selection and investigate the effect of the number of coordination shells on the prediction accuracy. Fig. 9 to Fig. 11 show the BIC values for different numbers of shells given various sizes of the training dataset. Using the ensemble sampling strategy, we randomly collect data (from 20 to 200 points) from each of the four supercells over 100 random trials and then estimate the mean and standard deviation of the BIC values for each specific number of shells. For NbMoTaW in Fig. 9 (a), the number of shells corresponding to the smallest BIC value, $m_{\max} = 2$, represents the "best" choice given only 20 data points per supercell, while a larger number of shells leads to overfitting. As more data are collected, the best number of shells gradually increases and converges towards $m_{\max} = 9$. When relatively large datasets (100 or more points per supercell) are available, the mean BIC values for the larger numbers of shells are quite close, and the variation of the BIC for each choice is much smaller than in the small-data case. NbMoTaWV, shown in Fig. 10, exhibits an overall trend similar to NbMoTaW, but $m_{\max} = 3$ is identified for the 20-data case. NbMoTaWTi also tends toward a smaller number of shells when only limited data are provided, but it displays an early converged number of shells, $m_{\max} = 6$, unlike NbMoTaW and NbMoTaWV, which converge to $m_{\max} = 9$.
Table 2 provides a complete list of the best number of shells for a given dataset size (from 20 to 200 points per supercell) for the three refractory HEAs. If we divide the data into three categories, namely small size (20 to 25 points per supercell), medium size (30 to 75) and large size (100 to 200), we can briefly conclude as follows:

Small size: select a small number of shells (2 or 3) to avoid overfitting

Medium size: identify 5 or 6 shells as a reliable choice to avoid underfitting

Large size: determine a reasonably large number of features in view of the bias-variance tradeoff [mehta2019high]
Data per supercell  NbMoTaW  NbMoTaWV  NbMoTaWTi 
20  2  3  2 
25  2  3  3 
30  5  5  3 
35  5  5  5 
40  5  5  5 
45  5  5  6 
50  6  5  6 
55  6  5  6 
60  6  5  6 
65  6  5  6 
70  9  5  6 
75  9  6  6 
100  9  9  6 
125  9  9  6 
150  9  9  6 
175  9  9  6 
200  9  9  6 
Next, we turn our attention to the effect of feature selection on the testing RMSE. As shown in Fig. 12, we compare the RMSE results for three choices of the number of shells: a fixed small number, a fixed large number, and the best number of shells $m^*$ identified by the BIC. For small training datasets, the RMSEs of both fixed truncations are significantly larger than the result obtained with the best number of shells. The RMSE means of the fixed choices approach that of the best shells as the dataset size increases (see Fig. 12 (a)), but their variations (standard deviations) remain substantially larger. When a relatively large dataset is collected, as shown in Fig. 12 (b) (note the different y-axis scale from Fig. 12 (a)), the RMSEs of the large truncation and the best shells gradually converge to a small value, while the RMSE of the small truncation converges to a larger value and cannot be further decreased. In other words, a short cutoff for the physical features does not fully capture all the physical information and underestimates the system complexity of the HEAs.
Fig. 13 and Fig. 14 show the RMSE results for NbMoTaWV and NbMoTaWTi, respectively. Given a small dataset, the RMSE of the arbitrary shell choices is considerably larger, which may lead to a large bias in subsequent Monte Carlo simulations of thermodynamics, whereas the best shells with the same data yield a relatively small and reliable RMSE estimate. At medium dataset sizes, the RMSE of the arbitrary choices is reduced but remains larger than that of the best shells, as observed in Fig. 13 (a) and Fig. 14 (a). When relatively large data are considered, the arbitrary choices for NbMoTaWTi eventually converge towards the same minimal value as the best shells, while NbMoTaWV still shows a slight underfitting issue because the best number identified by the BIC is greater than the arbitrary choice. Through careful analysis and comparison of these three HEAs, we find that the model with the best number of shells identified by feature selection achieves highly accurate and robust performance on both small and relatively large data. Rather than an arbitrary selection of the physical feature, a reliable feature-selection scheme can effectively reduce the risk of underfitting and overfitting during prediction.
4 Conclusions
In this work, we develop a robust Bayesian framework to discover efficient and accurate models of the configurational energy from a data-driven perspective. A short-range pair interaction model is used to characterize the physical feature and is well suited to the large number of components inherent in multicomponent systems. A Bayesian regression algorithm is employed to establish an efficient Hamiltonian from a set of random configurations, with the corresponding energies calculated by the DFT method, and to quantify the uncertainties and correlations of the effective pair interaction parameters. To improve the accuracy and reliability of the prediction, we further perform Bayesian feature selection to determine the truncation of the model, particularly when data are scarce. All three HEAs, NbMoTaW, NbMoTaWV and NbMoTaWTi, demonstrate highly accurate and robust performance in predicting the configurational energy. The proposed method is therefore a powerful tool for studying thermodynamics and order-disorder phase transitions through subsequent Monte Carlo simulations.
Specifically, we find that a small, single supercell is unable to adequately explore the various ordered and disordered states of multicomponent systems. We therefore propose an ensemble sampling strategy that naturally incorporates chemical configurations with different degrees of short-range and long-range order and thus samples the huge configuration space well; it has proven to be a simple yet robust scheme for enhancing the representativeness of the data. Using this strategy, the resulting RMSE values of the three HEAs show a very small mean with a tiny standard deviation, significantly lower than the cases that use only a single supercell.
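The ensemble sampling strategy summarized above amounts to pooling random draws from several supercells. The sketch below illustrates the idea with placeholder supercell names and configuration counts (the four-cell, 250-configuration setup is an assumption for illustration, not the paper's actual dataset):

```python
# Hypothetical sketch of ensemble sampling: draw configurations from
# several independent supercells and pool them into one training set,
# mixing different degrees of chemical order. Names/sizes are placeholders.
import random

def ensemble_sample(supercells, n_per_cell, seed=0):
    """Randomly draw n_per_cell configurations from each supercell."""
    rng = random.Random(seed)
    pooled = []
    for name, configs in supercells.items():
        pooled.extend((name, c) for c in rng.sample(configs, n_per_cell))
    rng.shuffle(pooled)  # mix the supercells in the training order
    return pooled

# Four supercells, each holding (placeholder) configuration indices.
supercells = {f"cell_{i}": list(range(250)) for i in range(4)}
train = ensemble_sample(supercells, n_per_cell=20)
# 20 configurations from each of 4 supercells -> 80 pooled samples.
```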
We also note that the chemical bonds in the EPI model show a frustrated, fluctuating behavior if the dataset size is too small. This can likewise be observed in the uncertainty quantification of the EPI bonds; for instance, the nearest-neighbor shell displays a high level of uncertainty. However, as more data are obtained, the bonds become stable and the uncertainties are reduced. Moreover, we find a clear pattern in the variance-covariance matrix, which characterizes each coordination shell as an individual physical feature consisting of several sub-features whose number depends on the number of component species. We also notice a certain degree of correlation within the nearest-neighbor shell, even though the bonds there are essentially assumed to be independent.
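The shrinking of the parameter uncertainties with growing data can be made concrete with a conjugate Bayesian linear (ridge) model, whose posterior covariance is available in closed form. The prior and noise precisions below are fixed by assumption for simplicity (in practice they would be estimated from the data), and the feature dimension is an invented placeholder:

```python
# Hypothetical sketch: posterior uncertainty of EPI-like parameters under a
# Bayesian ridge model, shrinking as the dataset grows. alpha (prior
# precision) and beta (noise precision) are assumed fixed for simplicity.
import numpy as np

def posterior(X, y, alpha=1.0, beta=100.0):
    """Gaussian posterior N(m, S) for weights w, with prior w ~ N(0, I/alpha)
    and likelihood y ~ N(Xw, I/beta)."""
    S = np.linalg.inv(alpha * np.eye(X.shape[1]) + beta * X.T @ X)
    m = beta * S @ X.T @ y
    return m, S

rng = np.random.default_rng(2)
n_feat = 12                      # e.g. pair types in the first two shells
w_true = rng.normal(size=n_feat)
X = rng.normal(size=(200, n_feat))
y = X @ w_true + 0.1 * rng.normal(size=200)

m_small, S_small = posterior(X[:20], y[:20])   # scarce data
m_large, S_large = posterior(X, y)             # more data
# Marginal standard deviations of the parameters shrink with more data,
# since the data-precision term beta * X^T X grows with every observation.
sd_small = np.sqrt(np.diag(S_small))
sd_large = np.sqrt(np.diag(S_large))
```

The off-diagonal entries of `S` play the role of the variance-covariance pattern discussed above: parameters within a shell can remain correlated even when the prior assumes them independent.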
Finally, the impact of feature selection and the effect of the dataset size are carefully discussed. For each material, the best truncated number of coordination shells differs slightly for a specific dataset, but all three materials demonstrate a similar trend. We therefore provide a general suggestion according to the dataset size: a small number of shells (2 or 3) for small datasets, 5 or 6 shells for medium datasets, and a larger number (up to 9 shells) for relatively large datasets. Using this feature-selection scheme, we demonstrate accurate and robust performance for all three HEAs without the underfitting or overfitting issues that often arise in machine-learning modeling.
5 Acknowledgements
This work of J. Z. was supported by the Laboratory Directed Research and Development Program of Oak Ridge National Laboratory. X. L. and M. E. were supported by the U.S. Department of Energy, Office of Science, Basic Energy Sciences, Materials Science and Engineering Division. This research used resources of the Oak Ridge Leadership Computing Facility, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.