Robust data-driven approach for predicting the configurational energy of high entropy alloys

08/10/2019 ∙ by Jiaxin Zhang, et al. ∙ 9

High entropy alloys (HEAs) have been increasingly attractive as promising next-generation materials due to their various excellent properties. It's necessary to essentially characterize the degree of chemical ordering and identify order-disorder transitions through efficient simulation and modeling of thermodynamics. In this study, a robust data-driven framework based on Bayesian approaches is proposed and demonstrated on the accurate and efficient prediction of configurational energy of high entropy alloys. The proposed effective pair interaction (EPI) model with ensemble sampling is used to map the configuration and its corresponding energy. Given limited data calculated by first-principles calculations, Bayesian regularized regression not only offers an accurate and stable prediction but also effectively quantifies the uncertainties associated with EPI parameters. Compared with the arbitrary determination of model complexity, we further conduct a physical feature selection to identify the truncation of coordination shells in EPI model using Bayesian information criterion. The results achieve efficient and robust performance in predicting the configurational energy, particularly given small data. The developed methodology is applied to study a series of refractory HEAs, i.e. NbMoTaW, NbMoTaWV and NbMoTaWTi where it is demonstrated how dataset size affects the confidence we can place in statistical estimates of configurational energy when data are sparse.



There are no comments yet.


page 12

page 13

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

As one of the typical multicomponent alloys, high entropy alloys (HEAs) consisting of four or more principal elements have been widely studied due to their exceptional mechanical properties yeh2004nanostructured; senkov2011mechanical; zhang2014microstructures; li2016metastable. The increased number of elements expand the possible combinations and potential candidates for discovering next-generation materials with enhanced properties tsai2014high; miracle2017critical; gao2016high

. Typically, the material properties are inherently linked to the actual state of chemical ordering, much efforts have been therefore devoted to analyze the degree of chemical ordering and to identify the order-disorder phase transitions

widom2014hybrid; gao2016high; murty2019high; eisenbach2019first. Due to expensive time costs in experimental research, computational simulations, typically first-principles calculations are playing an increasingly central role in the investigation of various properties of HEAs widom2018modeling; gao2017computational; ikeda2019ab.

First-principles density functional theory (DFT) methods have established as a powerful and reliable tool in computational material science and have enabled critical advancements in materials properties and performance discovery ye2016high; ma2015ab. With the increasing numerical efficiency and growing computing power (parallel and GPU computing), it is still difficult to address the challenge of DFT calculations in relatively large supercells (thousands of atoms) and intensive sampling (huge number of configurations) huang2017construction. To characterize the order-disorder phase transition, a straightforward way is to combine the DFT method with Monte Carlo simulations. However, this “brute-force” method is so computationally intensive that it is often impractical, even for a simple example, 250-atom CuZn alloy khan2016density. Consequently, it is quite necessary to establish an approximate configurational energy model fitted to DFT data and feed this accurate enough and efficient “surrogate” model into Monte Carlo simulations for modeling thermodynamics and order-disorder phase transitions.

A typical and effective strategy is cluster expansion (CE) method kikuchi1951theory; sanchez1984generalized; van2002automating, which uses discrete sum representation of material properties, for example, configurational energies, in terms of lattice site configuration and site effective cluster interactions (ECIs), such as site pairs, triplets, quadruplets and the other high order interactions. The fundamental challenge in constructing CE model is to determine the ECIs in an efficient and robust way through a fit to reference configuration. Commonly used fitting algorithm is to minimize the overall difference between the CE fitted energy and DFT calculated energies with respect to different input configurations. Practically, CE has to be truncated and many advanced methods have been proposed to aid in the efficient and accurate performance of a truncated CE. These methods include as compressive sensingnelson2013compressive, Bayesian method mueller2009bayesian, cluster basis set selectionblum2005using; hart2005evolutionary; seko2009cluster, machine learning natarajan2018machine and regularization chang2019clease; aangqvist2019icet. Although the CE is effective Hamiltonian and can be potentially combined with Monte Carlo simulation to account for order-disorder phase transitions and chemical short-range order, its application to multicomponent systems is still intractable due to rapid increased combinatorial number of interatomic interactions concerning chemical elements jiang2016efficient; seko2009cluster. Therefore, fitting a CE for multicomponent alloys, i.e. HEAs, becomes extremely difficult widom2018modeling.

The recent development of machine learning presents exciting opportunities and challenges to various scientific fields jordan2015machine; lecun2015deep. Benefiting from advanced learning algorithms and large databases using high-throughput computations, machine learning has been widely applied to materials research and discoverybutler2018machine; mueller2016machine; sanchez2018inverse. Some examples of successful applications include discovering complex materials behaviorkim2016organized; raccuglia2016machine, accurate prediction of phase transitions and predictioncarrasquilla2017machine; huang2019machine; kostiuchenko2019impact, accelerated material design and prediction of material propertiesfujimura2013accelerated; pilania2013accelerating; ward2016general, modeling of various physical quantities, for instance, interatomic potentials dragoni2018achieving; chmiela2017machine; deringer2017machine and atomic forceschmiela2017machine; li2015molecular. Compared to successful applications in other fields, few studies have been conducted in the context of machine learning for the modeling of thermodynamics of HEAs. This is because the inherent challenges originated from the extremely large configuration space associated with the multicomponent alloys. In many cases, only small dataset is drawn from expensive DFT calculations due to limited computational cost, it therefore gives rise to the issue of uncertainty quantification in model inference and certified predictions aldegunde2016quantifying; kristensen2014bayesian; zhang2018quantification. The learned model also faces additional challenges to capture the underlying physics with important features and cover the overall configurational space in an accurate and robust scheme liu2019machine.

In this work, we develop an efficient and robust Bayesian framework by fitting an accurate, feature-selected efficient Hamiltonian which is employed in subsequent Monte Carlo simulations for modeling the thermodynamics and order-disorder phase transitions. Bayesian regularized regression is employed to deal with the unstable prediction due to sparse data, on the other hand, to effectively quantify the uncertainties associated with the EPI parameters. To investigate the impact of model complexity, we conduct physical feature selection using Bayesian information criterion that allows for effective truncation of the coordination shells given a specific dataset. We demonstrate the accuracy and reliability of prediction with feature selection are significantly higher than the prediction with an arbitrary truncation, specifically when data are limited. This first section of this paper presents a brief overview of classical cluster expansion and the proposed effective pair interaction model liu2019machine. Then we propose a robust data-driven framework which consists of Bayesian regularized regression and Bayesian feature selection that enables to effectively reduce the model complexity that is very difficult to conventional CE model in HEAs. In the second section, we apply the proposed robust algorithm to three refractory HEAs, i.e. NbMoTaW, NbMoTaWV and NbMoTaWTi. The ensemble random sampling performs well in predicting of configurational energy compared with the approach that uses only single supercell. Due to the limitation of dataset size, we systematically present the uncertainty quantification and correlation analysis of model parameters in terms of the Bayesian framework. Moreover, the effect of physical feature selection is carefully investigated for these three HEAs. Finally, the conclusions of the current work are summarized.

2 Theory and algorithm

2.1 Re-visited cluster expansion method

cluster expansion (CE) method is widely used approach for thermodynamic simulation of binary alloys due to its versatility and simplicity. Specifically, a binary alloy of

sites can be represented as a vector of occupation configurations,

where ‘spin’ variable takes a value of or depending on the occupant (atom X or Y) of site . A property of this binary crystal that depends on can be formulated as a polynomial expansion in terms of occupation configurations


where the expansion coefficients are called the effective cluster interactions (ECIs) which are independent of the configurations and often determined by the crystal structure and chemistry of the binary alloy, is a constant that represents the empty cluster and is the -site cluster function, defined as the product of basis function , which is given by:


Note that the cluster functions in Eq. (2) form a complete orthonormal basis on the configuration space . When all possible cluster functions are considered in CE model, Eq. (1) is an exact expression. However, a more practical way for CE is a truncated summation over finite number of cluster functions considering the many sites that are far apart are usually negligible. Typically, the energy is primarily determined by short-range interactions, it is therefore natural to represent the energy as a summation of interactions whose strength diminished with increasing range, which is given by:


where the , and represent the interaction strength of point clusters, pair clusters and triplet clusters, and can be determined by the DFT calculated total energies of different configurations in a variety of supercells. One can therefore utilize CE as an efficient Hamiltonian in Monte Carlo simulation to reveal order-disorder phase transitions.

However, it is often a challenging task to fit the ECIs of CE in multicomponent crystalline solids because of the number of terms scales as for an -body terms for species. The series need many terms and the number of terms grows rapidly with the diameter of the cluster. As additional terms are added, the series coefficients may converge poorly given limited number of configurations.

2.2 Effective pair interaction model

Conventional CE method is difficult when applied to HEAs, we herein propose to use an Ising-like model with only effective pair interactions (EPIs) without considering high-order interactions. Fig. 1 shows the prototype square lattice with effective pair interactions. In terms of the pair distance, we define a series of short-range pair interactions, for example, the nearest-neighbor, the next nearest-neighbor and so on. The orbit of a specific pair centered around site consists of all the equivalent pair interactions. Therefore, the effective Hamiltonian at lattice site can be expressed as:


where is the interatomic pair potential between element and , is referred to as element at site , is the number of coordination shell separating between and , is the occupation parameter, and is the concentration dependent part, which can be discarded for a given composition. Summing up the Hamiltonian over all atomic sites yields the total energy, which is given by:


where is the percentage of bonds in the -th coordination shell. Considering an -component alloy system, the total number of different chemistry bonds in -th shell is but there are constraints from the concentration of each element for a fixed chemical composition, for example, - and -. As a result, the number of independent variables in an -component alloy system is


which consists of the nearest-neighbor short-range order (SRO) parameters that exist at -th shell for an -component alloy. The Warren-Cowley SRO parameters is defined as


where is the concentration of element , and

is the probability of finding element

at the -th neighbor shell of element . is a critical parameter to characterize the different chemical configurations. means the preference of form bonds at the -th shell, indicates the opposites and for each suggests to a completely random system. In fact, there is an effective pair interaction (EPI) corresponding to each SRO parameter. Consequently, Eq. (5) can be further written by


where is closely related to the SRO parameter in Eq. (7). For example, for the four-component Nb-Mo-Ta-W refractory HEAs, there are total ten different bonds for each coordination shell but only six independent bonds, including Nb-Mo, Nb-Ta, Nb-W, Mo-Ta, Mo-W and Ta-W. Given a specific configuration of multicomponent HEAs Nb-Mo-Ta-W, it is not difficult to calculate the , , , , and at the -th neighbor shell. The corresponding interatomic pair coefficients , , , , and at the -th neighbor shell in Eq.(8

) can be determined by linear regression using

as the featuresliu2019machine. The cost of building an EPI model comes primarily from the cost of generating the training dataset, which is usually by DFT to which the EPIs are fitted. It therefore gives rise to two critical questions: 1) how to conduct an accurate and robust prediction that minimizes the error of energy for a given training dataset and 2) how to determine the number of physical feature when data are sparse. These would correspond to cluster selection in which the optimal set of clusters is selected for inclusion in the expansion in a robust way that minimizes the expected prediction error.

Figure 1: Square lattice with effective pair interaction highlighted. (a) the nearest-neighbor pair is marked in blue, while the next nearest-neighbor pair is marked in yellow; (b) the pair marked in green, pink and red correspond to the 3rd, 4th and 5th neighbor respectively. Equivalent interacted pairs (same distance) are marked in the same color. 

2.3 Bayesian regularized regression

To obtain a EPI for a specific HEAs one must determine the interatomic pair potential in Eq. (8), which can be cast in a matrix form,


where are independent and identically distributed (i.i.d.) variables that follow . is a matrix containing the probability quantities of the training data where each element in row at the -th shell is defined as


where is the number of training data (configurations). It is necessary to note that there are column vectors for each shell and is therefore a matrix. is a column vector in which the th element in the physical quantity (for example, total energy) of the configuration and is a column vector in which -th shell is which also includes elements. Determining the EPIs by solving the linear system given by Eq. (10), it is equivalent to finding the parameter vector , which minimizes the residual sum of squared errors (RSS)

using ordinary least squares (OLS) method. Typically, OLS method often has low bias but larger variance. A solution can be determined by OLS method, which performs well in the overdetermined system. Due to the large computational cost in DFT calculations, the linear system in Eq. (

10) is, however, often underdetermined and therefore leads to an ill-posed problem. Another drawback associated with OLS method is the susceptibility to possible overfitting chang2019clease, which refers to that the EPIs values are over-tuned to predict physical quantity in training dataset but losing the predictability for the new configurations that are “unseen” before. Meanwhile, the nearsightedness of physical interaction in CE suggests sparse property for nelson2013compressive.

Regularization is an effective way to combat overfitting and achieve sparse solutions by adding a regularization term in the form of or norm. For regularization, the optimal EPI values can be solved by


where is a penalty parameter that determines the amount of regularization. The primary benefit of regularization is its promotion of sparsity, which is achieved by feature selection with a set of EPI values set to zero. However, in principle, this shrinkage is performed only based on the correlation of feature but ignoring the inherent physical interactions such that sometimes it incorrectly forces the EPI parameters to zero, consequently leads to a unstable prediction, specifically under the case of sparse training dataset. Instead, this study prefers to use the penalty for both fitting and penalization of the EPI coefficients. Thus, the solution becomes


which is the most popular technique for improving prediction accuracy by shrinking large regression coefficients to reduce overfitting. Unlike regularization, regularization tends to contain all physical interaction information by only shrinking the size of EPI coefficients rather than set most of them to zero. It therefore gives rise to challenges in optimally determining the regularization parameter and physically identifying the important features. Moreover, it is critical to quantify the uncertainties associated with the prediction and EPI coefficients, particularly given lack of data.

In this paper, we propose a Bayesian view of regression with feature selection to address these challenges. Bayesian regression assumes the parameters and in Eq. (9

) to be the random variables, therefore the likelihood function can be written as:


Bayesian regression can be also used to take regularization into consideration in the estimation procedure. Instead of identifying the optimal in a hard sense, Bayesian regression treats the regularization parameter as a random variable that can be estimated via the training data. This can be achieved by introducing hierarchical model with hyper-parameters of the model. In the Bayesian setting, the target total energy

is assumed to be Gaussian distribution, which is given by:


and the prior for the EPI coefficient is given by a Gaussian distribution


Consequently, the regularization in Eq. (12) is equivalent to finding a maximum a posterior (MAP) estimation zhang2019efficient given a Gaussian prior over with precision

. Typically, a MAP estimation of the posterior distribution is obtained by Markov Chain Monte Carlo (MCMC) algorithm, which is often computationally intensive and difficult to converge for high dimensional problem. In this work, we consider a so-called conjugate prior for which the posterior distribution can be derived analytically and thus be more efficient. To this end, the priors over


are selected to be gamma distribution


where and

are the hyperparameters of the gamma priors over

and . We here select to be non-informative priors. All three random variables , and are estimated jointly using maximum likelihood estimate during the fit of the regression model. Note that Bayesian regularized regression performs more robust to ill-posed problems.

2.4 Bayesian feature selection

To construct the effective Hamiltonian in HEAs, EPI model is employed to identify the coordination shells as the essential physical features. For each shell, there are independent sub-features that is determined by the number of element species, for instance, for four-component Nb-Mo-Ta-W HEAs and for five-component Nb-Mo-Ta-W-V HEAs. Liu and Zhang liu2019machine have shown that the first two shells associated with nearest neighbor pair and the next nearest neighbor pair have a more significant impact on the accuracy of prediction but long-range pair interactions with a larger perform weak influence. In practice, the truncation of coordination shells , needs to be carefully examined and determined for each specific material. In this work, Bayesian model selection method is applied to identify a “better” model complexity among a finite set of candidate EPI models.

Bayesian information criterion (BIC) is widely used for feature selection and measure the efficiency of the parameterized model in terms of predicting the data, which is defined as


where is the maximized value of the model likelihood function, i.e. , where are the parameter values that maximize the likelihood function. is the number of observed data , is the number of parameters (features) of the model . BIC is derived by an efficient approximation using Laplace’s approach to approximate the evidence

, which is defined by Bayesian inference



BIC can be extended for linear regression under the assumption that the model errors are i.i.d. random variables that follow Gaussian distribution . The likelihood of can be written as


Note that is the RSS of . Taking the derivative of with respect to and equate to zero yields the maximized value of and the corresponding log of is . As a result, the BIC in terms of RSS is given by


This can be used to identify an appropriate number of physical feature, the coordination shell in EPI model, according to the intrinsic complexity present in a particular dataset. When selecting from a set of candidate models with various number of features, the one with lowest value is preferred.

2.5 Robust data-driven algorithm procedure

With the constituents outlined in the previous sections, the proposed methodology is summarized here and a flowchart is provided in Fig. 2.

  • Step 1: Data collection - An ensemble sampling strategy is used to combine the DFT data calculated with different sizes of the supercell. This benefits from incorporating different long-range order and short-range order dataset.

  • Step 2: Feature identification - Given a set of random configurations of -component alloy, the first step is to determine the independent pair interactions according to the EPI model. Then the probabilities , as physical features, defined in Eq. (7), are carefully calculated for the -th coordination shell.

  • Step 3: Feature selection - To achieve an accurate prediction and reduce overfitting, feature selection using BIC in Eq. (20) is employed here to determine the truncated number of coordination shells for a specific HEAs. Note that each shell also includes sub-features, which are either fully retained or truncated as a whole in the -th shell.

  • Step 4: Bayesian regularized regression - Bayesian regression with regularization performs a robust prediction for the configurational energy given a specific configuration . Under the assumption of Gaussian distribution with conjugate prior, the uncertainty of the EPIs parameters are efficiently quantified, particularly given limited DFT data due to prohibitive computational cost.

  • Step 5: Error evaluation and model update - The performance of predicted model can be assessed by k-fold cross validation (k=5) with the root-mean-square error (RMSE) metric , which is defined as:


    where is the true value of energy by DFT and is the predicted energy using the proposed methodology. If the RMSE metric is smaller than a specific threshold , for example, = 1 meV, the predictive accuracy is acceptable, otherwise additional DFT data are required and back to Step 1 to update the algorithm and further improve the performance.

  • Step 6: Monte Carlo calculations of thermodynamics - If the prediction is accurate and stable enough, we can feed this efficient fitted model as a surrogate into Monte Carlo simulation for modeling thermodynamics and order-disorder phase transitions. But this step is not the focus of this paper and the interested reader can find more discussions in the recent review literature widom2018modeling; eisenbach2019first

Figure 2: Flowchart of robust data-driven algorithm using Bayesian framework

3 Results and discussions

3.1 Prediction of configurational energy

In this work, we systematically investigate three HEAs, including NbMoTaW, NbMoTaWV and NbMoTaWTi. The locally self-consistent multiple scattering (LSMS) method wang1995order is used here for the calculation of total energy, with supercell of 16, 32, 64 and 128 for NbMoTaW and 20, 40, 80 and 160 for NbMoTaWV and NbMoTaWTi respectively. Fig. 3 shows the bcc supercell lattice of NbMoTaW (128 atoms), NbMoTaWV and NbMoTaWTi (160 atoms). For each supercell size, 200 configurations are randomly drawn and the corresponding energy are calculated by DFT method. Three smaller supercells with a total of 600 data are selected as the training dataset for Bayesian regularized regression and the largest supercell with 200 data are chosen for testing purpose. Six coordination shells in EPI model is chosen for this case. Fig. 4 shows a comparison of predicted energy with DFT calculated energy for three HEAs and the corresponding training and testing RMSE are illustrated in Table 1. For these three HEAs, the testing RMSEs 0.6 meV show that the learned model is accurate and robust for a system described by a relatively large supercell size.

Figure 3: Bcc supercells of refractory HEAs. (a) NbMoTaW with 128 atoms, (b) NbMoTaWV with 160 atoms and (c) NbMoTaWTi with 160 atoms
Figure 4: Comparison of DFT calculated energy with predicted energy using Bayesian regularized regression for (a) NbMoTaW, (b) NbMoTaWV and (c) NbMoTaWTi
Training (meV) 0.335 0.710 1.400
Testing (meV) 0.632 0.647 0.665
Table 1: Training and testing RMSE accuracy of configurational energy for three HEAs

Typically, with the increasing of supercell size, the configurational systems tend to transit from ordered to disordered state. Due to the periodic boundary condition, the configurations drawn from smaller supercells often include different long-range order, while the samples obtained with larger supercells contain a various degree of short-range order. As a result, it is highly possible that a random system with large supercell is not well represented by only using samples generated from small supercells, which is commonly used because of its efficiency but may result in loss of physical information at the thermodynamic limit. To conduct a robust data-driven approach, we adopt an ensemble random sampling strategy that combines the data from different supercells. This simple yet efficient technique aims to obtain a training dataset with different degrees of order and disorder such that the data are more representative.

Figure 5: Testing performance comparison between ensemble sampling strategy and sampling drawn from only single supercell. Blue, orange and green bars: testing results using 150 data only from one specific supercell (16, 32, 64 for NbMoTaW and 20, 40, 80 for NbMoTaWV and NbMoTaWTi respectively). Red bar: testing results using an ensemble of 150 data that consists of three 50 data drawn from each supercell.

The benefit of ensemble sampling strategy can be seen in Fig. 5. All 200 data from relatively large supercell (128 for NbMoTaW and 160 for NbMoTaWV and NbMoTaWTi) are selected as the testing dataset. Total 150 data randomly drawn from each smaller supercells (16, 32, 64 for NbMoTaW and 20, 40, 80 for NbMoTaWV and NbMoTaWTi) are selected as the training dataset. The ensemble sampling strategy performs an ensemble of dataset which is achieved by randomly drawn 50 data using Latin hypercube sampling shields2016generalization

from each of three supercells. In terms of these four cases, 100 random trials are carried out to estimate the standard deviation (error bar in Fig.

5) of the testing RMSE results. It can be easily seen that the mean and standard deviation of RMSE results are significantly large when only using the smallest supercell. The results underscore a fact that the configuration space in a random system is not well covered by training data only drawn from small supercells. The performance of 32 (40) and 64 (80) are better than that of 16 (20)-atom supercell, showing that more degrees of short-range and long-range order are captured by these training data. The ensemble sampling strategy (red bars) showing a minimal RMSE mean (1 meV) and standard deviation beat the other three cases in terms of the accuracy and stability. This robust strategy plays a substantial role in the data-driven modeling of the configurational energy such that the subsequent Monte Carlo simulation based on this efficient Hamiltonian can safely explore the whole regions of configuration space.

3.2 Uncertainty in effective pair interaction bonds

The coordination shells, as the physical features, have a pivotal influence on the EPI model and their impact can be analyzed from the EPI parameters, as shown in Fig. 6 - Fig. 8. For all three refractory HEAs, the first two shells, involving the nearest and next-nearest neighbor interactions are dominant, while the 3rd to 6th shells present a less essential role. A comparison between NbMoTaW and the other two HEAs, NbMoTaWV and NbMoTaWTi shows that the EPIs of NbMoTaW, as shown in Fig. 6 (c), is relatively stable and short-ranged interacted due to small magnitude associated with the long-ranged shells, while the NbMoTaWV and NbMoTaWTi, as shown in Fig. 7 (c) and Fig. 8 (c) are more frustrated and long-ranged, with certain contribution from up to the 6th shell.

Figure 6: Effective pair interaction (EPI) bonds and their uncertainties that are quantified by the variance-covariance matrix given different sizes of training dataset for NbMoTaW

Moreover, three different dataset sizes, and 800 are shown here to investigate the effect of data on the EPI parameters (chemical bonds). The ensemble sampling strategy is used herein such that we collect 25, 100 and 200 data from each of four supercell sizes, i.e. 16, 32, 64 and 128 for NbMoTaW and 20, 40, 80 and 160 for NbMoTaWV and NbMoTaWTi. Given small dataset size, for example, , the trend of EPI bonds is consistent - the first two shells are dominant, but the values still have a discrepancy from larger dataset size, for example, . In other words, the uncertainties associated with the EPI bonds are primarily caused by a lack of data. This is reflected in the variance-covariance matrix of EPI bonds, as shown in Fig. 6 - Fig. 8 (a). It is easily seen that the variance in 1st shells is the largest, followed by the 2nd shell and the 6th shell, which are larger than the other shells. The covariance values trending to zero demonstrates that there is no strong correlation among each of shells, which is agreed with the independent assumption.

Figure 7: Effective pair interaction (EPI) bonds and their uncertainties that are quantified by the variance-covariance matrix given different sizes of training dataset for NbMoTaWV
Figure 8: Effective pair interaction (EPI) bonds and their uncertainties that are quantified by the variance-covariance matrix given different sizes of training dataset for NbMoTaTi

As the data set size increases, from to , the EPI bonds become more stable and almost same as the case of . The corresponding variance is also significantly reduced by more than one order of magnitude, as shown in Fig. 6 - Fig. 8 (b). Furthermore, the variance-covariance matrix presents an increasingly clear “pattern” with the increasing of dataset size and we therefore have some observations: a) it is clear to see that the pattern of variance-covariance matrix is divided by the identified physical feature (coordination shell); b) the EPI bonds in the nearest-neighbor shell shows a stronger positive correlation, while the 6th shell illustrates a relatively large uncertainty with a negative correlation among the EPI bonds; c) the EPI bonds between the 1st shell and 4th shell have a negative correlation. Even though most of the EPI bonds are physically independent as we expected, the EPI bonds in a specific shell are not completely independent, particularly the nearest-neighbor shell, and there is still slightly either positive or negative correlation between different shells.

3.3 Effect of physical feature selection

Due to lack of data, the uncertainties of the EPI bonds have been clearly quantified via Bayesian regularized regression as discussed above and these uncertainties will be propagated to the prediction of configurational energy and finally affect the thermodynamic quantities. It therefore gives rise to a critical issue that is how to systemically reduce the uncertainty and improve the robustness and reliability of prediction when only limited data are collected. To deal with this challenge, we introduce Bayesian information criterion (BIC) to conduct a physical feature selection and investigate the effect of coordination shells number on the prediction accuracy. Fig. 9 - Fig. 11 show the BIC values for the different numbers of shells given various sizes of the training dataset. Using ensemble sampling strategy, we randomly collect data (from 20 data to 200 data) from each of four supercells by 100 random trials and then estimate the mean and standard deviation of BIC values for each specific number of shells. For NbTaMoW in Fig. 9 (a), corresponding to the smallest BIC value represents the “best” number of shells given only 20 data, while a larger number of leads to the issue of overfitting. As more data are collected into the model, the best number of shells gradually increases and converges towards . When relatively large data () is available, the mean of BIC values with are quite close and the variation of BIC in each is much smaller than the case of small dataset size. NbTaMoWV as shown in Fig. 10 has a similar overall trend with NbMoTaW but is identified for the case of . NbTaMoWTi also tends a smaller number of shells if only limited data is provided but it displays an early converged number of shells other than NbTaMoW and NbTaMoWV that are .

Figure 9: Bayesian information criterion results for different number of shells in NbMoTaW given a specific training dataset size, including (a) 20 data, (b) 30 data, (c) 40 data, (d) 50 data, (e) 75 data and (f) 150 data
Figure 10: Bayesian information criterion results for different number of shells in NbMoTaWV given a specific training dataset size, including (a) 20 data, (b) 30 data, (c) 40 data, (d) 50 data, (e) 75 data and (f) 150 data
Figure 11: Bayesian information criterion results for different number of shells in NbMoTaWTi given a specific training dataset size, including (a) 20 data, (b) 30 data, (c) 40 data, (d) 50 data, (e) 75 data and (f) 150 data

Table 2 provides a complete list of best number of shells given a specific dataset size (from 20 to 200 of each supercell) for these three refractory HEAs. If we divide the data into three categories, such as small size (total number of data ), medium size () and large size (), we can briefly conclude as follows:

  • Small size: select a small number of shells or avoid overfitting

  • Medium size: identify or as a reliable choice avoid underfitting

  • Large size: determine the reasonable large number of feature due to the bias-variance tradeoff mehta2019high

Each supercell NbMoTaW NbMoTaWV NbMoTaWTi
20 2 3 2
25 2 3 3
30 5 5 3
35 5 5 5
40 5 5 5
45 5 5 6
50 6 5 6
55 6 5 6
60 6 5 6
65 6 5 6
70 9 5 6
75 9 6 6
100 9 9 6
125 9 9 6
150 9 9 6
175 9 9 6
200 9 9 6
Table 2: The best number of coordination shells given a specific size of training dataset

Next, let’s turn our attention to investigate the effect of feature selection on the testing RMSE performance. As shown in Fig. 12, we compare the RMSE results with three numbers of shells that include , and where is referred to as the best number of shells. It is easy to observe that the RMSE of ( meV), ( meV) are significantly larger than the results using the best number of shells that is only approximate meV given training data. The RMSE mean value of and is reduced to approximate the best shells as dataset size increases (see Fig. 12 (a)) but the variations (standard deviation) are still substantially large. When a relatively large data is collected, as shown in Fig. 12 (b) that has different y-axis scale from Fig. 12 (a)), the RMSE of and gradually converge to a small value, meV, while converges to meV and can not be further decreased. In other words, the shorter cutoff (for example ) for the physical feature does not fully capture all the physical information and underestimates the system complexity of HEAs.

Figure 12: RMSE results of NbMoTaW with three selections of coordination shells number given different sizes of training dataset (a) and (b)

Fig. 13 and Fig. 14 show the RMSE results for NbMoTaWV and NbMoTaWTi respectively. Given a small size of data (), the RMSE of and are nearly more than meV, which may lead to a large bias for Monte Carlo simulation of thermodynamics. Nevertheless, the best shells with the same data show a relatively small ( meV) and reliable RMSE estimate. In the medium size of data, the RMSE of and are reduced but still larger than that of the best shells, which can be observed from Fig. 13 (a) and Fig. 14(a). When relatively large data is considered, and in NbMoTaWTi eventually converge towards the minimal value as the best shells, while in NbMoTaWV still shows slight underfitting issue because the best number identified by BIC is greater than the arbitrary choice of . Through careful analysis and comparison of these three HEAs, we found that the model with the best number of shells identified by feature selection demonstrates a highly accurate and robust performance on either small or relatively large data. Rather than an arbitrary selection of physical feature, a reliable feature selection can effectively reduce the risk of underfitting and overfitting during the prediction.

Figure 13: RMSE results of NbMoTaWV with three selections of coordination shells number given different sizes of training dataset (a) and (b)
Figure 14: RMSE results of NbMoTaWTi with three selections of coordination shells number given different sizes of training dataset (a) and (b)

4 Conclusions

In this work, we develop a systematically robust Bayesian framework to discover efficient and accurate modeling of configurational energy from the data-driven perspective. A short-range pair interaction model is used here to characterize the physical feature and well-suited to deal with a large number of components inherent with the multicomponent systems. Bayesian regression algorithm is employed to establish an efficient Hamiltonian through a set of random configurations with the corresponding energy calculated by DFT method and to quantify the uncertainties and correlations of effective pair interaction parameters. To improve the accuracy and reliability of prediction, we further perform Bayesian feature selection for dealing with the truncation of the model, specifically given lack of data. All three HEAs, NbTaMoW, NbTaMoWV and NbTaMoWTi have demonstrated a highly accurate and robust performance in predicting the configurational energy. The proposed method is therefore a powerful tool for studying the thermodynamics and order-disorder phase transitions through the subsequent Monte Carlo simulations.

Specifically, we find that a small and single supercell is unable to well explore the various order and disorder in multicomponent systems. We therefore propose an ensemble sampling strategy which naturally incorporates chemical configurations of different short-range and long-range order, thus performs a well-suited sampling capability of the huge configuration space so that it has demonstrated to be a simple yet robust scheme to enhance the representativity of data. Using this strategy, the resulted RMSE of three HEAs show a very small mean value () with a tiny standard deviation that is significantly lower than the other cases that use only single supercell.

Also, we note that the chemical bonds in the EPI model show a frustrating behavior if the dataset size is too small (). This can also be observed from the uncertainty quantification of EPI bonds, for instance, the nearest-neighbor shell displays a high level. However, as more data are obtained, for example, from to , the bonds tend to be stable and the uncertainties are reduced. Moreover, we find a clear pattern from the variance-covariance matrix that characterizes each coordination shell as the individual physical feature consisting of several sub-features that depends on the number of component species. We also notice a certain degree of correlation within the nearest-neighbor shell even though they are essentially assumed to be independent.

Finally, the impact of feature selection and the effect of dataset size are carefully discussed. For each material, the best truncated number of coordination shell is slightly different given a specific dataset but they demonstrate a similar trend. We therefore provide a general suggestion according to the dataset size: a) for small size (), for medium size () and for relatively large size (). Using this feature selection scheme, we have demonstrated an accurate and robust performance for all three HEAs without concerning the issue of underfitting or overfitting that is often happened in machine learning modeling.

5 Acknowledgements

This work of J. Z. was supported by the Laboratory Directed Research and Development Program of Oak Ridge National Laboratory. X. L. and M. E. were supported by the U.S. Department of Energy, Office of Science, Basic Energy Sciences, Materials Science and Engineering Division. This research used resources of the Oak Ridge Leadership Computing Facility, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.