Hierarchical Visualization of Materials Space with Graph Convolutional Neural Networks

07/09/2018 ∙ by Tian Xie, et al. ∙ 0

The combination of high-throughput computation and machine learning has led to a new paradigm in materials design by allowing for the direct screening of vast portions of structural, chemical, and property space. The use of these powerful techniques leads to the generation of enormous amounts of data, which in turn calls for new techniques to efficiently explore and visualize the materials space to help identify underlying patterns. In this work, we develop a unified framework to hierarchically visualize the compositional and structural similarities between materials in an arbitrary material space. We demonstrate the potential for such a visualization approach by showing that patterns emerge automatically that reflect similarities at different scales in three representative classes of materials: perovskites, elemental boron, and general inorganic crystals, covering material spaces of different compositions, structures, and both. For perovskites, elemental similarities are learned that reflect multiple aspects of atom properties. For elemental boron, structural motifs emerge automatically showing characteristic boron local environments. For inorganic crystals, the similarity and stability of local coordination environments are shown combining different center and neighbor atoms. The method could help transition to a data-centered exploration of materials space in automated materials design.




I Introduction

Efficient exploration of the materials space has been central to material discovery as a result of the limited experimental and computational resources compared with its vast size. Often compositional or structural patterns are sought from past experiences that might guide the design of new materials, improving the efficiency of material exploration Niu, Guo, and Wang (2015); Snaith (2013); Xu et al. (2013); Butler et al. (2013); Madelung (1964). Emerging high-throughput computation and machine learning techniques directly screen large numbers of candidate materials for specific applications Greeley et al. (2006); Senkan (1998); Potyrailo et al. (2011); Curtarolo et al. (2013); Hautier et al. (2010); Gómez-Bombarelli et al. (2016); Faber et al. (2016); Rupp et al. (2012), which enables fast and direct exploration of the material space. However, the large quantities of material data generated make the discovery of patterns challenging with traditional, human-centered approaches. Instead, an automated, data-centered method to visualize and understand a given materials design phase space is needed in order to improve the efficiency of exploration.

The key in visualizing material space is to map materials with different compositions and structures into a lower dimensional manifold where the similarity between materials can be measured by their Euclidean distances. One major challenge in finding such manifolds is to develop a unified representation for different materials. A widely-used method is representing materials with feature vectors, where a set of descriptors are selected to represent each material Pilania et al. (2013); Meredig et al. (2014); Ward et al. (2016). There are also methods that automatically select descriptors that are best for predicting a desired target property Ghiringhelli et al. (2015). Recent work has also developed atomic-scale representations to map complex atom configurations into low dimensional manifolds, such as atom centered symmetry functions Behler (2011), social permutation invariant (SPRINT) coordinates Pietrucci and Andreoni (2011), global minimum of root-mean-square distance Sadeghi et al. (2013), smooth overlap of atomic positions (SOAP) Bartók, Kondor, and Csányi (2013), and many other methods Schütt et al. (2014); Faber et al. (2018); Glielmo, Zeni, and De Vita (2018). These representations often have physically meaningful parameters that can highlight some structural or chemical features. Often material descriptors and atomic representations are used together to combine compositional and structural information Ward et al. (2017); Faber et al. (2018). They have been used to visualize the material and molecular similarities Isayev et al. (2015); De et al. (2016); Musil et al. (2018), as well as explore the complex configurational space of biological systems Das et al. (2006); Ceriotti, Tribello, and Parrinello (2011); Spiwok and Králová (2011); Rohrdanz, Zheng, and Clementi (2013) and water structures Pietrucci and Martoňák (2015); Engel et al. (2018). In addition to Euclidean distances, similarity kernels are also used to compare material similarities De et al. (2016); Musil et al. (2018). Combined with machine learning algorithms, these representations were also used to predict material properties Ghiringhelli et al. (2015); Pilania et al. (2013); Meredig et al. (2014); Ward et al. (2016); Faber et al. (2016); Seko et al. (2015); Isayev et al. (2017); Schütt et al. (2014) and construct force fields Behler and Parrinello (2007); Bartók, Kondor, and Csányi (2013); Botu et al. (2016).

In parallel to these efforts, the success of “deep learning” has inspired a group of representations purely based on neural networks. Instead of designing descriptors or atomic representations that are fixed or contain several physically meaningful parameters, they use relatively general neural network architectures with a large number of trainable weights to learn a representation directly. This field started with building neural networks on molecular graphs Duvenaud et al. (2015); Kearnes et al. (2016); Gilmer et al. (2017); Schütt et al. (2017), and was recently expanded to periodic material systems by us Xie and Grossman (2018) and Schütt et al. Schütt et al. (2018). It has been shown that given large amounts of data, these methods can outperform many other representations on the task of predicting molecular properties Wu et al. (2018). However, the general neural network architecture may also limit performance when the data size is small since there is no material specific information built-in. It is worth noting that many machine learning force fields combine atomic representations and neural networks Blank et al. (1995); Behler and Parrinello (2007); Behler (2011), but they usually deal with different compositions separately and use a significantly smaller number of weights. It has been shown that the hidden layers of these neural networks can learn physically meaningful representations by proper design of the network architecture. For instance, several works have investigated the ideas of learning atom energies Xie and Grossman (2018); Deringer, Pickard, and Csányi (2018); Schütt et al. (2017) and elemental similarities Schmidt et al. (2017); Zhou et al. (2018). In addition, recent work showed that element similarities can also be learned using a specially designed SOAP kernel Willatt, Musil, and Ceriotti (2018).

In this work, we aim to develop a unified framework to hierarchically visualize the compositional and structural similarities between materials in an arbitrary material space with representations learned from different layers of the neural networks. The network is based on a variant of our previously developed crystal graph convolutional neural networks (CGCNN) framework Xie and Grossman (2018), but it is designed to focus on presenting the similarities between materials at different scales, including elemental similarities, local environment similarities, and local energies. We apply this approach to visualize three material spaces: perovskites, elemental boron, and general inorganic crystals, covering material spaces of different compositions, different structures, and both, respectively. We show that in all three cases patterns emerge automatically that might aid in the design of new materials.

II Methods

Figure 1: The structure of the crystal graph convolutional neural networks.

To visualize the crystal space at different scales, we design a variant of CGCNN Xie and Grossman (2018) that has meaningful interpretation at different layers of the neural network. The learned CGCNN network provides a vector representation of the local environments in each crystal that only depends on its composition and structure without any human designed features, enabling us to explore the materials space hierarchically.

We first represent the crystal structure with a multigraph $\mathcal{G}$ that encodes the connectivity of atoms in the crystal. Each atom is represented by a node $i$ in $\mathcal{G}$, which stores a vector $\mathbf{v}_i$ corresponding to the element type of the atom. To avoid introducing any human bias, we set $\mathbf{v}_i$ to be a random 64 dimensional vector for each element and allow it to evolve during the training process. Then, we search for the 12 nearest neighbors for each atom and introduce an edge $(i,j)_k$ between the center node $i$ and neighbor $j$. The subscript $k$ indicates that there can be multiple edges between the same end nodes as a result of the periodicity of the crystal. The edge $(i,j)_k$ stores a vector $\mathbf{u}_{(i,j)_k}$ whose $m$th element depends on the distance $d_{(i,j)_k}$ between $i$ and $j$ by,

$$\mathbf{u}_{(i,j)_k}[m] = \exp\left(-\frac{(d_{(i,j)_k} - \mu_m)^2}{\sigma^2}\right),$$

where $\mu_m = m \cdot 0.2$ Å for $m = 0, 1, \ldots, M$ and $\sigma = 0.2$ Å.
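The distance encoding on each edge can be sketched as a Gaussian expansion on a regular grid of centers. The snippet below is a minimal illustration rather than the authors' implementation: the function name `gaussian_expand` and the number of basis functions `n_basis` are assumptions made here, while the 0.2 Å spacing and width are used as defaults.

```python
import numpy as np

def gaussian_expand(distances, n_basis=41, step=0.2, width=0.2):
    """Expand interatomic distances (in Å) on a grid of Gaussian basis functions.

    n_basis is an assumed value; step and width (in Å) set the spacing of the
    Gaussian centers and their common width.
    """
    distances = np.asarray(distances, dtype=float)
    centers = step * np.arange(n_basis)          # centers at 0, step, 2*step, ...
    diff = distances[..., None] - centers        # broadcast to (..., n_basis)
    return np.exp(-(diff ** 2) / width ** 2)

# Two hypothetical neighbor distances: each row is one edge vector.
u = gaussian_expand([0.4, 1.6])
print(u.shape)
```

Each row peaks at the grid center closest to the input distance, so nearby distances produce similar edge vectors while the network still receives a fixed-length input.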

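In graph $\mathcal{G}$, each atom $i$ is initialized by a vector $\mathbf{v}_i$ whose value solely depends on the element type of atom $i$. We call this iteration 0 where

$$\mathbf{v}_i^{(0)} = \mathbf{v}_i.$$

Then, we perform convolution operations on the multigraph $\mathcal{G}$ with the convolution function designed in Ref. Xie and Grossman (2018), which allows atom $i$ to interact with its neighbors iteratively. In iteration $t$, we first concatenate neighbor vectors $\mathbf{z}_{(i,j)_k}^{(t-1)} = \mathbf{v}_i^{(t-1)} \oplus \mathbf{v}_j^{(t-1)} \oplus \mathbf{u}_{(i,j)_k}$, where $\mathbf{u}_{(i,j)_k}$ is the vector stored on edge $(i,j)_k$, and then perform the convolution by,

$$\mathbf{v}_i^{(t)} = \mathbf{v}_i^{(t-1)} + \sum_{j,k} \sigma\left(\mathbf{z}_{(i,j)_k}^{(t-1)} \mathbf{W}_f^{(t-1)} + \mathbf{b}_f^{(t-1)}\right) \odot g\left(\mathbf{z}_{(i,j)_k}^{(t-1)} \mathbf{W}_s^{(t-1)} + \mathbf{b}_s^{(t-1)}\right),$$

where $\odot$ denotes element-wise multiplication, $\sigma(\cdot)$ denotes a sigmoid function, $g(\cdot)$ denotes any non-linear activation function, and $\mathbf{W}$ and $\mathbf{b}$ denote weights and biases in the neural network, respectively. During these convolution operations, $\mathbf{v}_i^{(t)}$ forms a series of representations of the local environments of atom $i$ at different scales.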
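A single convolution iteration of this gated form can be sketched with NumPy as follows. This is an illustrative sketch under stated assumptions, not the reference CGCNN code: the array layout (a fixed number of neighbors per atom given by an index table), the choice of softplus for the non-linear activation, and all function and argument names are assumptions made here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softplus(x):
    # One possible choice for the generic non-linear activation (an assumption).
    return np.logaddexp(0.0, x)

def conv_step(v, nbr_idx, u, W_f, b_f, W_s, b_s):
    """One gated graph-convolution update of all atom vectors.

    v:        (N, F) atom vectors from the previous iteration
    nbr_idx:  (N, M) index of each neighbor; the same atom may appear more
              than once for one center because of periodic images
    u:        (N, M, E) edge vectors
    W_f, W_s: (2F + E, F) weights; b_f, b_s: (F,) biases
    """
    n_atoms, n_nbrs = nbr_idx.shape
    v_center = np.broadcast_to(v[:, None, :], (n_atoms, n_nbrs, v.shape[1]))
    v_nbr = v[nbr_idx]                                   # (N, M, F)
    z = np.concatenate([v_center, v_nbr, u], axis=2)     # (N, M, 2F + E)
    gate = sigmoid(z @ W_f + b_f)                        # values in (0, 1)
    core = softplus(z @ W_s + b_s)
    return v + np.sum(gate * core, axis=1)               # residual update
```

The sigmoid branch acts as a learned weight on each neighbor contribution, and the residual form keeps the previous atom vector and adds the summed neighbor messages to it.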


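After $K$ iterations, we perform a linear transformation to map $\mathbf{v}_i^{(K)}$ to a scalar $\tilde{E}_i$,

$$\tilde{E}_i = \mathbf{v}_i^{(K)} \mathbf{W}_l + b_l,$$

and then use a normalized sum pooling to predict the averaged total energy per atom of the crystal,

$$\tilde{E} = \frac{1}{N} \sum_i \tilde{E}_i,$$

where $N$ is the number of atoms in the crystal. This introduces a physically meaningful term $\tilde{E}_i$ to represent the energy of the local chemical environment.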
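The readout described above amounts to a per-atom linear map followed by an average over atoms. A minimal sketch, with hypothetical function names and toy numbers chosen only for illustration:

```python
import numpy as np

def local_energies(v_final, W_l, b_l):
    """Map each final atom vector (row of v_final) to a scalar local energy."""
    return v_final @ W_l + b_l

def energy_per_atom(v_final, W_l, b_l):
    """Normalized sum pooling: the mean of the local energies over all atoms."""
    return float(np.mean(local_energies(v_final, W_l, b_l)))

# Toy example with three atoms and two features per atom (made-up values).
v_final = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
W_l = np.array([2.0, -1.0])
b_l = 0.5
print(local_energies(v_final, W_l, b_l), energy_per_atom(v_final, W_l, b_l))
```

Because the crystal energy is the mean of the per-atom terms, each term can be read as that atom's contribution to the predicted energy per atom.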

The model is trained by minimizing the squared error between the predicted and DFT-calculated properties using backpropagation and stochastic gradient descent.

In this CGCNN model, each vector $\mathbf{v}_i^{(t)}$ represents the local environment of atom $i$ at a different scale. Here, we focus on three representations that have the most representative physical interpretations.

  1. Element representation $\mathbf{v}_i^{(0)}$ that depends completely on the type of element that atom $i$ is composed of, describing the similarities between elements.

  2. Local environment representation $\mathbf{v}_i^{(K)}$ that depends on atom $i$ and its $K$th order neighbors, describing the similarities between local environments that combine the compositional and structural information.

  3. Local energy representation $\tilde{E}_i$ that describes the energy of atom $i$.

III Results and Discussions

To illustrate how this method can help visualize the compositional and structural aspects of the crystal space, we apply it to three datasets that represent different material spaces: 1) a group of perovskite crystals that share the same structure type but have different compositions; 2) different configurations of elemental boron that share the same composition but have different structures; and 3) inorganic crystals from the Materials Project Jain et al. (2013) that have both different compositions and different structures.

For each material space, we train the CGCNN model with 60% of the data to predict the energy per atom of the materials. 20% of the data are used to select hyperparameters of the model and the last 20% are reserved for testing. In Fig. 2, we show the learning curves for the three representative material spaces where a subset of training data is used to show how the number of training data affects the model prediction performance. As we will show below, the representations learned by predicting the energies automatically gain physical meanings and can be used to explore the materials spaces.
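A 60/20/20 split of this kind can be sketched in a few lines. The function name, the fixed seed, and the use of a simple shuffled split are assumptions for illustration; the text does not specify how the partitions were drawn.

```python
import random

def split_dataset(items, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle items and cut them into train, validation, and test lists."""
    items = list(items)
    random.Random(seed).shuffle(items)     # seeded for reproducibility
    n_train = round(train_frac * len(items))
    n_val = round(val_frac * len(items))
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]         # whatever remains is the test set
    return train, val, test

train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))
```

Learning curves can then be produced by training on progressively larger subsets of `train` while always evaluating on the same `test` list.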

Figure 2: Learning curves for the three representative material spaces. The mean absolute errors (MAEs) on test data are shown as a function of the number of training data for the perovskites Castelli et al. (2012a, b), elemental boron Deringer, Pickard, and Csányi (2018), and Materials Project Jain et al. (2013) datasets.

III.1 Perovskite: compositional space

Figure 3: Visualization of the element representations learned from the perovskite dataset. (a) The perovskite structure type. (b) Visualization of the two principal dimensions with principal component analysis. (c) Prediction performance of several atom properties using a linear model on the element representations.

First, we explore the compositional space of perovskites by visualizing the element representations. Perovskite is a crystal structure type with the form of ABC_3 as shown in Fig. 3(a). The dataset Castelli et al. (2012a, b) that we used includes 18,928 different perovskites where the elements A and B can be any nonradioactive metals and the element C can be one or several from O, N, S, and F. We trained our model to predict the energy above hull with 15,000 training data, and after hyperparameter optimization on 1,890 validation data, we achieve a prediction mean absolute error (MAE) of 0.042 eV/atom on 2,000 test data. This error is lower than those of several recent ML models such as those of Schmidt et al. (0.121 eV/atom) Schmidt et al. (2017) and Xie et al. (0.099 eV/atom) Xie and Grossman (2018). The learning curve in Fig. 2 shows a straight line in log-log scale, indicating a steady improvement of prediction performance as the number of training data increases.
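A straight line on a log-log learning curve corresponds to a power law, MAE ≈ a · N^b with b < 0, and the exponent can be estimated by a linear fit of log(MAE) against log(N). The sketch below uses synthetic MAE values generated from an assumed power law purely for illustration; they are not results from this work, and `fit_power_law` is a hypothetical helper name.

```python
import numpy as np

def fit_power_law(n_train, mae):
    """Fit mae ~ prefactor * n_train**slope by least squares in log-log space."""
    slope, intercept = np.polyfit(np.log(n_train), np.log(mae), 1)
    return slope, np.exp(intercept)

# Training-set sizes as used for the perovskite subsets; the MAE values are
# synthetic, produced from an assumed exponent of -0.5.
n_train = np.array([234.0, 937.0, 3750.0, 15000.0])
mae = 2.0 * n_train ** -0.5
slope, prefactor = fit_power_law(n_train, mae)
print(slope, prefactor)
```

A more negative slope means each additional training point reduces the error faster, which gives one way to compare how the three datasets respond to more data.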

In Fig. 3(b)(c), the element representation $\mathbf{v}_i^{(0)}$, a 64 dimensional vector, is visualized for every nonradioactive metal element after training with the perovskite dataset. Fig. 3(b) shows the projection of these element representations on a 2D plane using principal component analysis, where elements are colored according to their elemental groups. We can clearly see that similar elements are grouped together based on their stability in perovskite structures. For instance, alkali metals are grouped on the right of the plot due to their similar properties. The large alkaline earth metals (Ba, Sr, and Ca) are grouped on the bottom, distinct from Mg and Be, because their larger radius stabilizes them in the perovskite structure. On the left side are elements such as W, Mo, and Ta that favor octahedral coordinations due to their configuration of d electrons, which might be related to their extra stability in the B site Xie and Grossman (2018). Interestingly, we can also observe a trend of decreasing atom radius from the bottom of the plot to the top as shown in the inset of Fig. 3(b), except for the alkali metals as outliers. This indicates that CGCNN learns the atom radius as an important feature for perovskite stability. Recently, Schütt et al. also discovered similar grouping of elements with data from the Materials Project Schütt et al. (2018). In general, these visualizations can help discover similarities between elements for designing novel perovskite structures.
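The 2D projection of the 64 dimensional element vectors can be reproduced in outline with a plain SVD-based principal component analysis. This is a generic sketch: the random matrix stands in for learned element representations, and `pca_2d` is a hypothetical helper name.

```python
import numpy as np

def pca_2d(X):
    """Project the rows of X onto their first two principal components.

    Returns the 2D coordinates and the fraction of variance explained by
    each of the two components.
    """
    Xc = X - X.mean(axis=0)                       # center each feature
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    coords = Xc @ Vt[:2].T                        # (n_samples, 2)
    explained = S[:2] ** 2 / np.sum(S ** 2)
    return coords, explained

# Placeholder data: 20 "elements" with random 64 dimensional vectors.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 64))
coords, explained = pca_2d(X)
print(coords.shape, explained)
```

The explained-variance fractions indicate how much of the 64 dimensional structure the 2D plot actually captures, which is worth checking before reading patterns off the projection.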

We also study how the element representations evolve as the number of training data changes. In Fig. S1, we show the 2D projections of the element representations when 234, 937, 3,750, and 15,000 training data are used, respectively. The projection looks completely random with 234 training data, and some patterns start to emerge when 937 training data are used. In Fig. S1(b), transition metals are grouped on top of the figure while large metals like La, Ca, Sr, Ba, and Cs are grouped at the bottom. With 3,750 training data, the figure is already close to Fig. 3(b) and the relation between atom radius and the second dimension is clear. Fig. 3(b) and Fig. S1(d) are almost identical after rotations because they both use 15,000 training data. Note that these representations start from different random initializations, but they result in similar patterns after training with the same perovskite data.

However, these 2D plots only account for part of the 64-dimensional element representation vectors. To fully understand how element properties are learned by CGCNN, we use linear logistic regression (LR) models to predict the block type, group number, radius, and electronegativity of each element from their learned representation vectors. In Fig. 3(c), we show the 3-fold cross validation accuracy of the LR models and compare them with LR models learned from random representations, which helps to rule out the possibility that the predictions are caused by coincidences. We discover a significantly higher prediction accuracy of the learned representations for all four properties, demonstrating that the element representations can reflect multiple aspects of element properties. For instance, the model predicts the block of the element with over 90% accuracy, and the same representation also predicts the group number, radius, and electronegativity with over 60% accuracy. This is surprising considering that there are 16 different elemental groups represented. It is worth noting that these representations are learned only from the perovskite structures and the total energy above hull, but they are in agreement with these empirical element properties reflecting decades of human chemical intuition.
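The probing experiment, a linear classifier evaluated with 3-fold cross validation and compared against random vectors, can be sketched as below. Everything here is an assumption for illustration: the classifier is a small multinomial logistic regression trained by gradient descent in NumPy, and the "informative" vectors are synthetic stand-ins for learned element representations.

```python
import numpy as np

def fit_softmax(X, y, n_classes, lr=0.2, n_steps=500, l2=1e-3):
    """Multinomial logistic regression trained with plain gradient descent."""
    X1 = np.hstack([X, np.ones((len(X), 1))])     # append a bias column
    W = np.zeros((X1.shape[1], n_classes))
    Y = np.eye(n_classes)[y]                      # one-hot targets
    for _ in range(n_steps):
        logits = X1 @ W
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)
        W -= lr * (X1.T @ (P - Y) / len(X) + l2 * W)
    return W

def predict(W, X):
    X1 = np.hstack([X, np.ones((len(X), 1))])
    return np.argmax(X1 @ W, axis=1)

def cv_accuracy(X, y, n_classes, n_folds=3, seed=0):
    """Mean held-out accuracy over n_folds shuffled folds."""
    order = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(order, n_folds)
    scores = []
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[m] for m in range(n_folds) if m != k])
        W = fit_softmax(X[train], y[train], n_classes)
        scores.append(np.mean(predict(W, X[test]) == y[test]))
    return float(np.mean(scores))

# Synthetic stand-ins: 60 "elements" in 3 hypothetical classes.
rng = np.random.default_rng(0)
y = np.repeat(np.arange(3), 20)
X_informative = 3.0 * np.eye(3, 8)[y] + 0.5 * rng.normal(size=(60, 8))
X_random = rng.normal(size=(60, 8))               # carries no class signal
acc_informative = cv_accuracy(X_informative, y, 3)
acc_random = cv_accuracy(X_random, y, 3)
print(acc_informative, acc_random)
```

The random baseline matters because with few samples and many dimensions a linear model can score above naive chance by accident; only the gap between the two accuracies indicates that the representation encodes the property.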

III.2 Elemental boron: structural space

Figure 4: Visualization of the local environment representations learned from the elemental boron dataset. The original 64D vectors are reduced to 2D with the t-distributed stochastic neighbor embedding algorithm. The color of each plot is coded with learned local energy (a), number of neighbors calculated by Pymatgen package Ong et al. (2013) (b), and density (c). Representative boron local environments are shown with the center atom colored in red.

As a second example, we explore the structural space of elemental boron by visualizing the local environment representations and the corresponding local energies. Elemental boron has a number of complex crystal structures due to its unique, electron-deficient bonding nature Ogitsu, Schwegler, and Galli (2013); Deringer, Pickard, and Csányi (2018). We use a dataset that includes 5038 distinct elemental boron structures and their total energies calculated using density functional theory Deringer, Pickard, and Csányi (2018). We train our CGCNN model with 3038 structures, and perform hyperparameter optimization with 1000 validation structures. The MAE of predicted energy relative to DFT results on the remaining 1000 test structures is 0.085 eV/atom. The learning curve in Fig. 2 shows a much smaller slope compared with the other material spaces. One explanation is that there exist many highly unstable boron structures in the dataset, whose energies might be hard to predict given the limited structures covered by the training data.

In Fig. 4, 1000 randomly sampled boron local environment representations are visualized in 2 dimensions using the t-distributed stochastic neighbor embedding (t-SNE) algorithm Maaten and Hinton (2008). We observe primarily four different regions of different boron local environments, and we discover a smooth transition of local energy, number of neighbor atoms, and the density between different regions. The disconnected region consists of boron atoms at the edge of boron clusters [Fig. S1(a-c)]. These atoms have very high local energies and fewer neighbors, as expected, and their density varies depending on the distances between clusters. The amorphous region includes boron atoms in a relatively disordered local configuration, and their local energies are lower than the disconnected counterparts but higher than other configurations [Fig. S1(d-f)]. We can see that the number of neighbors fluctuates drastically in these two regions due to the relatively disordered local structures. The layered region is composed of boron atoms in layered boron planes, where neighbors on one side are closely bonded and the neighbors on the other side are further away [Fig. S1(g-i)]. The B_12 icosahedron region includes boron local environments with the lowest local energy, which have a characteristic icosahedron structure [Fig. S1(j-l)]. The local environments in each region share common characteristics but are slightly different in detail. For instance, most boron atoms in the B_12 icosahedron region are in a slightly distorted icosahedron, and the local environments in Fig. S1(l) only have certain features of an icosahedron. Note that these representations are rather localized. The global structure of Fig. S1(c) is layered, but the representation of the highlighted atom at the edge is closer to the disconnected region locally. Some experimentally observed boron structures, like boron fullerenes, are not present in the dataset. We calculate the local environment representations of every distinct boron atom of two boron fullerenes Zhai et al. (2014) using the trained CGCNN, and plot them into the original 2D visualization in Fig. S3. They form a small cluster close to the B_12 icosahedron region. This can be explained by the fact that they share many common characteristics with the B_12 icosahedron structure. In addition, the representations of the less symmetric B_40() are more spread out than those of the more symmetric B_40(). Note that the pattern in Fig. S3 is slightly different from that in Fig. 4 due to the random nature of the t-SNE algorithm, but the overall structure of the patterns is preserved.

Taken together, such a visualization approach provides a convenient way to explore complex boron configurations, enabling the identification of characteristic structures and systematic exploration of structural space.

III.3 Materials Project: compositional and structural space

Figure 5: Visualization of the local oxygen (a) and sulfur (b) coordination environments. The points are labelled according to the type of the center atoms in the coordination environments. The colors of the upper parts are coded with learned local energies, and the colors of the lower parts are coded with the number of neighbors Ong et al. (2013), octahedron order parameter, and tetrahedron order parameter Zimmermann et al. (2017).

As a third example of applying this approach, we explore the material space of crystals in the Materials Project dataset Jain et al. (2013), which includes both compositional and structural differences, by visualizing the element representation, local environment representation, and the local energy representation. The dataset includes 46744 materials that cover the majority of crystals from the Inorganic Crystal Structure Database Hellenbrandt (2004), providing a good representation of known inorganic materials. After training with 28046 crystals and performing hyperparameter optimization with 9348 crystals, our model achieves an MAE of 0.042 eV/atom for the predicted energy relative to DFT calculations on the 9348 test crystals, slightly higher than the MAE of our previous work, 0.039 eV/atom, with a CGCNN structure focusing on prediction performance Xie and Grossman (2018). The learning curve in Fig. 2 is similar to that of the perovskites dataset, which might indicate a similar prediction performance for datasets that are composed of stable inorganic compounds. In Table 1, we compare the prediction performance of this method with several recently published works.

Method | MAE (eV/atom) | Data source | Training size
This work | 0.042 | MP | 28,046
CGCNN Xie and Grossman (2018) | 0.039 | MP | 28,046
SchNet Schütt et al. (2018) | 0.035 | MP | 60,000
Generalized Coulomb matrix Faber et al. (2015) | 0.37 | MP | 3,000
Decision trees + heuristic Meredig et al. (2014) | 0.12 | Ternary compounds | 15,000
Voronoi + composition Ward et al. (2017) | 0.08 | OQMD | 30,000
QML Faber et al. (2018) | 0.11 | OQMD | 2,000
Random subspace + REPTree Ward et al. (2016) | 0.088 | OQMD | 228,676

Table 1: Comparison of the prediction performance of formation energy per atom. The mean absolute errors (MAEs) on test data reported in several recent works are summarized. Data come from several different but similar inorganic crystal material datasets. MP represents the Materials Project Jain et al. (2013), OQMD represents the Open Quantum Materials Database Saal et al. (2013), and the ternary compounds are A_xB_yC_z compounds calculated by Ref. Meredig et al. (2014).

In Fig. S2, the element representation of 89 elements learned from the dataset is shown using the same method as that used to generate Fig. 3(b). We observe similar grouping of elements from the same elemental groups, but the overall pattern differs since it reflects the stability of each element in general inorganic crystals rather than perovskites. For instance, the non-metal and halogen elements stand out because their properties deviate from those of the metallic elements.

To illustrate how the compositional and structural spaces can be explored simultaneously, we visualize the oxygen and sulfur coordination environments in the Materials Project dataset using the local environment representation and local energy. 1000 oxygen and 803 sulfur coordination environments are randomly selected and visualized using the t-SNE algorithm. As shown in Fig. 5(a), the oxygen coordination environments are clustered into 4 major groups. The upper right group has the center atom of non-metal elements like P, Al, Si, forming tetrahedron coordinations. The center atoms of the upper left environments are mostly transition metals, and they mostly form octahedron coordinations. The lower left group has center atoms of alkali metals, and the lower right group has those of alkaline earth metals and lanthanides which have larger radii and therefore higher coordination numbers. The sulfur coordination environment visualization [Fig. 5(b)] shares similar patterns due to the similarities between oxygen and sulfur, and a similar four-cluster structure can be observed. However, instead of non-metal elements, the lower center group has center atoms of metalloids like Ge, Sn, Sb, since these elements will be more stable in a sulfur vs. oxygen coordination environment.

The local energies of oxygen and sulfur coordination environments are determined by their stability relative to the pure elemental states since the model is trained using the formation energy data, which treats the pure elemental states as the reference energy states. In Fig. S3, we show the oxygen and sulfur local energies as a function of atomic number. We can clearly see that they follow a similar trend as the electronegativity of the elements: elements with lower electronegativity tend to have lower local energy and vice versa. This is because elements with lower electronegativity tend to give the oxygen and sulfur more electrons and thus form stronger bonds. The local energies of alkali metals are slightly higher since they form weaker ionic bonds due to lower charges. Interestingly, the strong covalent bonds between oxygen and Al, Si, P, S form a V-shaped curve in the figure, with Si-O environments having the lowest energy, contrasting the trend of electronegativity and sulfur coordination environments, whose local energies are dominated by the strength of ionic bonds. We also observe a larger span of local energies in oxygen coordination environments than their sulfur counterparts due to the stronger ionic interactions.

Inspired by these results, we visualize the averaged local energy of 734,077 distinct coordination environments in the Materials Project by combining different center and neighbor atoms in Fig. 6. This figure illustrates the stability of the local coordination environment for each combination of center and neighbor elements. The diagonal line represents coordination environments made up of the same element with local energy close to zero, which corresponds to elemental substances with zero formation energy. The coordination environments with lowest local energy consist of high valence metals and high electronegativity non-metals, which can be explained by the large cohesive energies due to strong ionic bonds. One abnormality is the stable Al-O, Si-O, P-O, S-O coordination environments, although this can be attributed to their strong covalent bonds. We can also see that Tm-H coordination stands out as a stable hydrogen solid solution Bonnet and Daou (1979). It is worth noting that each local energy in Fig. 6 is the average of many coordination environments with different shape and outer layer chemistry, and we can obtain more information by using additional visualizations similar to Fig. 5.

Figure 6: The averaged local energy of 734,077 distinct coordination environments in the Materials Project dataset. The color is coded with the average of learned local energies while having the corresponding elements as the center atom and the first neighbor atom. White is used when no such coordination environment exists in the dataset.
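The aggregation behind such a center-neighbor map is a grouped mean over element pairs. A minimal sketch, assuming each coordination environment has already been reduced to a (center element, neighbor element, local energy) tuple; the tuple layout, the function name, and the numbers are hypothetical:

```python
from collections import defaultdict

def average_pair_energy(environments):
    """Average the local energy for every (center, neighbor) element pair.

    environments: iterable of (center_element, neighbor_element, local_energy).
    Pairs that never occur are simply absent from the result, which matches
    leaving those cells blank in a heat map.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for center, neighbor, energy in environments:
        sums[(center, neighbor)] += energy
        counts[(center, neighbor)] += 1
    return {pair: sums[pair] / counts[pair] for pair in sums}

# Made-up local energies in eV/atom, for illustration only.
envs = [("Si", "O", -2.0), ("Si", "O", -3.0), ("Na", "Cl", -1.0)]
print(average_pair_energy(envs))
```

Keeping the counts alongside the sums also makes it easy to report how many environments stand behind each averaged cell.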

IV Conclusion

In summary, we developed a unified approach to visualize the compositional and structural space of materials. The method provides hierarchical representations of the local environments at different scales, which enables a general framework to explore different material systems and measure material similarities. The insights gained from the visualizations could help to discover patterns in a large pool of candidate materials that may be impossible to find by human analysis, and provide guidance for the design of new materials. In addition to energies, this method can potentially be applied to other material properties for the exploration of novel functional materials.

V Supplementary Material

See supplementary material for the details of the hyperparameters for each model, results of the effects of the number of training data on element representations, additional figures showing the structures of boron local environments and the location of boron fullerene local environment representations with respect to the representations of other boron structures, results of the element representations learned from the Materials Project dataset, and results of the change of local energy as a function of atomic number.

This work was supported by Toyota Research Institute. Computational support was provided through the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231, and the Extreme Science and Engineering Discovery Environment, supported by National Science Foundation grant number ACI-1053575.


  • Niu, Guo, and Wang (2015) G. Niu, X. Guo,  and L. Wang, “Review of recent progress in chemical stability of perovskite solar cells,” Journal of Materials Chemistry A 3, 8970–8980 (2015).
  • Snaith (2013) H. J. Snaith, “Perovskites: the emergence of a new era for low-cost, high-efficiency solar cells,” The Journal of Physical Chemistry Letters 4, 3623–3630 (2013).
  • Xu et al. (2013) M. Xu, T. Liang, M. Shi,  and H. Chen, “Graphene-like two-dimensional materials,” Chemical reviews 113, 3766–3798 (2013).
  • Butler et al. (2013) S. Z. Butler, S. M. Hollen, L. Cao, Y. Cui, J. A. Gupta, H. R. Gutiérrez, T. F. Heinz, S. S. Hong, J. Huang, A. F. Ismach, et al., “Progress, challenges, and opportunities in two-dimensional materials beyond graphene,” ACS nano 7, 2898–2926 (2013).
  • Madelung (1964) O. Madelung, Physics of III-V compounds (J. Wiley, 1964).
  • Greeley et al. (2006) J. Greeley, T. F. Jaramillo, J. Bonde, I. Chorkendorff,  and J. K. Nørskov, “Computational high-throughput screening of electrocatalytic materials for hydrogen evolution,” Nature materials 5, 909 (2006).
  • Senkan (1998) S. M. Senkan, “High-throughput screening of solid-state catalyst libraries,” Nature 394, 350 (1998).
  • Potyrailo et al. (2011) R. Potyrailo, K. Rajan, K. Stoewe, I. Takeuchi, B. Chisholm,  and H. Lam, “Combinatorial and high-throughput screening of materials libraries: review of state of the art,” ACS combinatorial science 13, 579–633 (2011).
  • Curtarolo et al. (2013) S. Curtarolo, G. L. Hart, M. B. Nardelli, N. Mingo, S. Sanvito,  and O. Levy, “The high-throughput highway to computational materials design,” Nature materials 12, 191 (2013).
  • Hautier et al. (2010) G. Hautier, C. C. Fischer, A. Jain, T. Mueller,  and G. Ceder, “Finding nature’s missing ternary oxide compounds using machine learning and density functional theory,” Chemistry of Materials 22, 3762–3767 (2010).
  • Gómez-Bombarelli et al. (2016) R. Gómez-Bombarelli, J. Aguilera-Iparraguirre, T. D. Hirzel, D. Duvenaud, D. Maclaurin, M. A. Blood-Forsythe, H. S. Chae, M. Einzinger, D.-G. Ha, T. Wu, et al., “Design of efficient molecular organic light-emitting diodes by a high-throughput virtual screening and experimental approach,” Nature materials 15, 1120 (2016).
  • Faber et al. (2016) F. A. Faber, A. Lindmaa, O. A. Von Lilienfeld, and R. Armiento, “Machine learning energies of 2 million elpasolite (ABC2D6) crystals,” Physical review letters 117, 135502 (2016).
  • Rupp et al. (2012) M. Rupp, A. Tkatchenko, K.-R. Müller,  and O. A. Von Lilienfeld, “Fast and accurate modeling of molecular atomization energies with machine learning,” Physical review letters 108, 058301 (2012).
  • Pilania et al. (2013) G. Pilania, C. Wang, X. Jiang, S. Rajasekaran,  and R. Ramprasad, “Accelerating materials property predictions using machine learning,” Scientific reports 3, 2810 (2013).
  • Meredig et al. (2014) B. Meredig, A. Agrawal, S. Kirklin, J. E. Saal, J. Doak, A. Thompson, K. Zhang, A. Choudhary,  and C. Wolverton, “Combinatorial screening for new materials in unconstrained composition space with machine learning,” Physical Review B 89, 094104 (2014).
  • Ward et al. (2016) L. Ward, A. Agrawal, A. Choudhary,  and C. Wolverton, “A general-purpose machine learning framework for predicting properties of inorganic materials,” npj Computational Materials 2, 16028 (2016).
  • Ghiringhelli et al. (2015) L. M. Ghiringhelli, J. Vybiral, S. V. Levchenko, C. Draxl,  and M. Scheffler, “Big data of materials science: critical role of the descriptor,” Physical review letters 114, 105503 (2015).
  • Behler (2011) J. Behler, “Atom-centered symmetry functions for constructing high-dimensional neural network potentials,” The Journal of chemical physics 134, 074106 (2011).
  • Pietrucci and Andreoni (2011) F. Pietrucci and W. Andreoni, “Graph theory meets ab initio molecular dynamics: atomic structures and transformations at the nanoscale,” Physical review letters 107, 085504 (2011).
  • Sadeghi et al. (2013) A. Sadeghi, S. A. Ghasemi, B. Schaefer, S. Mohr, M. A. Lill,  and S. Goedecker, “Metrics for measuring distances in configuration spaces,” The Journal of chemical physics 139, 184118 (2013).
  • Bartók, Kondor, and Csányi (2013) A. P. Bartók, R. Kondor,  and G. Csányi, “On representing chemical environments,” Physical Review B 87, 184115 (2013).
  • Schütt et al. (2014) K. Schütt, H. Glawe, F. Brockherde, A. Sanna, K. Müller,  and E. Gross, “How to represent crystal structures for machine learning: Towards fast prediction of electronic properties,” Physical Review B 89, 205118 (2014).
  • Faber et al. (2018) F. A. Faber, A. S. Christensen, B. Huang,  and O. A. von Lilienfeld, “Alchemical and structural distribution based representation for universal quantum machine learning,” The Journal of Chemical Physics 148, 241717 (2018).
  • Glielmo, Zeni, and De Vita (2018) A. Glielmo, C. Zeni,  and A. De Vita, “Efficient nonparametric n-body force fields from machine learning,” Physical Review B 97, 184307 (2018).
  • Ward et al. (2017) L. Ward, R. Liu, A. Krishna, V. I. Hegde, A. Agrawal, A. Choudhary,  and C. Wolverton, “Including crystal structure attributes in machine learning models of formation energies via voronoi tessellations,” Physical Review B 96, 024104 (2017).
  • Isayev et al. (2015) O. Isayev, D. Fourches, E. N. Muratov, C. Oses, K. Rasch, A. Tropsha,  and S. Curtarolo, “Materials cartography: representing and mining materials space using structural and electronic fingerprints,” Chemistry of Materials 27, 735–743 (2015).
  • De et al. (2016) S. De, A. P. Bartók, G. Csányi,  and M. Ceriotti, “Comparing molecules and solids across structural and alchemical space,” Physical Chemistry Chemical Physics 18, 13754–13769 (2016).
  • Musil et al. (2018) F. Musil, S. De, J. Yang, J. E. Campbell, G. M. Day,  and M. Ceriotti, “Machine learning for the structure–energy–property landscapes of molecular crystals,” Chemical science 9, 1289–1300 (2018).
  • Das et al. (2006) P. Das, M. Moll, H. Stamati, L. E. Kavraki,  and C. Clementi, “Low-dimensional, free-energy landscapes of protein-folding reactions by nonlinear dimensionality reduction,” Proceedings of the National Academy of Sciences 103, 9885–9890 (2006).
  • Ceriotti, Tribello, and Parrinello (2011) M. Ceriotti, G. A. Tribello,  and M. Parrinello, “Simplifying the representation of complex free-energy landscapes using sketch-map,” Proceedings of the National Academy of Sciences 108, 13023–13028 (2011).
  • Spiwok and Králová (2011) V. Spiwok and B. Králová, “Metadynamics in the conformational space nonlinearly dimensionally reduced by isomap,” The Journal of chemical physics 135, 224504 (2011).
  • Rohrdanz, Zheng, and Clementi (2013) M. A. Rohrdanz, W. Zheng,  and C. Clementi, “Discovering mountain passes via torchlight: methods for the definition of reaction coordinates and pathways in complex macromolecular reactions,” Annual review of physical chemistry 64, 295–316 (2013).
  • Pietrucci and Martoňák (2015) F. Pietrucci and R. Martoňák, “Systematic comparison of crystalline and amorphous phases: Charting the landscape of water structures and transformations,” The Journal of chemical physics 142, 104704 (2015).
  • Engel et al. (2018) E. A. Engel, A. Anelli, M. Ceriotti, C. J. Pickard,  and R. J. Needs, “Mapping uncharted territory in ice from zeolite networks to ice structures,” Nature communications 9, 2173 (2018).
  • Seko et al. (2015) A. Seko, A. Togo, H. Hayashi, K. Tsuda, L. Chaput,  and I. Tanaka, “Prediction of low-thermal-conductivity compounds with first-principles anharmonic lattice-dynamics calculations and bayesian optimization,” Physical review letters 115, 205901 (2015).
  • Isayev et al. (2017) O. Isayev, C. Oses, C. Toher, E. Gossett, S. Curtarolo,  and A. Tropsha, “Universal fragment descriptors for predicting properties of inorganic crystals,” Nature communications 8, 15679 (2017).
  • Behler and Parrinello (2007) J. Behler and M. Parrinello, “Generalized neural-network representation of high-dimensional potential-energy surfaces,” Physical review letters 98, 146401 (2007).
  • Botu et al. (2016) V. Botu, R. Batra, J. Chapman,  and R. Ramprasad, “Machine learning force fields: construction, validation, and outlook,” The Journal of Physical Chemistry C 121, 511–522 (2016).
  • Duvenaud et al. (2015) D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik,  and R. P. Adams, “Convolutional networks on graphs for learning molecular fingerprints,” in Advances in neural information processing systems (2015) pp. 2224–2232.
  • Kearnes et al. (2016) S. Kearnes, K. McCloskey, M. Berndl, V. Pande,  and P. Riley, “Molecular graph convolutions: moving beyond fingerprints,” Journal of computer-aided molecular design 30, 595–608 (2016).
  • Gilmer et al. (2017) J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals,  and G. E. Dahl, “Neural message passing for quantum chemistry,” arXiv preprint arXiv:1704.01212  (2017).
  • Schütt et al. (2017) K. T. Schütt, F. Arbabzadah, S. Chmiela, K. R. Müller, and A. Tkatchenko, “Quantum-chemical insights from deep tensor neural networks,” Nature communications 8, 13890 (2017).
  • Xie and Grossman (2018) T. Xie and J. C. Grossman, “Crystal graph convolutional neural networks for an accurate and interpretable prediction of material properties,” Physical Review Letters 120, 145301 (2018).
  • Schütt et al. (2018) K. T. Schütt, H. E. Sauceda, P.-J. Kindermans, A. Tkatchenko,  and K.-R. Müller, “Schnet–a deep learning architecture for molecules and materials,” The Journal of Chemical Physics 148, 241722 (2018).
  • Wu et al. (2018) Z. Wu, B. Ramsundar, E. N. Feinberg, J. Gomes, C. Geniesse, A. S. Pappu, K. Leswing,  and V. Pande, “Moleculenet: a benchmark for molecular machine learning,” Chemical science 9, 513–530 (2018).
  • Blank et al. (1995) T. B. Blank, S. D. Brown, A. W. Calhoun,  and D. J. Doren, “Neural network models of potential energy surfaces,” The Journal of chemical physics 103, 4129–4137 (1995).
  • Deringer, Pickard, and Csányi (2018) V. L. Deringer, C. J. Pickard,  and G. Csányi, “Data-driven learning of total and local energies in elemental boron,” Physical review letters 120, 156001 (2018).
  • Schmidt et al. (2017) J. Schmidt, J. Shi, P. Borlido, L. Chen, S. Botti,  and M. A. Marques, “Predicting the thermodynamic stability of solids combining density functional theory and machine learning,” Chemistry of Materials 29, 5090–5103 (2017).
  • Zhou et al. (2018) Q. Zhou, P. Tang, S. Liu, J. Pan, Q. Yan, and S.-C. Zhang, “Learning atoms for materials discovery,” Proceedings of the National Academy of Sciences, 201801181 (2018).
  • Willatt, Musil, and Ceriotti (2018) M. J. Willatt, F. Musil,  and M. Ceriotti, “A data-driven construction of the periodic table of the elements,” arXiv preprint arXiv:1807.00236  (2018).
  • Jain et al. (2013) A. Jain, S. P. Ong, G. Hautier, W. Chen, W. D. Richards, S. Dacek, S. Cholia, D. Gunter, D. Skinner, G. Ceder, and K. A. Persson, “The Materials Project: A materials genome approach to accelerating materials innovation,” APL Materials 1, 011002 (2013).
  • Castelli et al. (2012a) I. E. Castelli, D. D. Landis, K. S. Thygesen, S. Dahl, I. Chorkendorff, T. F. Jaramillo,  and K. W. Jacobsen, “New cubic perovskites for one-and two-photon water splitting using the computational materials repository,” Energy & Environmental Science 5, 9034–9043 (2012a).
  • Castelli et al. (2012b) I. E. Castelli, T. Olsen, S. Datta, D. D. Landis, S. Dahl, K. S. Thygesen,  and K. W. Jacobsen, “Computational screening of perovskite metal oxides for optimal solar light capture,” Energy & Environmental Science 5, 5814–5819 (2012b).
  • Ong et al. (2013) S. P. Ong, W. D. Richards, A. Jain, G. Hautier, M. Kocher, S. Cholia, D. Gunter, V. L. Chevrier, K. A. Persson, and G. Ceder, “Python materials genomics (pymatgen): A robust, open-source python library for materials analysis,” Computational Materials Science 68, 314–319 (2013).
  • Ogitsu, Schwegler, and Galli (2013) T. Ogitsu, E. Schwegler, and G. Galli, “β-rhombohedral boron: at the crossroads of the chemistry of boron and the physics of frustration,” Chemical reviews 113, 3425–3449 (2013).
  • Maaten and Hinton (2008) L. v. d. Maaten and G. Hinton, “Visualizing data using t-sne,” Journal of machine learning research 9, 2579–2605 (2008).
  • Zhai et al. (2014) H.-J. Zhai, Y.-F. Zhao, W.-L. Li, Q. Chen, H. Bai, H.-S. Hu, Z. A. Piazza, W.-J. Tian, H.-G. Lu, Y.-B. Wu, et al., “Observation of an all-boron fullerene,” Nature chemistry 6, 727 (2014).
  • Zimmermann et al. (2017) N. E. R. Zimmermann, M. K. Horton, A. Jain,  and M. Haranczyk, “Assessing local structure motifs using order parameters for motif recognition, interstitial identification, and diffusion path characterization,” Frontiers in Materials 4, 34 (2017).
  • Hellenbrandt (2004) M. Hellenbrandt, “The inorganic crystal structure database (icsd)—present and future,” Crystallography Reviews 10, 17–22 (2004).
  • Saal et al. (2013) J. E. Saal, S. Kirklin, M. Aykol, B. Meredig,  and C. Wolverton, “Materials design and discovery with high-throughput density functional theory: the open quantum materials database (oqmd),” Jom 65, 1501–1509 (2013).
  • Faber et al. (2015) F. Faber, A. Lindmaa, O. A. von Lilienfeld,  and R. Armiento, “Crystal structure representations for machine learning models of formation energies,” International Journal of Quantum Chemistry 115, 1094–1101 (2015).
  • Bonnet and Daou (1979) J. Bonnet and J. Daou, “Study of the hydrogen solid solution in thulium,” Journal of Physics and Chemistry of Solids 40, 421–425 (1979).
  • Cordero et al. (2008) B. Cordero, V. Gómez, A. E. Platero-Prats, M. Revés, J. Echeverría, E. Cremades, F. Barragán, and S. Alvarez, “Covalent radii revisited,” Dalton Transactions, 2832–2838 (2008).
  • Mentel Ł. Mentel, “mendeleev – a python resource for properties of chemical elements, ions and isotopes.”

I Supplementary Methods

I.1 Logistic Regression Models

In the perovskite dataset, we use logistic regression models to predict four different elemental properties. For consistency, we treat all four predictions as classification problems, although some of the properties take continuous values. The categories of each elemental property are summarized in Table 2.
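As a minimal sketch of how a continuous property can be turned into class labels, the snippet below assigns a value to one of the half-open radius intervals listed in Table 2. The helper name `to_category` and the example radii are hypothetical and serve only to illustrate the binning; this is not the authors' implementation, and the classifier that would be trained on these labels is not shown.

```python
from bisect import bisect_right

# Bin edges copied from the radius row of Table 2.
RADIUS_EDGES = [83, 116, 148, 180, 212, 244]


def to_category(value, edges):
    """Return the index i of the half-open bin [edges[i], edges[i+1]) containing value."""
    if not edges[0] <= value < edges[-1]:
        raise ValueError(f"{value} lies outside [{edges[0]}, {edges[-1]})")
    # bisect_right counts the edges that are <= value; subtracting 1 gives the bin index.
    return bisect_right(edges, value) - 1


# Hypothetical radius values, chosen only to exercise the bin boundaries.
example_radii = [83, 115.9, 116, 200, 243.9]
labels = [to_category(r, RADIUS_EDGES) for r in example_radii]
print(labels)
```

A value exactly equal to the upper edge falls outside the half-open intervals as written in Table 2, so this sketch raises an error for it; how the original models handle that case is not stated.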

II Supplementary Tables

| Dataset | # of convolutional layers | Length of representation | Learning rate |
|---|---|---|---|
| Perovskites | 4 | 64 | 0.005 |
| Elemental B | 4 | 64 | 0.005 |
| Materials Project | 4 | 64 | 0.005 |

Table 1: Hyperparameters selected for each dataset.

| Elemental property | # of categories | Categories |
|---|---|---|
| Block | 3 | s, p, d |
| Group | 16 | 1, 2, …, 16 |
| Radius in pm (Cordero et al. (2008)) | 5 | [83, 116), [116, 148), [148, 180), [180, 212), [212, 244) |
| Electronegativity (Mentel) | 5 | [0.788, 1.112), [1.112, 1.434), [1.434, 1.756), [1.756, 2.078), [2.078, 2.4) |

Table 2: The categories of each elemental property used in the logistic regression models.

III Supplementary Figures

Figure S1: The evolution of the element representations as the amount of training data increases. The numbers of training data used are (a) 234, (b) 937, (c) 3,750, and (d) 15,000.
Figure S2: Example local environments of elemental boron in the four regions: (a-c) disconnected, (d-f) amorphous, (g-i) layered, and (j-l) icosahedron.
Figure S3: The boron fullerene local environments in the boron structural space. The representations of the distinct local environments in the two B_40 structures are plotted in the original boron structural space of Fig. 4.
Figure S4: Visualization of the two principal dimensions of the element representations learned from the Materials Project dataset using principal component analysis.
Figure S5: The local energy of oxygen (upper) and sulfur (lower) coordination environments as a function of atomic number. The blue dotted line denotes the electronegativity of each element.
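
The two-dimensional projection named in the caption of Figure S4 can be sketched as follows. This is a generic principal component analysis written with NumPy, not the authors' code; the function name `pca_2d` is hypothetical, and the random array only stands in for learned element representations, with the length 64 taken from Table 1.

```python
import numpy as np


def pca_2d(X):
    """Project the rows of X onto their first two principal components."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)  # center each feature
    # Rows of Vt are the principal directions, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T


rng = np.random.default_rng(0)
# Hypothetical stand-in data: 10 "elements" with 64-dimensional representations.
representations = rng.normal(size=(10, 64))
coords = pca_2d(representations)
print(coords.shape)
```

Each row of `coords` gives the two plotting coordinates of one element; the first column carries at least as much variance as the second.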