Distributed Representations of Atoms and Materials for Machine Learning

07/30/2021 ∙ by Luis M. Antunes, et al. ∙ University of Reading

The use of machine learning is becoming increasingly common in computational materials science. To build effective models of the chemistry of materials, useful machine-based representations of atoms and their compounds are required. We derive distributed representations of compounds from their chemical formulas only, via pooling operations of distributed representations of atoms. These compound representations are evaluated on ten different tasks, such as the prediction of formation energy and band gap, and are found to be competitive with existing benchmarks that make use of structure, and even superior in cases where only composition is available. Finally, we introduce a new approach for learning distributed representations of atoms, named SkipAtom, which makes use of the growing information in materials structure databases.




1 Introduction

In recent years, the study of machine learning (ML) has had a significant impact on many disciplines. Accordingly, materials science and chemistry have recently seen a surge of interest in applying the most recent advances in ML to the problems of the field [15, 39, 13, 2, 35, 40]. A central problem in materials science is the rational design of materials with specific properties. Typically, useful materials have been discovered serendipitously [8]. With the advent of ubiquitous and capable computing infrastructure, materials discovery has been increasingly aided by computational chemistry, especially density functional theory (DFT) simulations [46]. Such theoretical calculations are indispensable when investigating the properties of novel materials. However, they are computationally intensive, and performing such analysis on large numbers of compounds (the number of chemically sensible stoichiometric quaternary compounds alone is very large [5]) becomes impractical with today's computing technology. Moreover, certain chemical systems, such as those with very strongly correlated electrons, or with high levels of disorder, remain a theoretical challenge to DFT [9, 26].

The application of ML to materials science aims to ameliorate some of these problems, by providing alternate computational routes to properties of interest. There have been numerous examples of the successful application of ML to chemical systems. Techniques from ML have been used to predict very local and detailed properties, such as atomic and molecular orbital energies and geometries [34] or partial charges [36], and also global properties, such as the formation energy and band gap of a given compound [12, 45, 6, 1].

For an ML algorithm to work effectively, the objects of the system of interest must be converted into faithful representations that can be consumed in a computational context. Deriving such representations has been a main focus for researchers in ML, and in the case of Deep Learning, such representations are typically learned automatically, as part of the training process. Related to this is the concept of Unsupervised Learning, where patterns in the data are derived without the use of labels, or other forms of supervision. Indeed, given that most data is unlabelled, such techniques are very valuable. Some of the most successful and widely used algorithms, such as Word2Vec from the field of Natural Language Processing (NLP), use unsupervised learning to derive effective representations of the objects in the system of interest (words, in this case) [27, 23].

The most basic object of interest in chemical systems is very often the atom. Thus, there have already been several investigations examining the derivation of effective machine representations of atoms in an unsupervised setting [44, 41, 3], and other investigations have aimed to learn good atomic representations in the context of a supervised learning task [19, 14].

A learned representation of an atom generally takes the form of an embedding, which can be described as a relatively low-dimensional space in which higher-dimensional vectors can be expressed. Using embeddings in an ML task is advantageous, as the number of input dimensions is typically lower than if higher-dimensional sparse vectors were used. Moreover, the embeddings of semantically similar objects reside closer together in the space, which provides a more principled structure to the input data. Such representations should allow ML models to learn a task more quickly and effectively.

A widely held hypothesis in ML is that unlabelled data can be used to learn effective representations. In this work, we introduce a new approach for learning atomic representations using an unsupervised approach. This approach, which we name SkipAtom, is inspired by the Skip-gram model in NLP, and takes advantage of the large number of inorganic structures in materials databases. We also investigate forming representations of chemical compounds by pooling atomic representations. Combining vectors by various pooling operations to create representations of systems composed from parts (e.g. sentences from words) is a common technique in NLP, but apparently remains largely unexplored in materials informatics [28]. The analogy we explore here is that atoms are to compounds as words are to sentences, and our results demonstrate that effective representations of compounds can be composed from the vector representations of the constituent atoms. Finally, a common problem when searching chemical space for new materials is that the structure of a compound may not be known. Since the properties of a material are typically tightly coupled to its structure, this creates a significant barrier [25]. Here, we compare our models, which operate on representations derived from chemical formulas only, to benchmarks that are based on models that use structural information. We find that, for certain tasks, the performance of the composition-only models is comparable.

2 Representations of Atoms and Compounds

There are various strategies for providing an atom with a machine representation. These range from very simple and unstructured approaches, such as assigning a random vector to each atom, to more sophisticated approaches, such as learning distributed representations. A distributed representation is a characterization of an object attained by embedding in a continuous vector space, such that similar objects will be closer together.

Similarly, a compound may be assigned a machine representation. Again, these representations may be learned on a case-by-case basis, or they may be formed by composing existing representations of the corresponding atoms.

2.1 Atomic Representations

We are interested in deriving representations of atoms that can be used in a computational context, such as an ML task. Intuitively, we would like the representations of similar atoms to be similar as well. Given that atoms are multifaceted objects, a natural choice for a computational descriptor for an atom might be a vector: an $n$-tuple of real numbers. Vector spaces are well understood, and can provide the degrees of freedom necessary to express the various facets that constitute an atom. Moreover, with an appropriately selected vector space, such atomic representations can be subjected to the various vector operations to quantify relationships and to compose descriptions of systems of atoms, or compounds.

2.1.1 Random Vectors

The simplest approach to assigning a vector description to an atom is to simply draw a random vector from $\mathbb{R}^n$, and assign it to the atom. Such vectors can come from any distribution desired, but in this report, such vectors will come from the standard normal distribution, $\mathcal{N}(0, 1)$.


2.1.2 One-hot Vectors

One-hot vectors, common in ML, are binary vectors that are used for distinguishing between various categories. One assigns a vector component to each category of interest, and sets the value of the corresponding component to 1 when the vector is describing a given category, and the value of all other components to 0. More formally, a one-hot $n$-dimensional vector $\mathbf{v}$ is in the set $\{0, 1\}^n$ such that $\sum_{i=1}^{n} v_i = 1$, where $v_i$ is a component of $\mathbf{v}$. A unique one-hot vector is assigned to each category. In the context of this report, a category is an atom (Figure 1a).
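The two simplest representations described so far, random vectors and one-hot vectors, can be sketched in a few lines. This is a minimal illustration assuming NumPy; the three-atom vocabulary, the names `atoms` and `dim`, and the seed are hypothetical choices and are not taken from the paper.

```python
import numpy as np

# Hypothetical vocabulary for illustration; the experiments in this
# report use 86 atom types.
atoms = ["H", "He", "Pu"]
dim = 5  # illustrative dimensionality for the random vectors

rng = np.random.default_rng(0)

# Random vectors: one draw from the standard normal distribution per atom.
random_vectors = {a: rng.standard_normal(dim) for a in atoms}

# One-hot vectors: row i of the identity matrix represents atom i.
identity = np.eye(len(atoms))
one_hot_vectors = {a: identity[i] for i, a in enumerate(atoms)}

print(one_hot_vectors["He"])  # [0. 1. 0.]
```

Every one-hot vector sums to 1 and has exactly one non-zero component, which matches the formal definition above.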

Figure 1: a Scheme illustrating one-hot and distributed representations of atoms. In the diagram, $n$ atoms are represented, and $d$ is the adjustable number of dimensions of the distributed representation. Note that the atoms in this example are H, He and Pu, but they could be any atom. b Scheme describing how training data is derived for the creation of SkipAtom vectors. Here, a graph representing the atomic connectivity in the structure of Ba2N4 is depicted, and the resulting target-context atom pairs derived for training. The graph is derived from the unit cell of Ba2N4. c Scheme describing how the SkipAtom vectors are derived through training. Here, a one-hot vector representing a particular target atom is transformed into an intermediate vector via multiplication with the embedding matrix, whose columns will be the final atom vectors after training. Training consists of minimizing the cross-entropy loss between the output vector and the one-hot vector representing the context atom. The output is obtained by applying the softmax function to the product of a second weight matrix and the intermediate vector.

2.1.3 Atom2Vec

If one may know a word by the company it keeps, then the same might be said of an atom. In 2018, Zhou and coworkers described an approach for deriving distributed atom vectors that involves generating a co-occurrence count matrix of atoms and their chemical environments, using an existing database of materials, and applying Singular Value Decomposition (SVD) to the matrix [44]. The number of dimensions of the resulting atomic vectors is limited to the number of atoms used in the matrix.
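The idea of factorizing a co-occurrence matrix with SVD can be illustrated with a toy example. The sketch below assumes NumPy; the count matrix is invented for illustration, and the choice to scale the left singular vectors by the singular values is one common convention rather than a detail confirmed by the text above.

```python
import numpy as np

# Toy co-occurrence counts: rows are atoms, columns are chemical
# environments. These numbers are made up for illustration only.
atoms = ["Na", "K", "Cl", "Br"]
counts = np.array([
    [4.0, 3.0, 0.0, 0.0],
    [3.0, 4.0, 0.0, 1.0],
    [0.0, 0.0, 5.0, 4.0],
    [0.0, 1.0, 4.0, 5.0],
])

# SVD factorizes the matrix; keeping the top-k singular directions
# gives a k-dimensional vector for each atom (each row).
U, S, Vt = np.linalg.svd(counts, full_matrices=False)
k = 2
atom_vectors = U[:, :k] * S[:k]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

v_na, v_k, v_cl = atom_vectors[0], atom_vectors[1], atom_vectors[2]
print(cosine(v_na, v_k), cosine(v_na, v_cl))
```

In this toy matrix, Na and K share environments, so their vectors end up more similar to each other than to the Cl vector. Because the matrix has only four rows, at most four dimensions are available, which mirrors the dimensionality limit noted above.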

2.1.4 Mat2Vec

A popular means of generating word vectors in NLP is through the application of the Word2Vec algorithm, wherein an unsupervised learning task is employed [27]. Given a corpus (a collection of text), the goal is to predict the likelihood of a word occurring in the context of another. A neural network architecture is employed, and the learned parameters of the projection layer constitute the word vectors that result after training. In 2019, Tshitoyan and coworkers described an approach for deriving distributed atom vectors by making direct use of the materials science literature [41]. Instead of using a database of materials, they assembled a textual corpus from millions of scientific abstracts related to materials science research, and then applied the Word2Vec algorithm to derive the atom representations.

2.1.5 SkipAtom

In the NLP Skip-gram model, an occurrence of a word in a corpus is associated with the words that co-occur within a context window of a certain size. The task is to predict the context words given the target word. Although the aim is not to build a classifier, the act of tuning the parameters of the model so that it is able to predict the context of a word results in a parameter matrix that acts effectively as the embedding table for the words in the corpus. Words that share the same contexts should share similar semantic content, and this is reflected in the resulting learned low-dimensional space. Analogously, atoms that share the same chemo-structural environments should share similar chemistry.

In the SkipAtom approach, the crystal structures of materials from a database are used in the form of a graph, representing the local atomic connectivity in the material, to derive a dataset of connected atom pairs (Figure 1b). Then, similarly to the Skip-gram approach of the Word2Vec algorithm, Maximum Likelihood Estimation is applied to the dataset to learn a model that aims to predict a context atom given a target atom.

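More formally, a materials database consists of a set of materials, $M$. A material, $m \in M$, can be represented as an undirected graph, consisting of a set of atoms, $A_m$, comprising the material, and bonds, $B_m \subseteq A_m \times A_m$, which are unordered pairs of atoms. The task is to maximize the average log probability:

$$\frac{1}{|M|} \sum_{m \in M} \sum_{a \in A_m} \sum_{n \in N(a)} \log p(n \mid a)$$

where $N(a)$ are the neighbours of $a$ (not including $a$ itself); more specifically: $N(a) = \{ x \mid x \in A_m, (a, x) \in B_m \}$.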
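The derivation of target-context pairs from a material graph can be sketched as follows. This is an illustrative example in plain Python; the connectivity in `bonds` is invented and is not the actual Ba2N4 graph of Figure 1b, and the function name `target_context_pairs` is hypothetical.

```python
from collections import Counter

# A toy undirected graph for one material: atoms are nodes labelled by
# element, bonds are unordered pairs of node indices. The connectivity
# below is invented for illustration, not taken from a real structure.
atoms = ["Ba", "Ba", "N", "N", "N", "N"]
bonds = [(0, 2), (0, 3), (1, 4), (1, 5), (2, 3), (4, 5)]

def target_context_pairs(atoms, bonds):
    """Return (target, context) element pairs, one per bonded neighbour."""
    pairs = []
    for i, j in bonds:
        # Each undirected bond yields a pair in both directions, because
        # each atom belongs to the neighbourhood of the other.
        pairs.append((atoms[i], atoms[j]))
        pairs.append((atoms[j], atoms[i]))
    return pairs

pairs = target_context_pairs(atoms, bonds)
print(Counter(pairs))
```

Each of the six bonds contributes two ordered pairs, so this toy graph produces twelve training examples.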

In practice, this means that the cross-entropy loss between the one-hot vector representing the context atom and the normalized probabilities produced by the model, given the one-hot vector representing the target atom, is minimised (Figure 1c).
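A minimal sketch of this training objective is given below, assuming NumPy. It uses plain full-batch gradient descent rather than the Adam optimizer used in this work, stores atom vectors as rows rather than columns of the embedding matrix, and trains on a hypothetical list of index pairs; the function name and all hyperparameters are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_skipatom_sketch(pairs, n_atoms, dim, lr=0.1, epochs=300, seed=0):
    """Toy full-batch gradient descent on the cross-entropy loss."""
    rng = np.random.default_rng(seed)
    W_embed = rng.normal(scale=0.1, size=(n_atoms, dim))  # rows: atom vectors
    W_out = rng.normal(scale=0.1, size=(dim, n_atoms))
    targets = np.array([t for t, _ in pairs])
    contexts = np.array([c for _, c in pairs])
    Y = np.eye(n_atoms)[contexts]  # one-hot context atoms
    rows = np.arange(len(pairs))
    losses = []
    for _ in range(epochs):
        H = W_embed[targets]       # same as one_hot(target) @ W_embed
        P = softmax(H @ W_out)     # predicted context probabilities
        losses.append(float(-np.mean(np.log(P[rows, contexts]))))
        G = (P - Y) / len(pairs)   # gradient of the loss w.r.t. the logits
        grad_out = H.T @ G
        grad_H = G @ W_out.T
        grad_embed = np.zeros_like(W_embed)
        np.add.at(grad_embed, targets, grad_H)  # accumulate per target atom
        W_out -= lr * grad_out
        W_embed -= lr * grad_embed
    return W_embed, losses

# Hypothetical (target, context) pairs over 3 atom types, indexed 0..2.
pairs = [(0, 2), (2, 0), (1, 2), (2, 1), (0, 2), (2, 0)]
vectors, losses = train_skipatom_sketch(pairs, n_atoms=3, dim=2)
print(losses[0], losses[-1])
```

After training, each row of `vectors` plays the role of a learned atom vector. Indexing `W_embed` by the target index is equivalent to multiplying a one-hot vector by the embedding matrix, which is why the one-hot input never needs to be materialized.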

The graph representing a material can be derived using any approach desired, but in this work, an approach is used which is based on a Voronoi algorithm, which identifies nearest neighbours using solid angle weights to determine the probability of various coordination environments.

The result of SkipAtom training is a set of vectors, one for each atom of interest (Figure 1a), that reflects the unique chemical nature of the represented atom, as well as its relationship to other atoms.

A complicating factor in the procedure just described is that some atoms may be under-represented in the database, relative to others. This will result in the parameters of those infrequently occurring atoms receiving fewer updates during training, resulting in lower quality representations for those atoms. This is an issue when learning word representations as well, and there have been several solutions proposed in the context of NLP [32, 33]. Borrowing from these solutions, we apply an additional, optional processing step to the learned vectors, termed induction. The aim is to adjust the learned vectors so that they reside in a more sensible area of the representation space. To achieve this, each atom is first represented as a triple, given by its periodic table group number and row number, and its electronegativity. Then, for each atom, the closest atoms are obtained, in terms of the cosine similarity between the vectors formed from these triples. Using the learned embeddings for these closest atoms, a mean nearest-neighbour representation is derived, and the induced atom vector, $\hat{\mathbf{u}}$, is formed by adding the original atom vector, $\mathbf{u}$, to the mean nearest neighbour:

$$\hat{\mathbf{u}} = \mathbf{u} + \frac{1}{N} \sum_{k=1}^{N} \mathbf{v}_k$$

where $N$ is the number of closest atoms to consider, and $\mathbf{v}_k$ is the learned embedding of the $k$th nearest atom from the sorted list of nearest atoms. In this work, the nearest 5 atoms are considered.
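The induction step, as described in the prose above, can be sketched as follows. This is an illustrative NumPy example that assumes an unweighted mean over the nearest atoms and uses the raw (unscaled) triples for the cosine similarity; both are assumptions, as is the function name `induce`. The electronegativities are approximate Pauling values, and the identity matrix stands in for learned embeddings.

```python
import numpy as np

def induce(embeddings, descriptors, n_nearest=5):
    """Add to each atom vector the mean embedding of its nearest atoms.

    embeddings:  (n_atoms, dim) learned atom vectors
    descriptors: (n_atoms, 3) triples of (group, row, electronegativity)
    """
    unit = descriptors / np.linalg.norm(descriptors, axis=1, keepdims=True)
    sim = unit @ unit.T              # pairwise cosine similarity
    np.fill_diagonal(sim, -np.inf)   # an atom is not its own neighbour
    induced = np.empty_like(embeddings)
    for i in range(len(embeddings)):
        nearest = np.argsort(-sim[i])[:n_nearest]
        induced[i] = embeddings[i] + embeddings[nearest].mean(axis=0)
    return induced

symbols = ["Li", "Na", "K", "F", "Cl"]
descriptors = np.array([
    [1.0, 2.0, 0.98],
    [1.0, 3.0, 0.93],
    [1.0, 4.0, 0.82],
    [17.0, 2.0, 3.98],
    [17.0, 3.0, 3.16],
])
embeddings = np.eye(len(symbols))  # stand-in for learned vectors
induced = induce(embeddings, descriptors, n_nearest=1)
print(induced[symbols.index("F")])  # [0. 0. 0. 1. 1.]
```

With a single nearest neighbour, the closest atom to F in this toy set is Cl, so the induced F vector is the sum of the F and Cl stand-in vectors.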

2.2 Compound Representations

Atom vectors by themselves may not be directly useful, as most problems in materials informatics involve chemical compounds. However, atom vectors can be combined to form representations of compounds.

2.2.1 Atom Vector Pooling

The most basic and general way of combining atom vectors to form a representation for a compound is to perform a pooling operation on the atom vectors corresponding to the atoms in the chemical formula for the compound. There are three common pooling operations: sum-pooling, mean-pooling, and max-pooling.

Sum-pooling involves performing component-wise addition of the atom vectors for the atoms in the chemical formula. That is, for a chemical compound whose formula is comprised of $m$ constituent elements, and a set of atom vectors, $U$, the compound vector, $\mathbf{c}$, is given by:

$$\mathbf{c} = \sum_{i=1}^{m} x_i \mathbf{u}_i$$

where $\mathbf{u}_i \in U$ is the corresponding atom vector for the $i$th constituent element in the formula, and $x_i$ is the relative number of atoms of the $i$th constituent element (which need not be a whole number, as in the case of non-stoichiometric compounds).

Mean-pooling involves performing component-wise addition of the atom vectors for the atoms in the chemical formula, followed by dividing by the total number of atoms in the formula:

$$\mathbf{c} = \frac{1}{\sum_{i=1}^{m} x_i} \sum_{i=1}^{m} x_i \mathbf{u}_i$$

Finally, max-pooling involves taking the maximum value for each component of the vectors being pooled:

$$\mathbf{c} = \operatorname{Max}_{i=1}^{m} \; x_i \mathbf{u}_i$$

where $\operatorname{Max}$ returns a vector where each component has the maximum value of that component across input vectors.
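The three pooling operations can be sketched with NumPy as below. The atom vectors and the Fe2O3 example are hypothetical, and weighting each vector by its amount before taking the component-wise maximum in `max_pool` is an assumption of this sketch.

```python
import numpy as np

def sum_pool(amounts, vectors):
    """Component-wise sum of atom vectors, each weighted by its amount."""
    return np.sum([x * u for x, u in zip(amounts, vectors)], axis=0)

def mean_pool(amounts, vectors):
    """Sum-pooled vector divided by the total number of atoms."""
    return sum_pool(amounts, vectors) / sum(amounts)

def max_pool(amounts, vectors):
    """Component-wise maximum over the amount-weighted atom vectors."""
    return np.max([x * u for x, u in zip(amounts, vectors)], axis=0)

# Hypothetical 3-dimensional atom vectors for Fe and O.
fe = np.array([1.0, 0.0, 2.0])
o = np.array([0.5, 1.0, -1.0])

# Fe2O3: two Fe atoms and three O atoms.
amounts, vectors = [2, 3], [fe, o]
print(sum_pool(amounts, vectors))   # [3.5 3.  1. ]
print(mean_pool(amounts, vectors))  # [0.7 0.6 0.2]
print(max_pool(amounts, vectors))   # [2. 3. 4.]
```

Because the amounts are plain multipliers, fractional amounts for non-stoichiometric compounds work without any change.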

2.2.2 ElemNet (Mean-pooled One-hot Vectors)

If we assign a unique one-hot vector to each atom, and perform mean-pooling of these vectors when forming a representation for a chemical compound, then the result is the same as the input representation for the ElemNet model [19]. Such a compound vector is sparse (as most compounds do not typically contain more than 5 or 6 atom types). Each component of the vector contains the unit normalized amount of the atom in the formula. For example, for H2O, the component corresponding to H would have a value of 0.67 (that is, 2/3), whereas the component corresponding to O would have a value of 0.33 (that is, 1/3), and all other components would have a value of zero.

2.2.3 Bag-of-Atoms (Sum-pooled One-hot Vectors)

In NLP, the Bag-of-Words is a common representation used for sentences and documents. It is formed by simply performing sum-pooling of the one-hot vectors for each word in the text. Similarly, we can conceive of a Bag-of-Atoms representation for chemical informatics, where sum-pooling is performed with the one-hot vectors for the atoms in a chemical formula. The result is a list of counts of each atom type in the formula. This is an unscaled version of the ElemNet representation. Crucially, this sum-pooling of one-hot vectors is more appropriate for describing compounds than it is for describing natural language sentences, as there is no significance to the order of atoms in a chemical formula as there is for the order of words in a sentence.
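The relationship between the Bag-of-Atoms and ElemNet representations can be made concrete with the H2O example. This is a small sketch assuming NumPy and a hypothetical three-atom vocabulary; the function names are illustrative.

```python
import numpy as np

atom_index = {"H": 0, "O": 1, "Fe": 2}  # hypothetical 3-atom vocabulary

def bag_of_atoms(formula):
    """Sum-pooled one-hot vectors: a count of each atom type."""
    vec = np.zeros(len(atom_index))
    for atom, amount in formula.items():
        vec[atom_index[atom]] += amount
    return vec

def elemnet_input(formula):
    """Mean-pooled one-hot vectors: the fraction of each atom type."""
    counts = bag_of_atoms(formula)
    return counts / counts.sum()

water = {"H": 2, "O": 1}
print(bag_of_atoms(water))   # [2. 1. 0.]
print(elemnet_input(water))
```

The ElemNet-style vector is simply the Bag-of-Atoms vector divided by the total atom count, so for H2O the H and O components are 2/3 and 1/3.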

3 Experiments

3.1 Tasks

A number of diverse materials ML tasks are utilized to evaluate the effectiveness of the pooled atom vector representations, and the quality of the SkipAtom representation. In total, ten previously described tasks are utilized, and are broadly divided into two categories: those used for evaluating the pooling approach, and those used for evaluating the SkipAtom approach. To evaluate the pooling approach, nine tasks are chosen, and are described in Table 1.

| Task | Type | Examples | Structure? | Method |
|---|---|---|---|---|
| Band Gap (eV) | Regression | 4,604 | No | Experiment [45] |
| Band Gap (eV) | Regression | 106,113 | Yes | DFT-GGA [17, 30] |
| Bulk Modulus (log(GPa)) | Regression | 10,987 | Yes | DFT-GGA [7] |
| Shear Modulus (log(GPa)) | Regression | 10,987 | Yes | DFT-GGA [7] |
| Refractive Index (n) | Regression | 4,764 | Yes | DFPT-GGA [31] |
| Formation Energy (eV/atom) | Regression | 275,424 | Yes | DFT [19, 38] |
| Bulk Metallic Glass Formation | Classification | 5,680 | No | Experiment [43, 21] |
| Metallicity | Classification | 4,921 | No | Experiment [45] |
| Metallicity | Classification | 106,113 | Yes | DFT-GGA [17, 30] |

Table 1: The predictive tasks utilized in this study to evaluate the atom vector pooling approach. All datasets and benchmarks for the tasks above are described in [10], with the exception of the Formation Energy task, which is described in [19].

The tasks were chosen to represent the various scenarios encountered in materials data science, such as the availability of both smaller and larger datasets, the need for either regression or classification, the availability of material structure information, and the means (experiment or theory) by which the data is obtained. The OQMD (Open Quantum Materials Database) Formation Energy task [19, 38] requires a different training protocol, as it was derived from a different study than the other eight tasks that are used for the pooling approach, which were sourced from the Matbench test suite [10].

To evaluate the SkipAtom representation, the Elpasolite Formation Energy task was utilized. The task and the model were initially described in the paper that introduced Atom2Vec (an alternative approach for learning atom vectors) [44]. The task consists of predicting the formation energy of elpasolites, which are comprised of a quaternary crystal structure, and have the general formula ABC2D6. The target formation energies for 5,645 examples were obtained by DFT [11]. The input consists of a concatenated sequence of atom vectors, each representing the A, B, C, and D atoms. We reproduce the approach here, for comparison against the Atom2Vec results.

All tasks require a representation of a material as input, and produce a prediction of a physical property as output, in either a regression or classification setting. Moreover, with the exception of the Elpasolite Formation Energy task, all tasks make use of the same model architecture (described in detail below).

3.2 Protocols

For the purposes of evaluation, the atom and compound vectors were utilized as inputs to feed-forward neural networks. All results for evaluating the pooling approach were obtained using a 17-layer feed-forward neural network architecture based on ElemNet [19]. The network comprised 4 layers with 1,024 neurons, followed by 3 layers with 512 neurons, 3 layers with 256 neurons, 3 layers with 128 neurons, 2 layers with 64 neurons, and 1 layer with 32 neurons, all with ReLU activation. For regression tasks, the output layer consisted of a single neuron with linear activation. For classification tasks, the output layer consisted of a single neuron with sigmoid activation (as only binary classification was performed). Instead of using dropout layers for regularization, as in the ElemNet approach, L2 regularization was used. The goal during training was to minimise the Mean Absolute Error loss (for regression tasks), or the Binary Cross-entropy loss (for classification tasks). All pooling approach experiments utilized a mini-batch size of 32 and the Adam optimizer [22]. As described in the paper that introduces the Matbench test set [10], $k$-fold cross-validation was performed to evaluate the compound vectors in regression tasks, with the same random seed to ensure the same splits were used each time. For classification tasks, stratified $k$-fold cross-validation was performed. As required by the benchmarking protocol, 5 splits were used (with the exception of the OQMD Formation Energy prediction task, which used 10 splits). Because the variance was high for some tasks after $k$-fold cross-validation, repeated $k$-fold cross-validation was performed, to reduce the variance [29]. All training was carried out for 100 epochs, and the best performing epoch was chosen as the result for that split. By following this protocol, a direct and fair comparison can be made to results reported previously using the same Matbench test set [10].
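The layer layout described above (16 hidden layers plus one output layer) can be sketched as a forward pass in NumPy. This is an untrained illustration only: the weight initialization, the function names, and the input size of 86 are assumptions, and no regularization or optimizer is shown.

```python
import numpy as np

# Hidden layer widths as described: 4x1024, 3x512, 3x256, 3x128, 2x64, 1x32.
hidden_widths = [1024] * 4 + [512] * 3 + [256] * 3 + [128] * 3 + [64] * 2 + [32]

def init_network(input_dim, seed=0):
    """Return (weights, bias) pairs for the hidden layers plus one output."""
    rng = np.random.default_rng(seed)
    widths = [input_dim] + hidden_widths + [1]
    return [
        (rng.normal(scale=np.sqrt(2.0 / n_in), size=(n_in, n_out)),
         np.zeros(n_out))
        for n_in, n_out in zip(widths[:-1], widths[1:])
    ]

def forward(params, x, classification=False):
    for W, b in params[:-1]:
        x = np.maximum(x @ W + b, 0.0)    # ReLU hidden layers
    W, b = params[-1]
    out = x @ W + b                        # linear output (regression)
    if classification:
        out = 1.0 / (1.0 + np.exp(-out))   # sigmoid output (binary)
    return out

params = init_network(input_dim=86)
print(len(params))  # 17
```

The count of 17 comes from 4 + 3 + 3 + 3 + 2 + 1 = 16 hidden layers and the single-neuron output layer. Only the final activation differs between the regression and classification settings.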


The results for evaluating the SkipAtom approach were obtained using the Elpasolite neural network architecture and protocol, originally described in the paper that introduces Atom2Vec [44]. The input to the neural network is a vector constructed by concatenating 4 atom vectors, representing each of the 4 atoms in an elpasolite composition. The single hidden layer consists of 10 neurons, with ReLU activation. The output layer consists of a single neuron, with linear activation. L2 regularization was used. The goal during training was to minimise the Mean Absolute Error loss. The training protocol differs slightly in this report: 10-fold cross-validation was performed, utilizing the result after 200 epochs of training. The same random seed was used for all experiments, to ensure the same splits were utilized. A mini-batch size of 32 and a learning rate of 0.001 were used, along with the Adam optimizer [22].
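The input construction and network shape for the Elpasolite task can be sketched as below, assuming NumPy. The atom vectors are random stand-ins, the weights are untrained, and the assignment of Al, Na, K and F to the A, B, C and D sites is illustrative only.

```python
import numpy as np

def elpasolite_input(site_atoms, atom_vectors):
    """Concatenate the vectors for the A, B, C and D atoms, in that order."""
    return np.concatenate([atom_vectors[a] for a in site_atoms])

def predict(x, W1, b1, w2, b2):
    hidden = np.maximum(x @ W1 + b1, 0.0)  # single hidden layer, ReLU
    return hidden @ w2 + b2                # single linear output neuron

dim = 30
rng = np.random.default_rng(0)

# Random stand-ins for learned atom vectors.
atom_vectors = {a: rng.standard_normal(dim) for a in ["Al", "Na", "K", "F"]}
x = elpasolite_input(["Al", "Na", "K", "F"], atom_vectors)

# Untrained weights: 4 * dim inputs, 10 hidden neurons, 1 output.
W1, b1 = rng.normal(scale=0.1, size=(4 * dim, 10)), np.zeros(10)
w2, b2 = rng.normal(scale=0.1, size=10), 0.0

print(x.shape)  # (120,)
print(predict(x, W1, b1, w2, b2))
```

The input size is four times the atom vector dimensionality, so 30-dimensional atom vectors give a 120-dimensional input.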

Learning of the SkipAtom vectors involved the use of the Materials Project database [18]. To assemble the training set, 126,335 inorganic compound structures were downloaded from the database. Each of these structures was converted into a graph representation using an approach based on a Voronoi algorithm, and a dataset of co-occurring atom pairs was derived. A total of 15,360,652 atom pairs were generated, utilizing 86 distinct atom types. The architecture consisted of a single hidden layer with linear activation, whose size depended on the desired dimensionality of the learned embeddings, and an output layer with 86 neurons (one for each of the utilized atom types) with softmax activation. The training objective consisted of minimizing the cross-entropy loss between the predicted context atom probabilities and the one-hot vector representing the context atom, given the one-hot vector representing the target atom as input. Training utilized stochastic gradient descent with the Adam optimizer and a mini-batch size of 1,024, for 10 epochs.

3.3 Results and Discussion

A common technique for making high-dimensional data easier to visualize is t-SNE (t-Stochastic Neighbour Embedding) [42]. Such a technique reduces the dimensionality of the data, typically to 2 dimensions, so that it can be plotted. Visualizing learned distributed representations in this way can provide some intuition regarding the quality of the embeddings and the structure of the learned space. In Figure 2, the 200-dimensional learned SkipAtom vectors are plotted after utilizing t-SNE to reduce their dimensionality to 2. It is evident that there is a logical structure to the data. We see that the alkali metals are clustered together, as are the light non-metals, for example. The relative locations of the atoms in the plot reflect chemo-structural nuances gleaned from the dataset, and are not arbitrary.

Figure 2: Dimensionally reduced SkipAtom atom vectors with an original size of 200 dimensions. The vectors were reduced to 2 dimensions using t-SNE.

To properly evaluate the quality of learned distributed representations, they are utilized in the context of a task, and their performance is compared to that of other representations. Here, we use the Elpasolite Formation Energy prediction task, and compare the performance of the SkipAtom vectors to the performance of other representations, namely, Random vectors, One-hot vectors, Mat2Vec and Atom2Vec vectors. In the original study that introduced the task, atom vectors were 30- and 86-dimensional. We trained SkipAtom vectors with the same dimensions, and also with 200 dimensions, and evaluated them. The results are summarized in Table 2.

| Representation | Dim | MAE (eV/atom) |
|---|---|---|
| Atom2Vec | 30 | 0.1477 ± 0.0078 |
| SkipAtom | 30 | 0.1183 ± 0.0050 |
| Random | 30 | 0.1701 ± 0.0081 |
| Atom2Vec | 86 | 0.1242 ± 0.0066 |
| One-hot | 86 | 0.1218 ± 0.0085 |
| SkipAtom | 86 | 0.1126 ± 0.0078 |
| Random | 86 | 0.1190 ± 0.0085 |
| Mat2Vec | 200 | 0.1126 ± 0.0058 |
| SkipAtom | 200 | 0.1089 ± 0.0061 |
| Random | 200 | 0.1158 ± 0.0050 |

Table 2: Elpasolite Formation Energy prediction results after 10-fold cross-validation; mean best formation energy MAE on the test set after 200 epochs of training in each fold. Batch size was 32, learning rate was 0.001. Note that Dim refers to the dimensionality of the atom vector; the size of the input vector is 4 × Dim. All results were generated using the same procedure on identical train/test folds.

For all embedding dimension sizes, SkipAtom outperforms the other representations on the Elpasolite Formation Energy task (Mat2Vec vectors were only available for this study in 200 dimensions, and Atom2Vec vectors, by virtue of how they are created, cannot have more dimensions than atom types represented). In Figure 3, a plot of how the mean absolute error changes during training demonstrates that the SkipAtom representation achieves better results from the beginning of training, and maintains the performance throughout.

Figure 3: A plot of the mean absolute error during training for the Elpasolite Formation Energy prediction task, for the Atom2Vec and SkipAtom representations. The average MAE over 10 folds is plotted.

Similar to atom vectors, compound vectors formed by the pooling of atom vectors can be dimensionally reduced, and visualized with t-SNE, or with PCA (Figure 4a). In Figure 4b, a sampling of several thousand compound vectors, formed by the sum-pooling of one-hot vectors, were reduced to 2 dimensions using t-SNE, and plotted. Additionally, since each compound vector represents a compound in the OQMD dataset, which contains associated formation energies, a color is assigned to each point in the plot denoting its formation energy. A clear distinction can be made across the spectrum of compounds and their formation energies. The vector representations derived from the composition of atom vectors appear to have preserved the relationship between atomic composition and formation energy.
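As a rough illustration of the dimensionality reduction step, the following sketch projects vectors onto their first two principal components using NumPy's SVD. It shows PCA only (t-SNE is not implemented here), and the random matrix is a stand-in for real compound vectors; the function name `pca_2d` is hypothetical.

```python
import numpy as np

def pca_2d(X):
    """Project the rows of X onto their first two principal components."""
    Xc = X - X.mean(axis=0)  # centre each feature
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T

rng = np.random.default_rng(0)
compound_vectors = rng.standard_normal((50, 200))  # stand-in data
coords = pca_2d(compound_vectors)
print(coords.shape)  # (50, 2)
```

The two output columns can be used directly as x and y coordinates in a scatter plot, with a property such as formation energy mapped to colour.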

Figure 4: a Plot of 200-dimensional SkipAtom vectors for Cr, Ni, and Zr, and their mean-pooled oxides, dimensionally reduced using PCA. b Plot of a sampling of the dimensionally-reduced compound vectors for the OQMD Dataset Formation Energy task, mapped to their associated physical values. The points are sum-pooled one-hot vectors reduced using t-SNE with a Hamming distance metric. The sum-pooled one-hot representation was the best performing for the task.

Again, as with atom vectors, the quality of a compound vector is best established by comparing its performance in a task. To evaluate the quality of pooled atom vectors, 9 predictive tasks were utilized, as described in Table 1. The performance on the benchmark regression tasks is summarized in Table 3, and the performance on the benchmark classification tasks is summarized in Table 4. Finally, the performance on the OQMD Formation Energy prediction task is summarized in Table 5.

Representation Dim EBG TBG BM SM RI
SkipAtom 86 0.3495(20) 0.2791(8) 0.0789(2) 0.1014(1) 0.3275(4)
Atom2Vec 86 0.3922(87) 0.2692(8) 0.0795(5) 0.1029(0) 0.3308(16)
Bag-of-Atoms / One-hot 86 0.3797(22) 0.2611(8) 0.0861(2) 0.1137(5) 0.3576(2)
ElemNet / One-hot 86 0.4060(72) 0.2582(3) 0.0853(1) 0.1155(1) 0.3409(16)
One-hot 86 0.3823(46) 0.2603(4) 0.0861(3) 0.1140(2) 0.3547(13)
Random 86 0.4109(58) 0.3180(16) 0.0908(4) 0.1195(2) 0.3593(6)
Mat2Vec 200 0.3529(7) 0.2741(2) 0.0776(0) 0.1014(2) 0.3236(17)
SkipAtom 200 0.3487(85) 0.2736(8) 0.0785(0) 0.1014(0) 0.3247(15)
Random 200 0.4058(4) 0.3083(21) 0.0871(1) 0.1163(2) 0.3543(6)
Table 3: Benchmark regression task results after 2-repeated 5- or 10-fold cross-validation; mean best MAE on the test set after 100 epochs of training in each fold. All results were generated using the same procedure on identical train/test folds. TBG refers to the Theoretical Band Gap task (MAE in eV), BM to the Bulk Modulus task (MAE in log(GPa)), SM to the Shear Modulus task (MAE in log(GPa)), and RI to the Refractive Index task (MAE in

). These tasks make use of structure information. EBG refers to the Experimental Band Gap task (MAE in eV), and it makes use of composition only. Only the best results for each representation are reported. The pooling procedure varies between results; blue results represent sum-pooling, red results represent mean-pooling, and teal results represent max-pooling. Numbers in parentheses represent the standard deviation to one part in

. See the Supplementary Information tables for more detailed results.
Representation Dim TM BMGF EM
SkipAtom 86 0.9520 0.0002 0.9436 0.0010 0.9645 0.0012
Atom2Vec 86 0.9526 0.0001 0.9316 0.0012 0.9582 0.0008
Bag-of-Atoms / One-hot 86 0.9490 0.0002 0.9277 0.0004 0.9600 0.0012
ElemNet / One-hot 86 0.9477 0.0001 0.9322 0.0014 0.9485 0.0007
One-hot 86 0.9487 0.0003 0.9289 0.0016 0.9599 0.0014
Random 86 0.9444 0.0000 0.9274 0.0006 0.9559 0.0021
Mat2Vec 200 0.9528 0.0002 0.9348 0.0024 0.9655 0.0014
SkipAtom 200 0.9524 0.0001 0.9349 0.0019 0.9645 0.0008
Random 200 0.9453 0.0001 0.9302 0.0016 0.9541 0.0002
Table 4: Benchmark classification task results after 2-repeated 5-fold stratified cross-validation; mean best ROC-AUC on the test set after 100 epochs of training in each fold. All results were generated using the same procedure on identical train/test folds. TM refers to the Theoretical Metallicity task, and makes use of structure information. BMGF refers to the Bulk Metallic Glass Formation task, and EM to the Experimental Metallicity task. These last two do not make use of structure information. Only the best results for each representation are reported. The pooling procedure varies between results; blue results represent sum-pooling, red results represent mean-pooling, and teal results represent max-pooling. See the Supplementary Information tables for more detailed results.
Representation            Dim   Pooling   MAE (eV/atom)
SkipAtom                  86    sum       0.0420 ± 0.0005
Atom2Vec                  86    sum       0.0396 ± 0.0004
Bag-of-Atoms / One-hot    86    sum       0.0388 ± 0.0002
ElemNet / One-hot         86    mean      0.0427 ± 0.0007
Random                    86    sum       0.0440 ± 0.0004
Mat2Vec                   200   sum       0.0401 ± 0.0004
SkipAtom                  200   sum       0.0408 ± 0.0003
Random                    200   sum       0.0417 ± 0.0004
Table 5: OQMD Dataset Formation Energy prediction results after 10-fold cross-validation; mean best formation energy MAE on the test set after 100 epochs of training in each fold. All results were generated using the same procedure on identical train/test folds.
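The cross-validation protocol named in the table captions can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used to produce the reported numbers: the function names are hypothetical, the targets are synthetic, and the "model" is a placeholder that predicts the mean of the training targets. The sketch also omits the selection of the best test MAE over 100 training epochs in each fold, and it assumes the reported spread is a standard deviation over folds.

```python
import random
import statistics


def repeated_kfold(n_samples, k, repeats=1, seed=0):
    """Yield (train_idx, test_idx) pairs for `repeats` shuffled k-fold splits."""
    rng = random.Random(seed)
    for _ in range(repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        # Deal the shuffled indices into k nearly equal folds.
        folds = [indices[i::k] for i in range(k)]
        for i in range(k):
            test_idx = folds[i]
            train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
            yield train_idx, test_idx


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def cross_validated_mae(targets, k, repeats=1, seed=0):
    """Return (mean, std) of the per-fold test MAE for a placeholder model."""
    fold_maes = []
    for train_idx, test_idx in repeated_kfold(len(targets), k, repeats, seed):
        # Placeholder "model" (an assumption): predict the training-set mean.
        prediction = statistics.fmean([targets[j] for j in train_idx])
        y_true = [targets[j] for j in test_idx]
        fold_maes.append(mean_absolute_error(y_true, [prediction] * len(y_true)))
    return statistics.fmean(fold_maes), statistics.stdev(fold_maes)


# Synthetic targets, purely illustrative (not OQMD data).
data_rng = random.Random(42)
targets = [data_rng.gauss(0.0, 1.0) for _ in range(200)]

# k=10 with a single repeat mirrors the 10-fold setting of Table 5;
# repeats=2 with k=5 would mirror the 2-repeated 5-fold benchmark setting.
mean_mae, std_mae = cross_validated_mae(targets, k=10)
print(f"MAE: {mean_mae:.4f} +/- {std_mae:.4f}")
```

Reusing the same seed for every representation keeps the train/test folds identical across runs, which is the property the captions rely on when comparing results.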

In the benchmark regression and classification task results, no single atom vector or pooling method clearly dominates. The 200-dimensional representations generally perform better than the smaller 86-dimensional representations. Though not evident from Tables 3 and 4, sum- and mean-pooling outperform max-pooling (see the Supplementary Information tables). The pooled Mat2Vec representations are notable in that they achieve the best results in 4 of the 8 benchmark tasks, while pooled SkipAtom representations are best in 2 of the 8 benchmark tasks. Pooled Random vectors tend to under-perform, though not always by a large margin. This is perhaps unsurprising: random vectors become quasi-orthogonal as their dimensionality increases, and may therefore have the same functional characteristics as one-hot vectors [20]. On the OQMD Formation Energy prediction task, the Bag-of-Atoms representation yields the best results, significantly outperforming both the distributed representations and the mean-pooled one-hot representation used in the ElemNet paper, which introduced the task.
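The quasi-orthogonality of random vectors can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, and the vectors are drawn from a standard normal distribution and normalized, which is an assumption rather than the procedure used to generate the Random vectors in this work. It measures the average absolute cosine similarity between 86 random unit vectors at several dimensionalities. One-hot vectors are exactly orthogonal, so the corresponding value for them is zero.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a Gaussian random vector and scale it to unit length."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def mean_abs_cosine(num_vectors, dim, seed=0):
    """Average |cosine similarity| over all distinct pairs of random unit vectors."""
    rng = random.Random(seed)
    vectors = [random_unit_vector(dim, rng) for _ in range(num_vectors)]
    total, count = 0.0, 0
    for i in range(num_vectors):
        for j in range(i + 1, num_vectors):
            # For unit vectors the dot product equals the cosine similarity.
            total += abs(sum(a * b for a, b in zip(vectors[i], vectors[j])))
            count += 1
    return total / count


# 86 vectors, one per atom type; the dimensionalities are illustrative.
for dim in (2, 20, 86, 200):
    print(dim, round(mean_abs_cosine(86, dim), 3))
```

The average overlap shrinks as the dimensionality grows, so a set of high-dimensional random vectors increasingly behaves like a set of distinct, nearly orthogonal labels.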

A noteworthy aspect of these results is how the pooled atom vector representations compare to the published state-of-the-art values on the 8 benchmark tasks from the Matbench test suite. Figure 5 depicts this comparison. The models described in this report outperform the existing benchmarks on tasks where only composition is available (namely, the Experimental Band Gap, Bulk Metallic Glass Formation, and Experimental Metallicity tasks), and represent new state-of-the-art results. Also, on the Theoretical Metallicity task and the Refractive Index task, the pooled SkipAtom, Mat2Vec and one-hot vector representations perform comparably to the existing benchmarks, despite making use of composition information only.

Figure 5: A comparison between the results of the methods described in the current work and existing state-of-the-art results on benchmark tasks. TBG refers to the Theoretical Band Gap task (MAE in eV), BM to the Bulk Modulus task (MAE in log(GPa)), SM to the Shear Modulus task (MAE in log(GPa)), RI to the Refractive Index task (MAE in ), and TM to the Theoretical Metallicity task (ROC-AUC). These tasks make use of structure information. EBG refers to the Experimental Band Gap task (MAE in eV), BMGF to the Bulk Metallic Glass Formation task (ROC-AUC), and EM to the Experimental Metallicity task (ROC-AUC). These tasks make use of composition only. The results that are outlined in bold represent the best score for that task. Italicized results represent a new state-of-the-art result. As described in the Protocols section of this report, the same methodology was used to obtain the results for all of the algorithms in the table.

4 Conclusions

NLP researchers have learned many lessons regarding the computational representations of words and sentences. It could be fruitful for computational materials scientists to borrow techniques from computational linguistics. Above, we have described how making an analogy between words and sentences, and atoms and compounds, allowed us to borrow both a means of learning atom representations, and a means of forming compound representations by pooling operations on atom vectors. Consequently, we draw the following conclusions: i) effective computational descriptors of atoms can be derived from freely available and growing materials databases; ii) effective computational descriptors of compounds can be easily constructed by straightforward pooling operations on the atom vectors of the constituent atoms; iii) properties of materials can often be accurately predicted even where only chemical composition information is available.
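The pooling operations referred to in conclusion ii) can be sketched in a few lines. This is a minimal illustration, not the released implementation: the `pool` function is a hypothetical helper, it assumes integer atom counts per formula unit, and the "distributed" vectors are random placeholders standing in for learned atom vectors. With one-hot atom vectors, sum-pooling yields the atom counts (the Bag-of-Atoms representation) and mean-pooling yields the atomic fractions (the ElemNet-style input).

```python
import random


def pool(formula, atom_vectors, mode="sum"):
    """Pool atom vectors over every atom in `formula` (a dict of element -> count)."""
    # Repeat each element's vector once per atom in the formula unit.
    rows = [atom_vectors[el] for el, n in formula.items() for _ in range(n)]
    columns = list(zip(*rows))
    if mode == "sum":
        return [sum(c) for c in columns]
    if mode == "mean":
        return [sum(c) / len(c) for c in columns]
    if mode == "max":
        return [max(c) for c in columns]
    raise ValueError(f"unknown pooling mode: {mode}")


# One-hot vectors over a toy vocabulary of three atom types.
elements = ["Ba", "Ti", "O"]
one_hot = {el: [1 if i == j else 0 for j in range(len(elements))]
           for i, el in enumerate(elements)}
batio3 = {"Ba": 1, "Ti": 1, "O": 3}

print(pool(batio3, one_hot, "sum"))   # [1, 1, 3]
print(pool(batio3, one_hot, "mean"))  # [0.2, 0.2, 0.6]
print(pool(batio3, one_hot, "max"))   # [1, 1, 1]

# Placeholder distributed vectors (random, NOT trained embeddings).
rng = random.Random(0)
distributed = {el: [rng.uniform(-1.0, 1.0) for _ in range(4)] for el in elements}
compound_vector = pool(batio3, distributed, "sum")
print(len(compound_vector))  # 4
```

Whatever the number of atoms in the formula, the pooled compound vector has the same dimensionality as the atom vectors, which is what allows a fixed-size feed-forward network to consume it.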

The SkipAtom representation can be derived from a dataset of readily accessible compound structures. Moreover, the training process is lightweight enough that it can be performed on a good quality laptop on a scale of minutes to several hours (given the atom pairs). This highlights some important differences between SkipAtom and Atom2Vec and Mat2Vec. Training of the Mat2Vec representation requires the curation of millions of journal abstracts, and a subsequent classification step for retaining only the most relevant abstracts. Additionally, pre-processing of the tokens in the text must be carried out to identify valid chemical formulae through the use of custom rules and regular expressions. On the other hand, since SkipAtom makes direct use of the information in materials databases, no special pre-processing of the chemical information is required. Although the procedures for creating Mat2Vec and SkipAtom vectors have been incorporated into publicly available software libraries, the conceptually simpler SkipAtom approach leaves little room for ambiguity that might result from manually written chemical information extraction rules. When compared to Atom2Vec, a principal difference is that SkipAtom vectors are not limited in size by the number of atom types available. This allows larger SkipAtom vectors to be trained, and, as is evident from the results described above, larger vectors generally perform better on tasks. Overall, we believe SkipAtom is a more accessible tool for computational materials scientists, allowing them to readily train expressive atom vectors on chemical databases of their choosing, and to take advantage of the growing information in these databases over time.

The ElemNet architecture demonstrated that the incorporation of composition information alone could result in good performance when predicting chemical properties. In this work, we have extended that result, and shown how such an approach performs in a variety of different tasks. Perhaps surprisingly, the combination of a deep feed-forward neural network with compound representations consisting of composition information alone results in competitive performance when compared to approaches that make use of structural information. We believe this is a valuable insight, since high-throughput screening endeavours, in the search for new materials with desired properties, often target areas of chemical space where only composition is known. We envision performing large sweeps of chemical space in relatively short periods of time, since structural characteristics of the compounds would not need to be computed, and only composition would be used. The results presented here could motivate more extensive and computationally cheaper screening.

Going forward, there are a number of different avenues that can be explored. First, the atom vectors generated using the SkipAtom approach can be explored in different contexts, such as in combination with structural information. For example, graph neural networks, such as the MEGNet architecture [4], can accept as input any atom representation one chooses. It would be interesting to see if starting with pre-trained SkipAtom vectors could improve the performance of these models, where structure information is also incorporated. Alternatively, chemical compound vectors formed by pooling SkipAtom vectors can be directly concatenated with vectors that contain structure information, thus complementing the pooled atom vectors with more information. A candidate for encoding structure information is the Coulomb Matrix (in vectorized form), a descriptor which encodes the electrostatic interactions between atomic nuclei [37]. Finally, one limitation of the SkipAtom approach is that it does not provide representations of atoms in different oxidation states. Since it is often possible to unambiguously infer the oxidation states of atoms in compounds, it is, in principle, possible to construct a SkipAtom training set of pairs of atoms in different oxidation states. The number of atom types would increase severalfold, but would still be within limits that allow for efficient training. It would be interesting to explore the results of forming compound representations using such vectors for atoms in various oxidation states.
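The concatenation idea can be made concrete with a small sketch. It uses the molecular form of the Coulomb Matrix from [37], with diagonal entries 0.5 Z_i^2.4 and off-diagonal entries Z_i Z_j / |R_i - R_j|. Everything else is an assumption made for illustration: the function names are hypothetical, the three-atom geometry and the pooled vector are made-up placeholders, and periodic crystals would call for a periodic variant of the descriptor rather than this finite-cluster form.

```python
import math


def coulomb_matrix(atomic_numbers, positions):
    """Coulomb Matrix for a finite set of atoms (positions in atomic units)."""
    n = len(atomic_numbers)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                # Diagonal term: 0.5 * Z_i ** 2.4
                matrix[i][j] = 0.5 * atomic_numbers[i] ** 2.4
            else:
                # Off-diagonal term: nuclear repulsion Z_i * Z_j / |R_i - R_j|
                distance = math.dist(positions[i], positions[j])
                matrix[i][j] = atomic_numbers[i] * atomic_numbers[j] / distance
    return matrix


def vectorize_upper_triangle(matrix):
    """Flatten the upper triangle (diagonal included) of a symmetric matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i, n)]


def combine(pooled_vector, structure_vector):
    """Concatenate a pooled composition vector with a structure vector."""
    return list(pooled_vector) + list(structure_vector)


# Hypothetical three-atom geometry (illustrative coordinates only).
atomic_numbers = [8, 1, 1]
positions = [(0.0, 0.0, 0.0), (0.0, 1.43, 1.11), (0.0, -1.43, 1.11)]

# Placeholder pooled atom vector (NOT a trained SkipAtom embedding).
pooled = [0.1, -0.3, 0.7, 0.2]

cm = coulomb_matrix(atomic_numbers, positions)
structure_vector = vectorize_upper_triangle(cm)
combined = combine(pooled, structure_vector)
print(len(pooled), len(structure_vector), len(combined))  # 4 6 10
```

Because the size of the Coulomb Matrix depends on the number of atoms, a fixed-size model input would also require padding the matrix to a maximum atom count and choosing a consistent atom ordering; both steps are omitted here.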

5 Code Availability

The code for creating and using the SkipAtom vectors is open source, released under the GNU General Public License v3.0. The code repository is accessible online, at:

The repository also contains pre-trained 200-dimensional SkipAtom vectors for 86 atom types that can be immediately used in materials informatics projects.

6 Data Availability

The data that support the findings of this study are available as follows: The materials data that was used to learn the SkipAtom embeddings are publicly available online at https://materialsproject.org/. The elpasolite formation energy training data are publicly available online at https://journals.aps.org/prl/abstract/10.1103/PhysRevLett.117.135502, in the Supplemental Material section. The datasets comprising the Matbench tasks are publicly available at https://hackingmaterials.lbl.gov/automatminer/datasets.html. The Mat2Vec pre-trained embeddings are publicly available online and can be downloaded by following the instructions at https://github.com/materialsintelligence/mat2vec. The Atom2Vec embeddings are publicly available online and can be obtained from

7 Author Contributions

L.M.A. conceived the project, designed and performed the experiments, and drafted the manuscript. R.G.-C. and K.T.B. supervised and guided the project. All authors reviewed, edited and approved the manuscript.

8 Competing Interests

The authors declare no competing interests.


References

  • [1] N. Artrith (2019) Machine learning for the modeling of interfaces in energy storage and conversion materials. Journal of Physics: Energy 1 (3), pp. 032002.
  • [2] K. T. Butler, D. W. Davies, H. Cartwright, O. Isayev, and A. Walsh (2018) Machine learning for molecular and materials science. Nature 559 (7715), pp. 547–555.
  • [3] S. K. Chakravarti (2018) Distributed Representation of Chemical Fragments. ACS Omega 3 (3), pp. 2825–2836.
  • [4] C. Chen, W. Ye, Y. Zuo, C. Zheng, and S. P. Ong (2019) Graph Networks as a Universal Machine Learning Framework for Molecules and Crystals. Chemistry of Materials 31 (9), pp. 3564–3572.
  • [5] D. W. Davies, K. T. Butler, A. J. Jackson, A. Morris, J. M. Frost, J. M. Skelton, and A. Walsh (2016) Computational screening of all stoichiometric inorganic materials. Chem 1 (4), pp. 617–627.
  • [6] D. W. Davies, K. T. Butler, and A. Walsh (2019) Data-driven discovery of photoactive quaternary oxides using first-principles machine learning. Chemistry of Materials 31 (18), pp. 7221–7230.
  • [7] M. De Jong, W. Chen, T. Angsten, A. Jain, R. Notestine, A. Gamst, M. Sluiter, C. K. Ande, S. Van Der Zwaag, J. J. Plata, et al. (2015) Charting the complete elastic properties of inorganic crystalline compounds. Scientific Data 2 (1), pp. 1–13.
  • [8] F. J. DiSalvo (2000) Challenges and opportunities in solid-state chemistry. Pure and Applied Chemistry 72 (10), pp. 1799–1807.
  • [9] C. Duan, F. Liu, A. Nandy, and H. J. Kulik (2021) Putting density functional theory to the test in machine-learning-accelerated materials discovery. The Journal of Physical Chemistry Letters 12, pp. 4628–4637.
  • [10] A. Dunn, Q. Wang, A. Ganose, D. Dopp, and A. Jain (2020) Benchmarking materials property prediction methods: the matbench test set and automatminer reference algorithm. npj Computational Materials 6 (1), pp. 1–10.
  • [11] F. A. Faber, A. Lindmaa, O. A. Von Lilienfeld, and R. Armiento (2016) Machine Learning Energies of 2 Million Elpasolite (ABC2D6) Crystals. Physical Review Letters 117 (13), pp. 135502.
  • [12] F. Faber, A. Lindmaa, O. A. von Lilienfeld, and R. Armiento (2015) Crystal structure representations for machine learning models of formation energies. International Journal of Quantum Chemistry 115 (16), pp. 1094–1101.
  • [13] B. R. Goldsmith, J. Esterhuizen, J. Liu, C. J. Bartel, and C. Sutton (2018) Machine Learning for Heterogeneous Catalyst Design and Discovery. AIChE Journal 64 (7), pp. 2311–2323.
  • [14] R. E. Goodall and A. A. Lee (2020) Predicting materials properties without crystal structure: Deep representation learning from stoichiometry. Nature Communications 11 (1), pp. 1–9.
  • [15] L. Himanen, A. Geurts, A. S. Foster, and P. Rinke (2019) Data‐Driven Materials Science: Status, Challenges, and Perspectives. Advanced Science 6 (21), pp. 1900808.
  • [16] G. E. Hinton, T. J. Sejnowski, et al. (1999) Unsupervised Learning: Foundations of Neural Computation. MIT Press.
  • [17] A. Jain, S. P. Ong, G. Hautier, W. Chen, W. D. Richards, S. Dacek, S. Cholia, D. Gunter, D. Skinner, G. Ceder, et al. (2013) Commentary: The Materials Project: A materials genome approach to accelerating materials innovation. APL Materials 1 (1), pp. 011002.
  • [18] A. Jain, S. P. Ong, G. Hautier, W. Chen, W. D. Richards, S. Dacek, S. Cholia, D. Gunter, D. Skinner, G. Ceder, and K. A. Persson (2013) The Materials Project: A materials genome approach to accelerating materials innovation. APL Materials 1 (1), pp. 011002.
  • [19] D. Jha, L. Ward, A. Paul, W. Liao, A. Choudhary, C. Wolverton, and A. Agrawal (2018) ElemNet: Deep Learning the Chemistry of Materials From Only Elemental Composition. Scientific Reports 8 (1), pp. 1–13.
  • [20] P. C. Kainen and V. Kurkova (2020) Quasiorthogonal Dimension. In Beyond Traditional Probabilistic Data Processing Techniques: Interval, Fuzzy etc. Methods and Their Applications, pp. 615–629.
  • [21] Y. Kawazoe (1997) Nonequilibrium Phase Diagrams of Ternary Amorphous Alloys. LB: New Ser., Group III: Condensed 37, pp. 1–295.
  • [22] D. P. Kingma and J. Ba (2014) Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • [23] Q. Le and T. Mikolov (2014) Distributed Representations of Sentences and Documents. In International Conference on Machine Learning, pp. 1188–1196.
  • [24] Y. LeCun, Y. Bengio, and G. Hinton (2015) Deep learning. Nature 521 (7553), pp. 436–444.
  • [25] B. Meredig, A. Agrawal, S. Kirklin, J. E. Saal, J. Doak, A. Thompson, K. Zhang, A. Choudhary, and C. Wolverton (2014) Combinatorial screening for new materials in unconstrained composition space with machine learning. Physical Review B 89 (9), pp. 094104.
  • [26] S. D. Midgley, S. Hamad, K. T. Butler, and R. Grau-Crespo (2021) Bandgap engineering in the configurational space of solid solutions via machine learning: (Mg,Zn)O case study. The Journal of Physical Chemistry Letters 12, pp. 5163–5168.
  • [27] T. Mikolov, K. Chen, G. Corrado, and J. Dean (2013) Efficient Estimation of Word Representations in Vector Space. arXiv preprint arXiv:1301.3781.
  • [28] J. Mitchell and M. Lapata (2008) Vector-based Models of Semantic Composition. In Proceedings of ACL-08: HLT, pp. 236–244.
  • [29] H. Moss, D. Leslie, and P. Rayson (2018) Using J-K-fold Cross Validation To Reduce Variance When Tuning NLP Models. In Proceedings of the 27th International Conference on Computational Linguistics, pp. 2978–2989.
  • [30] S. P. Ong, S. Cholia, A. Jain, M. Brafman, D. Gunter, G. Ceder, and K. A. Persson (2015) The Materials Application Programming Interface (API): A simple, flexible and efficient API for materials data based on REpresentational State Transfer (REST) principles. Computational Materials Science 97, pp. 209–215.
  • [31] I. Petousis, D. Mrdjenovich, E. Ballouz, M. Liu, D. Winston, W. Chen, T. Graf, T. D. Schladt, K. A. Persson, and F. B. Prinz (2017) High-throughput screening of inorganic compounds for the discovery of novel dielectric and optical materials. Scientific Data 4 (1), pp. 1–12.
  • [32] M. T. Pilehvar and N. Collier (2016) De-Conflated Semantic Representations. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 1680–1690.
  • [33] M. T. Pilehvar and N. Collier (2017) Inducing Embeddings for Rare and Unseen Words by Leveraging Lexical Resources. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pp. 388–393.
  • [34] Z. Qiao, M. Welborn, A. Anandkumar, F. R. Manby, and T. F. Miller III (2020) OrbNet: deep learning for quantum chemistry using symmetry-adapted atomic-orbital features. The Journal of Chemical Physics 153 (12), pp. 124111.
  • [35] R. Ramprasad, R. Batra, G. Pilania, A. Mannodi-Kanakkithodi, and C. Kim (2017) Machine learning in materials informatics: recent applications and prospects. npj Computational Materials 3 (1), pp. 1–13.
  • [36] A. Raza, A. Sturluson, C. M. Simon, and X. Fern (2020) Message passing neural networks for partial charge assignment to metal–organic frameworks. The Journal of Physical Chemistry C 124 (35), pp. 19070–19082.
  • [37] M. Rupp, A. Tkatchenko, K. Müller, and O. A. Von Lilienfeld (2012) Fast and Accurate Modeling of Molecular Atomization Energies with Machine Learning. Physical Review Letters 108 (5), pp. 058301.
  • [38] J. E. Saal, S. Kirklin, M. Aykol, B. Meredig, and C. Wolverton (2013) Materials design and discovery with high-throughput density functional theory: the open quantum materials database (OQMD). JOM 65 (11), pp. 1501–1509.
  • [39] G. R. Schleder, A. C. Padilha, C. M. Acosta, M. Costa, and A. Fazzio (2019) From DFT to machine learning: recent approaches to materials science–a review. Journal of Physics: Materials 2 (3), pp. 032001.
  • [40] J. Schmidt, M. R. Marques, S. Botti, and M. A. Marques (2019) Recent advances and applications of machine learning in solid-state materials science. npj Computational Materials 5 (1), pp. 1–36.
  • [41] V. Tshitoyan, J. Dagdelen, L. Weston, A. Dunn, Z. Rong, O. Kononova, K. A. Persson, G. Ceder, and A. Jain (2019) Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571 (7763), pp. 95–98.
  • [42] L. Van der Maaten and G. Hinton (2008) Visualizing Data using t-SNE. Journal of Machine Learning Research 9 (11).
  • [43] L. Ward, A. Agrawal, A. Choudhary, and C. Wolverton (2016) A general-purpose machine learning framework for predicting properties of inorganic materials. npj Computational Materials 2 (1), pp. 1–7.
  • [44] Q. Zhou, P. Tang, S. Liu, J. Pan, Q. Yan, and S. Zhang (2018) Learning atoms for materials discovery. Proceedings of the National Academy of Sciences 115 (28), pp. E6411–E6417.
  • [45] Y. Zhuo, A. Mansouri Tehrani, and J. Brgoch (2018) Predicting the Band Gaps of Inorganic Solids by Machine Learning. The Journal of Physical Chemistry Letters 9 (7), pp. 1668–1673.
  • [46] A. Zunger (2018) Inverse design in search of materials with target functionalities. Nature Reviews Chemistry 2 (4), pp. 1–16.