Fast classification of small X-ray diffraction datasets using data augmentation and deep neural networks

11/20/2018 ∙ by Felipe Oviedo, et al. ∙ MIT

X-ray diffraction (XRD) for crystal structure characterization is among the most time-consuming and complex steps in the development cycle of novel materials. We propose a machine-learning-enabled approach to predict crystallographic dimensionality and space group from a limited number of experimental thin-film XRD patterns. We overcome the sparse-data problem intrinsic to novel materials development by coupling a supervised machine-learning approach with a physics-based data augmentation strategy. Using this approach, XRD spectrum acquisition and analysis take less than 5.5 minutes, with accuracy comparable to human expert labeling. We simulate experimental powder diffraction patterns from crystallographic information contained in the Inorganic Crystal Structure Database (ICSD). We train a classification algorithm using a combination of labeled simulated and experimental augmented datasets, which account for thin-film characteristics and measurement noise. As a test case, 88 metal-halide thin films spanning 3 dimensionalities and 7 space-groups are synthesized and classified. The accuracies and throughputs of multiple machine-learning techniques are evaluated, along with the effect of augmented dataset size. The most accurate classification algorithm is found to be a feed-forward deep neural network. The calculated accuracies for dimensionality and space-group classification are comparable to ground-truth labelling by a human expert, approximately 90% and 85%, respectively. Additionally, we systematically evaluate the maximum XRD spectrum step size (data acquisition rate) before loss of predictive accuracy occurs, and determine it to be 0.16° in 2θ, which enables an XRD spectrum to be obtained and analyzed in 5 minutes or less.




1 Introduction

High-throughput material synthesis and rapid characterization are necessary ingredients for inverse design and accelerated material discovery [1], [2]. X-ray diffraction (XRD) is a workhorse technique to determine crystallography and phase information, including lattice parameters, crystal symmetry, phase composition, density, space-group, and dimensionality [3]. This is achieved by mapping XRD patterns for a novel material to the measured or simulated XRD patterns of known materials [4]. Despite its indispensable utility, XRD is a common bottleneck in materials-characterization loops. Up to one hour is typically required for thin-film XRD data acquisition for a scan with high angular resolution, and another one to two hours are typically required for Rietveld refinement by an expert crystallographer when the crystalline phases are known. It is widely recognized that machine learning methods have the potential to accelerate this process. However, practical implementations have thus far focused on well-established materials [5]–[7], required combinatorial datasets spanning various phases [8], [9], or required large datasets [5], [10], whereas research using the inverse design paradigm often involves less-studied materials, smaller datasets, and mixed-phase early-stage prototypes. In this study, we apply machine learning methods to the XRD characterization of early-stage materials, using a data augmentation strategy that simulates thin-film diffraction patterns from crystallography data. We also employ faster data acquisition without sacrificing quality of information, and compare the average accuracy of different machine learning techniques.

Typically, raw XRD pattern data is simplified by obtaining spectral descriptors like peak shape, height and position, using peak profile functions like pseudo-Voigt or more elaborate continuous spectrum methods like Rietveld refinement [4], [11]. In the latter, multiple full-profile spectral descriptors are matched to descriptors from known XRD patterns, allowing the identification of the structural properties of the material [4]. For novel compounds, the crystalline structure is commonly unknown, limiting the efficacy of Rietveld refinement. The direct-space method, statistical methods, and the growth of single crystals have been used to obtain crystal symmetry information for novel materials [7], [12]–[15], but the significant iteration time, feature engineering, human expertise, and material-specific knowledge they require make these methods impractical for high-throughput experimentation, where sample characterization rates are of the order of one material per minute or faster [2], [16] and span various material families.

An alternative approach consists of using machine learning methods to obtain more robust spectral descriptors and quickly classify crystalline structure based on the peak periodicity in the XRD pattern. A successful method [10] uses convolutional neural networks (CNN) trained with hundreds of thousands of XRD powder patterns simulated with data from the Inorganic Crystal Structure Database (ICSD). Further CNN and other deep-learning-based methods have been employed to obtain crystalline information from other kinds of diffraction data [17], [18]. In a couple of studies, noise-based data augmentation has been used to avoid overfitting in a broader class of X-ray characterization problems [18], [19]; however, the augmentation procedure has not been based on physical knowledge.

While similar approaches produce good results for crystal structure classification, we have found that applying them to high-throughput characterization of novel solution-processed compounds is generally not practical, given the limited access to large datasets of clean, pre-processed, relevant XRD spectra. Furthermore, most materials of interest developed in high-throughput synthesis loops are thin-film materials. The preferred orientation of the crystalline planes in thin films causes their experimental XRD patterns to differ from the simulated XRD powder patterns available in most databases [20], [21]. Thin-film compounds usually present spectrum shifting and periodic scaling of peaks in preferred orientations, reducing the accuracy of machine learning models trained with powder data [8], [9], [22].

Considering these challenges, we propose a supervised machine learning approach for rapid crystal structure identification of novel materials from thin-film XRD measurements. For this work, we created a library of 96 XRD patterns of thin-film halide materials extracted from the >100,000 compounds available in the ICSD [23]; these 96 XRD patterns include lead-halide perovskite [24], [25] and lead-free perovskite-inspired materials [26]. These XRD patterns were manually classified by crystal dimensionality. Based on this small dataset of relevant XRD powder patterns extracted from ICSD and an additional 88 experimental XRD patterns, we perform physics-based data augmentation to generate a suitable and robust training set for thin-film materials, and subsequently test the space-group and dimensionality classification accuracy of multiple machine learning algorithms. A feed-forward deep neural network [27], [28] is identified as the most accurate classifier for this problem. Subsequently, the effects of the augmented dataset size and the XRD pattern granularity are investigated.

2 Methods

2.1 Framework

Figure 1: Schematic of our X-ray diffraction data classification framework, with physics-based data augmentation.

The framework developed for rapid classification of XRD thin-film patterns according to crystal descriptors is shown in Figure 1. A simulated dataset is defined by extracting crystal structure information from ICSD as explained in Section 2.3. The experimental dataset consists of a set of synthesized samples, which are manually labelled for training and testing purposes. The datasets are subjected to data augmentation based on the three spectral transformations shown in Figure 1. The methodology makes use of both experimental and simulated XRD patterns to train a machine learning classification algorithm.

In this specific study, the framework relies on the relation between XRD periodicity and the crystal descriptors of interest: dimensionality and space-group. For example, among perovskite-inspired materials for photovoltaic applications, 3D cubic lead halide perovskites of multiple compositions show distinct features in XRD pattern periodicity compared to 2D layered bismuth perovskites [29], [30].

The crystal descriptors of interest in this work, space-group and crystal dimensionality, are chosen because of their importance for material screening in accelerated material development. In many inorganic material systems, the crystalline dimensionality (i.e., a generalization of the crystalline symmetry into 0-dimensional (0D), 1D, 2D or 3D symmetry) constitutes a figure of merit for experimental material screening, as it correlates with observed charge-transport properties [31]. In perovskites and perovskite-inspired materials, for instance, 3D crystalline structures have been shown to have good carrier-transport properties for solar cell and LED applications [31], [32], while 3D-2D mixtures have been found to have greater stability in lead halide perovskites than pure-phase 3D crystals [33]. In more detail, the space-group number describes the standardized symmetry group of a configuration in space, classifying crystal symmetries into 230 groups. Identifying the space-group number of a sample provides crystal information beyond dimensionality, including atomic bonding angles and relative distances, which are believed to be of importance for predicting material properties [34].

The powder XRD pattern (pXRD) is commonly used to identify the space-group through Rietveld refinement, but the compression of three-dimensional crystallographic information into a one-dimensional diffraction pattern makes the space-group impossible to determine unambiguously in certain low-symmetry phases, independently of the measurement technique [35]. In this work, the space-groups of interest can be determined from pXRD information alone.

To better account for measurement noise and the physical difference between randomly oriented powder patterns and experimental thin-film patterns, the patterns were subjected to a process of data augmentation based on domain knowledge, as explained in Section 2.4. Subsequently, both augmented experimental and simulated XRD pattern datasets are used for training, testing and cross validation of machine learning algorithms. The classification accuracy of each method under 5-fold cross validation is presented in Section 3. The sensitivity of the model to the size of the augmented dataset is also considered in Section 3.

2.2 Experimental measurement and labelling of XRD patterns

The experimental dataset consists of diffraction patterns of 88 compounds. For this work, perovskite-inspired 3D materials based on lead halide perovskites (space-group: ), tin halide perovskite (), cesium silver bismuth bromide double perovskite (), bismuth and antimony halide 2D (, , ), and 0D () perovskite-inspired materials are synthesized and used as the training and testing dataset. The synthesis and characterization methodology is described in our parallel study [36].

The XRD patterns for each sample are obtained by using an X-ray powder diffraction Rigaku SmartLab system [37] with angle from to with a step size of . The tool is configured in a symmetric setup. To define the ground-truth label for each sample, the XRD patterns are subjected to peak indexing, and the dimensionality and space group are confirmed from information contained in ICSD. Because the samples are thin films, their preferred orientation restricts the use of Rietveld refinement. Therefore, in this study, the space group and lattice parameters in ICSD were used as a reference to confirm the synthesized crystal structure.

We pre-process the raw XRD patterns to reduce the experimental noise and the background signal. For this purpose, the background signal is estimated and subtracted along the 2θ axis, and the spectrum is smoothed with a Savitzky-Golay filter [38], which conserves the peak width and relative peak size.
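The pre-processing described above can be illustrated with a minimal NumPy sketch. The text does not say how the background is estimated, so the rolling-minimum estimate here is purely an assumption, as are the window lengths, the polynomial order, and the function names `savgol_coeffs` and `preprocess`. The Savitzky-Golay step is written out as a local least-squares polynomial fit so that the example stays self-contained.

```python
import numpy as np

def savgol_coeffs(window, polyorder):
    """Savitzky-Golay smoothing weights for the centre of an odd-length window."""
    half = window // 2
    offsets = np.arange(-half, half + 1)
    # Columns are offsets**0, offsets**1, ..., offsets**polyorder.
    design = np.vander(offsets, polyorder + 1, increasing=True)
    # Row 0 of the pseudo-inverse yields the fitted polynomial value at offset 0.
    return np.linalg.pinv(design)[0]

def preprocess(intensity, window=11, polyorder=3, bg_window=101):
    """Subtract a background estimate, then smooth the pattern.

    The rolling-minimum background and all default parameters are
    illustrative assumptions, not values taken from the study.
    """
    intensity = np.asarray(intensity, dtype=float)
    # Assumed background estimate: minimum over a wide sliding window.
    half_bg = bg_window // 2
    padded = np.pad(intensity, half_bg, mode="edge")
    background = np.array(
        [padded[i:i + bg_window].min() for i in range(intensity.size)]
    )
    corrected = intensity - background
    # Savitzky-Golay smoothing as a sliding weighted sum.
    coeffs = savgol_coeffs(window, polyorder)
    padded = np.pad(corrected, window // 2, mode="edge")
    smoothed = np.convolve(padded, coeffs[::-1], mode="valid")
    # Intensities cannot be negative.
    return np.clip(smoothed, 0.0, None)
```

A polynomial order of 2 or higher is what lets the filter follow the curvature of a peak, so narrow peaks are flattened less than they would be by a moving average of the same window length.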

2.3 Data mining and simulation of XRD patterns

The simulated training dataset consists of 96 compounds extracted from ICSD with a similar composition, expected crystal symmetry, and space-group as the synthesized materials of interest. All the possible single, double, ternary, and quaternary combinations of the elements of interest were extracted during database mining.

The fundamental crystal descriptors extracted from the material database are used to simulate XRD random powder patterns. The simulations are carried out with Panalytical Highscore v4.7 software based on the Rietveld algorithm implementation by Hill and Howard [39], [40]. The unit cell lattice parameters, atomic coordinates, atomic displacement parameters, and space group information are considered for the structure factor calculation in the Rietveld model.
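To illustrate how lattice parameters determine peak positions in a simulated powder pattern, here is a minimal sketch based on Bragg's law for a cubic unit cell. It is not the Rietveld model used by the simulation software: it ignores structure factors and peak profiles, and it assumes Cu Kα radiation (1.5406 Å), which is not stated in the text. The function name `cubic_two_theta` and the lattice parameter in the usage note are hypothetical.

```python
import math

CU_K_ALPHA = 1.5406  # X-ray wavelength in angstrom (assumed, not from the text)

def cubic_two_theta(a, hkl, wavelength=CU_K_ALPHA):
    """Return the 2-theta angle (degrees) of reflection hkl for a cubic cell.

    a is the cubic lattice parameter in angstrom. Returns None when the
    reflection cannot satisfy Bragg's law at this wavelength.
    """
    h, k, l = hkl
    # Interplanar spacing of a cubic lattice.
    d = a / math.sqrt(h * h + k * k + l * l)
    # Bragg's law: wavelength = 2 * d * sin(theta).
    sin_theta = wavelength / (2.0 * d)
    if sin_theta > 1.0:
        return None
    return 2.0 * math.degrees(math.asin(sin_theta))
```

For a hypothetical cubic cell with a = 6.3 Å, `cubic_two_theta(6.3, (1, 0, 0))` gives a peak slightly above 14° in 2θ, and higher-order reflections of the same plane family appear at larger angles.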

2.4 Data Augmentation based on domain knowledge

Figure 2: Schematic of the physics-based data augmentation strategy, which accounts for the particularities of thin-film XRD spectra, as described in Eqs. 1–3.

To increase the size and robustness of the training dataset and to account for fundamental differences between real thin-film and simulated powder XRD spectra, we perform data augmentation based on physical domain knowledge, as summarized with representative patterns in Figure 2. Due to expansions and contractions in the crystalline lattice, XRD peaks shift along the 2θ axis according to the specific size and location of the different elements present in a compound, while maintaining similar periodicity based on crystal space-group [3], [8], [22], [41]. In addition, for thin-film samples, the XRD pattern can be shifted due to strain in the film induced during the fabrication process [42].

Polycrystalline thin films are known to have preferred orientations along certain crystallographic planes. The preferred orientation is influenced by the crystal growth process and substrate [43], and is common for most solution-processing and vapor-deposition fabrication methods. Ideal random powders contain multiple grains without any preferred global orientation, so all crystallographic orientations are represented evenly in the peak intensity and periodicity of the XRD pattern. As a consequence of their preferred orientations along crystallographic planes, thin-film XRD relative peak intensities are scaled up periodically in the preferred plane orientation, and scaled down periodically or even eliminated in the non-preferred orientations.

Here is a mathematical description of our data augmentation approach. Suppose we describe the series of peaks in an XRD pattern by a discrete function , which maps a set of discrete angles to positive real numbers corresponding to peak intensities. We augment the data through the following sequential process of transformations , and :

  1. Random peak scaling is applied periodically along the axis to account for different thin-film preferred orientations. A subset of random peaks at periodic angles is scaled by factor , such that:

  2. Random peak elimination (with a different randomly-selected than Eq. 1) is applied periodically along the axis, to account for different thin-film preferred orientations, such that:

  3. Spectrum shifting by small random value along the direction to allow for different material compositions and film strain conditions, such that:
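The three transformations listed above can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the authors' implementation: the pattern is assumed to be sampled on a uniform 2θ grid with known peak indices, "periodic" is interpreted as selecting every n-th peak from a random starting peak, and the scale range, shift range, and peak half-width are hypothetical values. The function names `periodic_subset` and `augment` are also hypothetical.

```python
import numpy as np

def periodic_subset(peak_idx, rng):
    """Pick every n-th peak (n = 2, 3 or 4) from a random starting peak."""
    period = int(rng.integers(2, 5))
    start = int(rng.integers(0, period))
    return peak_idx[start::period]

def augment(pattern, peak_idx, rng, scale_range=(0.5, 2.0),
            max_shift=5, half_width=3):
    """Return one augmented copy of a pattern sampled on a uniform grid.

    All numeric defaults are illustrative assumptions.
    """
    out = np.asarray(pattern, dtype=float).copy()
    peak_idx = np.asarray(peak_idx)

    # Step 1: scale a periodic subset of peaks by one random factor.
    factor = rng.uniform(*scale_range)
    for p in periodic_subset(peak_idx, rng):
        out[max(p - half_width, 0):p + half_width + 1] *= factor

    # Step 2: eliminate a second, independently drawn periodic subset.
    for p in periodic_subset(peak_idx, rng):
        out[max(p - half_width, 0):p + half_width + 1] = 0.0

    # Step 3: shift the whole pattern by a small random number of grid
    # points, filling the vacated end with zeros instead of wrapping around.
    shift = int(rng.integers(-max_shift, max_shift + 1))
    shifted = np.zeros_like(out)
    if shift > 0:
        shifted[shift:] = out[:-shift]
    elif shift < 0:
        shifted[:shift] = out[-shift:]
    else:
        shifted = out
    return shifted
```

Calling `augment` repeatedly with the same pattern and a shared `np.random.default_rng` generator yields a family of distinct training spectra derived from a single measured or simulated pattern.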


A sensitivity analysis of the training data size and its impact on the reported accuracy, for both the experimental and simulated dataset augmentations, is included in the Results and Discussion section.

2.5 Classification algorithms

Pre-processed experimental data and augmented simulated data are fed into various supervised machine learning algorithms for training and testing purposes. The best-performing algorithm is evaluated on the basis of model accuracy and speed. The XRD patterns are classified into 3 crystal dimensionalities (0D, 2D, and 3D) and 7 space-groups (, , , , , , ).

For this purpose, we represent the XRD pattern as either a vector or a time series. For each kind of data representation, different classification algorithms are considered. Using a vector representation of the XRD pattern, the following classification methods are tested: Naïve Bayes, k-Nearest Neighbors, Logistic Regression, Random Forest, Decision Trees, Support Vector Machine, Gradient Boosting and a Feed-Forward Neural Network (vanilla algorithms described in [27], [28], [44]). The XRD patterns are also analyzed as a time series with a normalized Dynamic Time Warping (DTW) distance metric [45] combined with a k-Nearest Neighbors classification algorithm; DTW has been reported in the literature as the most adequate metric for measuring similarity among metal-alloy XRD spectra [7], [22]. Hyperparameters were tuned for each method.
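As a rough illustration of the time-series approach, the sketch below implements a windowed DTW distance and a 1-nearest-neighbor classifier on top of it. The normalization (dividing by the sum of the two sequence lengths), the absolute-difference local cost, and the function names `dtw_distance` and `nearest_neighbor_dtw` are assumptions, since the text does not specify them.

```python
import numpy as np

def dtw_distance(a, b, window=None):
    """Normalized DTW distance between two 1-D sequences.

    window limits how far the warping path may stray from the diagonal;
    None means unconstrained.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    w = max(n, m) if window is None else max(window, abs(n - m))
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor paths.
            cost[i, j] = local + min(cost[i - 1, j],
                                     cost[i, j - 1],
                                     cost[i - 1, j - 1])
    # Assumed normalization: divide by an upper bound on the path length.
    return cost[n, m] / (n + m)

def nearest_neighbor_dtw(query, train_patterns, train_labels, window=None):
    """Label a query pattern with the label of its DTW-nearest training pattern."""
    distances = [dtw_distance(query, p, window) for p in train_patterns]
    return train_labels[int(np.argmin(distances))]
```

The nested loops make the cost quadratic in pattern length for an unconstrained window, which is why DTW becomes expensive once each query must be compared against thousands of augmented patterns.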

To avoid inflated metrics caused by class imbalance, in this work we used a class-balanced accuracy metric defined as the macro-average of recall scores per class. We measure the average class-balanced accuracy of the dimensionality and space-group classification methods based on three different approaches of splitting the training and testing datasets:

  • Case 1: Exclusively simulated XRD patterns are used for testing and training. 5-fold cross validation is performed to estimate the accuracy of each classification technique.

  • Case 2: The simulated XRD patterns are used for training, and the experimental patterns for known materials are used for testing.

  • Case 3: All of the simulated data and 80% of the experimental data are used for training, and 20% of the experimental data are used for testing. 5-fold cross validation is performed to estimate the accuracy of each classification technique.
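The class-balanced accuracy metric and the Case 3 split described above can be sketched as follows. The metric follows the stated definition (macro-average of per-class recall). The fold generator is only an assumption about how the 80%/20% experimental split might be produced; the simulated data, which are always part of the training set in Case 3, are left out for brevity. The function names `balanced_accuracy` and `case3_folds` are hypothetical.

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """Macro-average of per-class recall."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))

def case3_folds(n_experimental, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) index arrays over the experimental patterns.

    Each fold holds out roughly 20% of the experimental patterns for testing.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_experimental)
    for test_idx in np.array_split(order, n_folds):
        train_idx = np.setdiff1d(order, test_idx)
        yield train_idx, test_idx
```

With four samples of one class and one of another, a classifier that always predicts the majority class scores 80% plain accuracy but only 50% class-balanced accuracy, which is the inflation this metric is meant to avoid.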

3 Results and Discussion

Each of the training/testing cases described in Section 2.5 is tested for crystal dimensionality and space-group prediction accuracy. The results, in addition to the average run time of each algorithm, are reported in Table 1. In each cell, the percentage accuracy for crystal dimensionality classification is reported first, followed by the accuracy for space-group classification. Case 1, which presents 5-fold cross validation results on the simulated dataset, has the highest accuracy for both crystal descriptors, as it does not predict any experimental data and is thus free of experimental errors. Case 2 predicts experimental patterns based solely on simulated patterns, and thus has the lowest accuracy. Finally, Case 3 has a significantly higher accuracy than Case 2 for both crystal dimensionality and space-group prediction.

Table 1: Reported accuracy and run time after 5-fold cross validation for each machine learning classification technique.

In general, the model’s accuracy is lower for space-group classification than for crystal dimensionality classification. This discrepancy is caused by the lower number of labelled examples per class for space-group classification compared to crystal dimensionality classes. Class imbalance can also systematically affect the performance of the classifier. To check for this issue, we performed an oversampling test with synthetic training data according to [46], [47], and observed little difference in accuracy between the balanced and imbalanced datasets after 5-fold cross validation.

The use of experimental data as part of the training set increases the model accuracy and robustness. This can be explained by the high variability of experimental thin-film XRD patterns, even after data pre-processing. The relatively high accuracy obtained with a relatively small number of experimental samples (on the order of 10 to 100) confirms the potential of our data augmentation strategy to yield high predictive accuracies even with small datasets.

For all three test cases, the deep neural network (DNN) classifier performs better than any other classification technique. The DNN architecture used is a fully-connected feed-forward neural network with three hidden layers of 256 neurons each, using rectified linear unit (ReLU) activation functions for the hidden layers and Softmax for the output layer. The weights are optimized by stochastic gradient descent to minimize a standard log-loss function with L2 regularization [27], [28].
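For concreteness, the forward pass and regularized loss of such a network can be sketched in plain NumPy. This is only an illustration of the architecture described above (three hidden layers of 256 ReLU units and a softmax output); the weight initialization, the regularization strength `l2`, and the function names are assumptions, and the stochastic gradient descent update itself is omitted.

```python
import numpy as np

def init_params(n_inputs, n_classes, hidden=(256, 256, 256), seed=0):
    """Create (weights, bias) pairs for each layer; He-style init is assumed."""
    rng = np.random.default_rng(seed)
    sizes = (n_inputs, *hidden, n_classes)
    return [
        (rng.normal(0.0, np.sqrt(2.0 / fan_in), (fan_in, fan_out)),
         np.zeros(fan_out))
        for fan_in, fan_out in zip(sizes[:-1], sizes[1:])
    ]

def forward(params, X):
    """Return class probabilities for a batch of patterns X (one per row)."""
    h = X
    for W, b in params[:-1]:
        h = np.maximum(h @ W + b, 0.0)  # ReLU hidden layer
    W, b = params[-1]
    logits = h @ W + b
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)  # softmax output

def regularized_log_loss(params, X, y, l2=1e-4):
    """Mean log-loss plus an L2 penalty on the weights (l2 is hypothetical)."""
    probs = forward(params, X)
    log_loss = -np.mean(np.log(probs[np.arange(len(y)), y] + 1e-12))
    penalty = l2 * sum(np.sum(W ** 2) for W, _ in params)
    return log_loss + penalty
```

Here each row of `X` would be one XRD pattern in the vector representation, and `y` holds integer class labels, for example 0 to 6 for the seven space-groups.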

The DNN trained after data augmentation has an accuracy of 92.1% and 84.1% for crystal dimensionality and space-group classification, respectively. The accuracy is comparable to results found in the literature for space-group classification with a convolutional neural network trained on thousands of ICSD patterns and for manual labelling by human experts [10], [48], and is also comparable to similar approaches for other kinds of diffraction data [8], [17]. The neural network seems to be the most adequate method for high-throughput synthesis and characterization loops, as it also performs relatively well in terms of algorithm speed. In the future, our methodology can be extended to other materials systems, and may include other crystal descriptors as predicted outputs, such as lattice parameters and atomic coordinates.

Figure 3: Line plot showing mean Case 3 (Section 2.5) DNN accuracy, as a function of the number of augmented spectra (Eqs. 1–3) included in the training set. The x-axis shows augmented experimental data (based on the original 88 experimental XRD spectra), and the legend shows simulated data (based on the 96 simulated powder-diffraction spectra obtained from the ICSD).

Furthermore, the DNN performs better than k-Nearest Neighbors using DTW. In our test case and dataset, the differences between thin-film and powder spectra seem not to be captured properly by DTW alone. Also, most common implementations of DTW are computationally expensive [45], making the method impractical for analyzing augmented data. Arguably, DTW could be more useful when a larger thin-film XRD pattern dataset is available for k-Nearest Neighbors classification, or when the XRD patterns are similar enough to one another for their differences to be captured within the DTW window parameter [22].

The size of the dataset is critical for obtaining a high accuracy. To explore the effect of augmented dataset size, the DNN accuracy was computed for various combinations of augmented experimental dataset size (i.e., the number of augmented XRD spectra originating from the 88 measured spectra) and augmented simulated dataset size (i.e., the number of augmented XRD spectra originating from the 96 simulated ICSD spectra). Figure 3 summarizes this sensitivity analysis under Case 3 training/testing conditions for space-group classification. As the augmented experimental and simulated datasets grow, the mean accuracy quickly approaches the asymptotic accuracy reported in Table 1. This trend reaffirms and quantifies the importance of data augmentation for the predictive accuracy of our model. The critical augmented-experimental dataset size seems to be around 1,000 augmented spectra, whereas the critical simulated dataset size in the absence of experimental dataset augmentation seems to be around 20,000 patterns. In other words, in our study, 1 experimental spectrum is worth approximately 20 simulated spectra. The model accuracy is more sensitive to the augmented experimental dataset size, likely because most of the dataset variance comes from the experimental XRD patterns or, alternatively, because the amount of variance introduced by Eqs. 1–3 is limited. We chose a rather limited data augmentation strategy in this study because of the sound physical meaning and interpretability of Eqs. 1–3; improved predictive accuracy might be obtained by employing alternative approaches that increase the generalization power of the DNN, such as regularization and dropout [27].

Figure 3 shows that if no data augmentation is used (i.e., at the origin, (0, 0)) and 80% of the 88 experimental spectra are used for training, the predictive accuracy is below 50%. This again reinforces the need for data augmentation with sparse datasets, as is typical in early-stage material development.

Figure 4: Simulation of the trade-off between XRD spectrum acquisition time and ML prediction accuracy. Accuracies for crystal dimensionality and space group predictions are estimated by coarsening the XRD spectrum step size (for Case 3 in Section 2.5).

To evaluate trade-offs between ML classification accuracy and XRD acquisition speed, we investigate how data coarsening of the XRD pattern impacts the accuracy of ML algorithm prediction. In Figure 4, we report Case 3 accuracy with increasing angle step size. The baseline step size of the scan in our XRD patterns is . Data coarsening is performed by removing the data with different step size and rerunning the augmentation and classification algorithms. For crystal dimensionality and space-group classification, the highest accuracies are achieved at , while 90% accuracy is achieved when the step-size is or less for both cases. Using the larger step-size, the XRD pattern acquisition time can be reduced by 75%, allowing the full spectra to be measured and classified in less than 5.5 minutes.
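The coarsening step can be illustrated with a short NumPy sketch that keeps every n-th point of a pattern. The function names, the scan range, and the baseline step size in the usage note are hypothetical, and the time estimate assumes that acquisition time is proportional to the number of measured points (a fixed dwell time per step), which the text does not state explicitly.

```python
import numpy as np

def coarsen(two_theta, intensity, factor):
    """Keep every factor-th point, emulating a scan with a larger step size."""
    return np.asarray(two_theta)[::factor], np.asarray(intensity)[::factor]

def acquisition_time_saving(factor):
    """Fraction of scan time saved, assuming time scales with point count."""
    return 1.0 - 1.0 / factor
```

Under this assumption, coarsening by a factor of 4 removes three of every four points and corresponds to a 75% reduction in acquisition time; for a hypothetical baseline step of 0.04° it would emulate a scan with a 0.16° step.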

4 Conclusions

In this work, we develop a supervised machine learning framework to screen novel materials based on the analysis of their XRD spectra. The framework is designed specifically for cases when only sparse datasets are available, e.g., early-stage high-throughput material development and discovery loops. Specifically, we propose a physics-based data augmentation method that extends small, targeted experimental and simulated datasets, and captures the possible differences between simulated XRD powder patterns and experimental thin-film XRD patterns. A few hundred augmented spectra were found to increase our classification accuracy from < 50% to 92.1% for dimensionality and 84.1% for space group.

When trained with both augmented simulated and experimental XRD spectra, deep neural networks are found to have the highest accuracy among the supervised machine learning methods studied. Deep neural networks can also be trained and executed within minutes, and are among the three fastest methods studied when ranked by evaluation time, making them suitable for deployment in accelerated materials development loops. We find that the deep neural network model tolerates coarsening of the training data, providing future opportunities for online learning, i.e., the on-the-fly adaptive adjustment of XRD measurement parameters by taking feedback from machine learning algorithms [16]. In the future, our approach may be generalized to predict other crystal descriptors such as generalized atomic coordinates or lattice symmetry.


[1] A. Tabor, D. Roch, and L. Saikin, “Accelerating the discovery of materials for clean energy in the era of smart automation,” 2018.

[2] J.-P. Correa-Baena et al., “Accelerating Materials Development via Automation, Machine Learning, and High-Performance Computing,” Joule, vol. 2, no. 8, pp. 1410–1420, Aug. 2018.

[3] H. M. Rietveld, “A profile refinement method for nuclear and magnetic structures,” J. Appl. Crystallogr., vol. 2, no. 2, pp. 65–71, 1969.

[4] D. A. Carr, M. Lach-hab, S. Yang, I. I. Vaisman, and E. Blaisten-Barojas, “Machine learning approach for structure-based zeolite classification,” Microporous Mesoporous Mater., vol. 117, no. 1–2, pp. 339–349, 2009.

[5] L. A. Baumes, M. Moliner, N. Nicoloyannis, and A. Corma, “A reliable methodology for high throughput identification of a mixture of crystallographic phases from powder X-ray diffraction data,” CrystEngComm, vol. 10, no. 10, pp. 1321–1324, 2008.

[6] L. A. Baumes, M. Moliner, and A. Corma, “Design of a Full-profile-matching solution for high-throughput analysis of multiphase samples through powder X-ray diffraction,” Chem. - A Eur. J., vol. 15, no. 17, pp. 4258–4269, 2009.

[7] V. Stanev, V. V. Vesselinov, A. G. Kusne, G. Antoszewski, I. Takeuchi, and B. S. Alexandrov, “Unsupervised phase mapping of X-ray diffraction data by nonnegative matrix factorization integrated with custom clustering,” npj Comput. Mater., vol. 4, no. 1, 2018.

[8] A. G. Kusne, D. Keller, A. Anderson, A. Zaban, and I. Takeuchi, “High-throughput determination of structural phase diagram and constituent phases using GRENDEL,” Nanotechnology, vol. 26, no. 44, 2015.

[9] W. B. Park et al., “Classification of crystal structure using a convolutional neural network,” IUCrJ, vol. 4, pp. 486–494, 2017.

[10] T. Ida, M. Ando, and H. Toraya, “Extended pseudo-Voigt function for approximating the Voigt profile,” J. Appl. Crystallogr., vol. 33, no. 6, pp. 1311–1316, 2000.

[11] W. B. Park, S. P. Singh, C. Yoon, and K. S. Sohn, “Combinatorial chemistry of oxynitride phosphors and discovery of a novel phosphor for use in light emitting diodes,” J. Mater. Chem. C, vol. 1, no. 9, pp. 1832–1839, 2013.

[12] V. B. Rybakov, E. V. Babaev, K. Y. Pasichnichenko, and E. J. Sonneveld, “X-ray mapping in heterocyclic design: VI. X-ray diffraction study of 3-(isonicotinoyl)-2-oxooxazolo[3,2-a]pyridine and the product of its hydrolysis,” Crystallogr. Reports, vol. 47, no. 1, pp. 473–477, 2002.

[13] N. Hirosaki, T. Takeda, S. Funahashi, and R. J. Xie, “Discovery of new nitridosilicate phosphors for solid state lighting by the single-particle-diagnosis approach,” Chem. Mater., vol. 26, no. 14, pp. 4280–4288, 2014.

[14] S. K. Suram et al., “Automated phase mapping with AgileFD and its application to light absorber discovery in the V-Mn-Nb oxide system,” ACS Comb. Sci., vol. 19, no. 1, pp. 37–46, 2017.

[15] F. Ren et al., “Accelerated discovery of metallic glasses through iteration of machine learning and high-throughput experiments,” Sci. Adv., vol. 4, no. 4, p. eaaq1566, Apr. 2018.

[16] A. Ziletti, D. Kumar, M. Scheffler, and L. M. Ghiringhelli, “Insightful classification of crystal structures using deep learning,” Nat. Commun., vol. 9, no. 1, pp. 1–10, 2018.

[17] T. W. Ke, A. S. Brewster, S. X. Yu, D. Ushizima, C. Yang, and N. K. Sauter, “A convolutional neural network-based screening tool for X-ray serial crystallography,” J. Synchrotron Radiat., vol. 25, no. 3, pp. 655–670, 2018.

[18] R. Le Bras et al., “A Computational Challenge Problem in Materials Discovery: Synthetic Problem Generator and Real-World Datasets,” Aaai, pp. 438–443, 2014.

[19] M. Järvinen, “Application of symmetrized harmonics expansion to correction of the preferred orientation effect,” J. Appl. Crystallogr., vol. 26, no. 4, pp. 525–531, 1993.

[20] P. F. Fewster and J. I. Langford, “X-ray analysis of thin films and multilayers,” Reports on Progress in Physics, 1996.

[21] Y. Iwasaki, A. G. Kusne, and I. Takeuchi, “Comparison of dissimilarity measures for cluster analysis of X-ray diffraction data from combinatorial libraries,” npj Comput. Mater., vol. 3, no. 1, pp. 1–8, 2017.

[22] A. Belsky, M. Hellenbrandt, V. L. Karen, and P. Luksch, “New developments in the Inorganic Crystal Structure Database (ICSD): Accessibility in support of materials research and design,” Acta Crystallogr. Sect. B Struct. Sci., vol. 58, no. 3, pp. 364–369, 2002.

[23] G. E. Eperon, S. D. Stranks, C. Menelaou, M. B. Johnston, L. M. Herz, and H. J. Snaith, “Formamidinium lead trihalide: A broadly tunable perovskite for efficient planar heterojunction solar cells,” Energy Environ. Sci., vol. 7, no. 3, pp. 982–988, Feb. 2014.

[24] M. M. Lee, J. Teuscher, T. Miyasaka, T. N. Murakami, and H. J. Snaith, “Efficient Hybrid Solar Cells Based on Meso-Superstructured Organometal Halide Perovskites,” Science, vol. 338, no. 6107, pp. 643–647, 2012.

[25] R. L. Z. Hoye et al., “Perovskite-Inspired Photovoltaic Materials: Toward Best Practices in Materials Characterization and Calculations,” Chem. Mater., vol. 29, no. 5, pp. 1964–1988, Mar. 2017.

[26] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA: MIT Press, 2016.

[27] F. Pedregosa et al., “Scikit-learn: Machine Learning in Python,” J. Mach. Learn. Res., vol. 12, pp. 2825–2830, 2011.

[28] T. Baikie et al., “Synthesis and crystal chemistry of the hybrid perovskite (CH3NH3)PbI3 for solid-state sensitised solar cell applications,” J. Mater. Chem. A, vol. 1, no. 18, pp. 5628–5641, 2013.

[29] S. Sun, S. Tominaka, J.-H. Lee, F. Xie, P. D. Bristowe, and A. K. Cheetham, “Synthesis, crystal structure, and properties of a perovskite-related bismuth phase, (NH4)3Bi2I9,” APL Mater., vol. 4, no. 3, p. 031101, Mar. 2016.

[30] L. Etgar, “The merit of perovskite’s dimensionality; Can this replace the 3D halide perovskite?,” Energy Environ. Sci., vol. 11, no. 2, pp. 234–242, Feb. 2018.

[31] Z. Xiao, W. Meng, J. Wang, D. B. Mitzi, and Y. Yan, “Searching for promising new perovskite-based photovoltaic absorbers: the importance of electronic dimensionality,” Mater. Horiz., 2017.

[32] T. Zhang, M. Long, P. Liu, W. Xie, and J.-B. Xu, “Stable and Efficient 3D-2D Perovskite-Perovskite Planar Heterojunction Solar Cell without Organic Hole Transport Layer,” Joule, 2018.

[33] R. C. Kurchin, P. Gorai, T. Buonassisi, and V. Stevanović, “Structural and Chemical Features Giving Rise to Defect Tolerance of Binary Semiconductors,” Chem. Mater., vol. 30, no. 16, pp. 5583–5592, Aug. 2018.

[34] A. A. Coelho, “TOPAS-Academic, Version 6: Technical Reference,” p. 208, 2016.

[35] S. Kobayashi and K. Inaba, “X-ray thin-film measurement techniques,” Rigaku J., vol. 28, no. 1, p. 8, 2012.

[36] W. H. Press and S. A. Teukolsky, “Savitzky-Golay Smoothing Filters,” Comput. Phys., vol. 4, no. 6, p. 669, 1990.

[37] R. J. Hill and C. J. Howard, “Quantitative phase analysis from neutron powder diffraction data using the Rietveld method,” J. Appl. Crystallogr., vol. 20, no. 6, pp. 467–474, 1987.

[38] T. Degen, M. Sadki, E. Bron, U. König, and G. Nénert, “The HighScore suite,” Powder Diffr., vol. 29, no. S2, pp. S13–S18, 2014.

[39] R. E. Dinnebier, Powder diffraction : theory and practice. RSC Publ, 2009.

[40] S. Ermon et al., “Pattern Decomposition with Complex Combinatorial Constraints: Application to Materials Discovery,” 2014.

[41] J. Zhao et al., “Strained hybrid perovskite thin films and their impact on the intrinsic stability of perovskite solar cells,” Sci. Adv., vol. 3, no. 11, p. eaao5616, Nov. 2017.

[42] M. Järvinen, “Application of symmetrized harmonics expansion to correction of the preferred orientation effect,” J. Appl. Crystallogr., vol. 26, no. 4, pp. 525–531, 1993.

[43] J. Friedman, T. Hastie, and R. Tibshirani, The Elements of Statistical Learning. New York, NY, USA: Springer, 2001.

[44] S. Salvador and P. Chan, “FastDTW : Toward Accurate Dynamic Time Warping in Linear Time and Space,” Time, vol. 11, no. 5, pp. 70–80, 2004.

[45] G. Haixiang, L. Yijing, J. Shang, G. Mingyun, H. Yuanyue, and G. Bing, “Learning from class-imbalanced data: Review of methods and applications,” Expert Syst. Appl., vol. 73, pp. 220–239, 2017.

[46] N. V Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE: synthetic minority over-sampling technique,” J. Artif. Intell. Res., vol. 16, pp. 321–357, 2002.

[47] C. H. Yoon et al., “Unsupervised classification of single-particle X-ray diffraction snapshots by spectral clustering,” Opt. Express, vol. 19, no. 17, p. 16542, 2011.