Machine learning (ML) has seen lots of successful applications in the fields of chemical informatics and electronic structure theory in recent years[Botu2019, Kitchin2018]
. Scientists have applied various fingerprinting approaches to describe atoms or their associated electronic environments in molecular systems, and trained ML models that connects these high-dimensional feature vectors to some properties of interest.
One challenge that emerged from these studies is the lack of connection between these ML inputs and the underlying molecular systems. More specifically, with a set of feature vectors, there are no current tools for scientists to easily locate the corresponding atoms or electronic environments in the molecular systems. This is an issue for chemists, who typically develop strong “chemical intuition” through years of research and development experience. This intuition is based on the location and context of atoms or electrons in molecular systems, not on the derived mathematical features of ML models. The lack of connection prevents chemists from understanding the features intuitively. This makes it difficult for them to deduce the reason behind unsuccessful fingerprinting systems, or to improve them based on their chemical knowledge. Currently, researchers work around this mostly by trial and error: keep trying new fingerprinting system until the accuracy of the model provides indirect evidence that it works. While some systematic approaches have emerged [Imbalzano_2018], they are still based on enumeration of many possible candidates and do not enable chemists to apply their domain knowledge to select or analyze the resulting fingerprints.
For example, a researcher might want to select features that differentiate the electronic structure of a bond between carbon (C) and nitrogen (N). This could be achieved by analyzing the electronic structure of a CHNO molecule as shown on the left of Fig. ElectroLens: Understanding Atomistic Simulations through Spatially-Resolved Visualization of High-Dimensional Features along with features derived using convolutional kernels [MLXC2019] (right). By identifying and selecting sub-sets of these features the researcher can learn that when the combination of two features is within certain window (shown by the selection in the bottom right) the C-N bond is selected. This enables the researcher to identify this as a critical fingerprint in subsequent machine-learning models.
Visualization has strong potential to solve this problem, enabling users to not only look at the high-dimensional feature vectors, but also establish the connections between them and their corresponding regions in the actual system. The connection will help researchers link the features to their chemical domain knowledge for easier understanding. This is highlighted by the example above, where a researcher can use visualization to establish an connection between the numerical range of certain features and chemical concepts like C-N bonding, which can facilitate feature engineering and model improvement. Moreover, by considering the prediction accuracy of a ML model as an additional feature, the same tool can be used to directly visualize the performance of the model, identify problematic regions, and improve the model accordingly.
Contributions. In this study, we work with experts in ML models for electronic structure theory and contribute:
ElectroLens, a 3D visualization tool for high-dimensional spatially-resolved features associated with atoms and electronic environments of molecular systems. The UI of ElectroLens, shown in ElectroLens: Understanding Atomistic Simulations through Spatially-Resolved Visualization of High-Dimensional Features, consists of two parts: the 3D view(s) (left) corresponding to Cartesian space, and 2D plots (right) corresponding to projections of the feature space.
Interactive visualization design that displays high dimensional data through a series of 3D and 2D plots and connects them via interactive selection (section 4). ElectroLens allows the user to make selections on the 2D plots, and the corresponding regions in real space will be highlighted in the 3D view.
Scalable implementation that can process and visualize more than one million data points at a high frame rate of 60 FPS, on a commodity laptop computer.
An open-source desktop application with Python bindings to the Atomic Simulation Environment (ASE) library [ase-paper] that is commonly used by scientists for managing atomistic simulations. ElectroLens is currently hosted on Github at medford-group.github.io/ElectroLens/. Tutorials and documentation are provided on the website.
Usage scenarios illustrate how ElectroLens helped to decipher new descriptor systems and ML model performance in our research on ML models for electronic and atomistic structures.
Fingerprinting approaches are used to construct descriptive features to capture different aspects of molecular systems in the communities of ML electronic structure theory and chemical informatics. One common approach is to fingerprint molecular structures, leading to atom-centered features that capture the local chemical environment of a specific atom. There are numerous schemes including overlap integrals of Gaussians or atomic orbitals [BehlerParrinello, BartokSOAP2013], Zernike polynomials [KHORSHIDI2016310], and others [Chandrasekaran2019]. The result of these schemes is a high (typically 10) dimensional vector that describes each atom, which can be described as vectors corresponding to points on a irregular grid. An alternative approach is to examine the electronic structure surrounding the atoms instead of the atoms themselves. One strategy to achieve this is to treat the electron cloud as a voxelized 3D image, and to construct features to describe each voxel. For example, the recently-developed MCSH descriptors work in this way [MLXC2019]. In this case, the resulting feature vectors correspond to points on a regular grid. In both cases the problem involves high-dimensional features that correspond to points in 3D Cartesian space.
The existing fingerprinting systems have been the foundation of numerous successful ML models for predicting the energies from atomic configurations [Botu2019, Behler2016], or energy contributions from local electronic structure [MLXC2019]
. However, there is a significant challenge in understanding the physical or chemical meaning of the features which are typically based on mathematical transformations rather than physical derivations. This problem will likely increase with the rise of deep learning techniques where the features are determined by the algorithm, making it even harder to assign specific meaning. Visualization provides a promising route to gain intuition about big, high-dimensional data sets and corresponding ML models[7784854, Hohman2018]. While there are numerous software packages available for visualizing atomistic data sets [HUMP96, VESTA32011, MoldenPaper1, OpenStructure, MView, Jmol, Pymol, Tomviz], they have all been developed with chemical properties in mind, and are optimized for visualizing chemical concepts such as atom types, chemical bonds, and electron density. However, none are optimized to visualize the high-dimensional features used in ML models. Most programs are limited to visualizing a maximum of 2 features other than position (typically element type and radius) with a single 3D view, and often become sluggish when visualizing more than ten thousand data points. The data sizes required for ML models typically exceed this limit in terms of number of data points and dimensions per point.
3 Design Challenge
To establish an intuitive link between feature vectors and actual systems, we have worked with domain experts to identify six design challenges that are currently not addressed by other tools in the electronic structure theory or chemical informatics communities.
Visualizing high-dimensional feature vectors. The typical fingerprinting systems used in research generates tens or even hundreds of features per atom or grid point. Visualization of the distribution and correlations between these features is crucial to the understanding of their meaning.
Connecting features and corresponding Cartesian space. It is critical to visualize the features in their corresponding chemical context. A seamless connection between these alternative representations of the system is important to grasp the meanings of the features. Unfortunately, there is no existing tool in the community that establishes these connections visually.
Comparing features across different systems. Assessing the generality of a given relationship between a feature vector and a chemical system requires comparison of multiple systems simultaneously. For example, if a feature is found to correspond to the C=O bonding region in a CO molecule, it would be of interest to know if the same feature also corresponds to the C=O bond in a similar molecule such as HCOOH. However, most existing visualization tools for atomic/electronic structure focus on visualizing a single system at a time.
Visualizing atomistic and electronic structures simultaneously. The atomic structure and electronic structure are intimately linked, but are represented with different data structures: irregular grids for atoms, and regular grids for electronic environments. The ability to simultaneously visualize the two data types, along with features that describe environments within these two related structures, will provide a new route to building intuitive connections between the two representations.
Handling large datasets. Datasets used for ML training are often very large, containing millions of data points with tens of dimensions or more. Most tools for visualizing atomistic and electronic structure data do not scale well to datasets of extreme size. Building intuition requires interactive visualization of such large datasets, so fast rendering speed is critical.
Integrating with existing infrastructure. Simulations of atomistic and electronic structure involve a wide range of tools and software packages, including molecular dynamics, density functional theory, and wavefunction theory codes. These tools have a range of input/output data structures, so it is important for visualization tools to integrate with existing software packages that already act as API’s for integrating the tools.
4.1 Main Views and Interactions
3D View. The 3D view corresponds to physical systems in Cartesian space. For atomistic simulations, the physical information can take two forms: atom positions (irregular grids) or electron/wavefunction densities (regular grids). ElectroLens can visualize these distinct data types separately or simultaneously, as shown in Figs. 1-3. For the atomic features, ElectroLens uses the common “ball-and-stick” model to display the location of the atoms and bonds. The size and color of the “balls” can be used to code 2 features of interest, which are assumed to be the atomic type by default, but could be other features. For the electronic environment features, ElectroLens renders this data as a point cloud to mimic the electron cloud, where the density of points corresponds to the density of electron cloud, and the color can be used to encode an extra feature. The view is highly customizable for users to adjust visualization and highlight regions of interest. For example, users can adjust the size, transparency, color scale, color map, or density of the point cloud with the option box of the view. On the other hand, one could also change the resolution, color, size of the ball-and-stick model and more as well. The view also supports “slicing”, allowing users to look at cross sections of the electron density.
2D Plots and Selection. The 3D view is limited to visualizing 2 spatially-resolved features through double-encoding on size/color (ball and stick) or density/color (point clouds). To display additional features without loss of information [C1], a series of connected 2D plots are used. Three kinds of 2D plots are implemented: scatter plots, correlation plots and dimension-reduction plots. The scatter plots are heat maps where two features of choice are plotted and the color of the heat map corresponds to the amount of points at a given region of feature space. The correlation plot is a visual representation of the correlation matrix of the features. This provides a fast way to explore the high-dimensional feature space and identify interesting combinations, since feature combinations with low correlation tend to contain the most information. The dimension-reduction plots provide an alternate strategy for navigating the high-dimensional space. They utilize principal component analyis (PCA) reduce the dimensionality of the features and plot the resulting projection using a scatter plot. These 2D plots can be added to the right-hand view on the fly with common transformations like “log10” for better visualization. Users can also choose to simultaneously plot features corresponding to atomic and electronic environments to assess connections between the two [C4]. One key functionality of the scatter plots and dimension-reduction plots is the ability to select regions in the 2D plots and see how these features are localized in 3D Cartesian space [C2], as highlighted in Fig. ElectroLens: Understanding Atomistic Simulations through Spatially-Resolved Visualization of High-Dimensional Features. ElectroLens supports exporting these sub-sets of points for further analysis, as well as saving specific views that can be loaded in later or shared with other users. Users can customize the 2D plots (color map, scatter size, etc.) as well.
Multiple 3D System Views. ElectroLens allows user to visualize and compare an arbitrary number of systems simultaneously, and use the 2D plots to plot the pooled data across all systems (Fig. 1) to understand the generality of connections between feature vectors and chemical concepts . In other words, points in the 2D plots may correspond to regions in multiple systems. When the points are selected in the 2D view, the corresponding regions in all systems where the points appear will be highlighted, as illustrated in Fig. 2.
Python and ASE bindings. Many researchers who conduct atomistic simulations use Python to setup simulations and analyze results. Therefore, we have also created Python bindings for ElectroLens by wrapping with the CEFpython library [Tomczak2019cefpython]. The bindings are designed to follow the structure of the ASE Python library [ase-paper], which is widely used by researchers in the field and contains classes corresponding to common data structures like atom positions . ASE also contains numerous translators for reading/writing a range of different file types into standardized data structures. By leveraging Python and ASE, ElectroLens is compatible with a wide range of file formats, and easy to use for anyone familiar with ASE.
4.2 Case Studies
This section presents two example applications of ElectroLens, presenting how it facilitated and accelerated research in ML and chemistry. They are: (1) Diagnosing the failure of a neural network for predicting exchange-correlation energy from electron density (Sect.4.2.1) and (2) Identifying atomic configurations corresponding to a failing machine-learned force field (Sect. 4.2.2).
4.2.1 Constructing ML models from electron density
ElectroLens was originally developed to support a research project to establish a ML model to predict the exchange-correlation energy, a key quantity needed in density functional theory (DFT) [MLXC2019]. Briefly, DFT is a widely-used technique for simulating the electronic structure of molecular systems. The formalism is based on a powerful theorem proving a one-to-one mapping between electron density and ground-state energy [doi:10.1063/1.4869598, RevModPhys.87.897, PhysRev.136.B864]. However, the connection to one part of the energy, known as the exchange-correlation (xc) energy, is unknown. Our work focused on using neural network (NN) to empirically learn this connection. As a first step, a NN model was trained between the local electron density at a point and the corresponding xc energy at the same point [HandyAndTozer]. However, the resulting accuracy was insufficient, and ElectroLens was used to diagnose this result.
Connecting chemistry to features. We diagnosed this problem by plotting the NN prediction error as a function of the electron density that was the input of the NN. This revealed a multi-valued error distribution with three distinct “tails” (Fig. 1, upper right). This explains the poor accuracy of the model because multi-valued functions can not be regressed, and implies there are electronic environments that can not be distinguished by electron density alone. However, the chemical nature of the problematic environments was not immediately clear. We selected several sample systems (CO, NO, and HCOOH) and utilized ElectroLens to simultaneously visualize the electronic environments and atom positions (Fig.1 left panels). Through interactive selection, ElectroLens revealed that the tails correspond to the atomic core regions for C, N, and O atoms, as shown in Fig. 2. This clearly illustrated that the NN model was failing due to the electron density in the atomic core regions.
Understanding relationships between features. To further diagnose the issue, additional features were selected to distinguish the C, N, and O core regions. Prior models utilize the derivative of the electron density, so this was one of the features we investigated. We used ElectroLens to plot the derivative as a function of the electron density and observed a similar three-tailed structure, as shown in Fig. 1 (bottom right). ElectroLens was used to show that these tails corresponded to the tails in the error distribution (Fig. 2). This provided the insight that the derivative of the electron density is capable of distinguishing the C, N, and O core regions, and led to a new and more accurate NN model based on the density and its derivative. This evolved into the development of new rotationally-invariant versions of the derivatives based on convolutions with MCSH kernels, ultimately improving the accuracy of the models even further [MLXC2019].
From this case study, two features have proven to be crucial: (1) The ability to simultaneously compare across different systems; (2) The ability to interact smoothly (60 fps) with a large (1 million 47-dimensional points) data set.
4.2.2 Assessing ML force-fields based on atomic positions
The construction of ML force-fields for predicting energies and forces directly from atomic structures is a rapidly growing sub-field of computational chemistry [BehlerReview2015, Kitchin2018, Botu2019, Ying2017]
. Briefly, these methods take atomic positions as inputs, then use symmetry-preserving fingerprints to distinguish the chemical environments experienced by each atom. These fingerprints are used as inputs to ML models, typically NNs, and trained to rapidly and accurately reproduce forces and energies of each atom using training data generated with expensive quantum-mechanical simulations. However, the black-box nature of the process makes it difficult to diagnose failures. In our research group we commonly experience a specific failure mode where a model is accurate for training data, but exhibits erroneously large forces when applied in a molecular dynamics simulation. This occurs because some atoms in the system move into chemical environments that were not sampled in the training data, leading to inaccurate results. However, it can be particulary difficult to pinpoint exactly which atoms are outliers, making it challenging to identify new training systems that will improve the model.
View customization and double encoding. ElectroLens supports encoding features or properties such as the force magnitude in the 3D visualization frame. This enables views like the one shown in Fig. 3, where atoms with excessively large forces can be easily identified, as they are colored by the extreme ends of the chosen color map. This can be used to construct new training data that contains atomic environments similar to those with large errors.
Interactive selection with atomistic features. In addition to visualizing errors in the 3D frame, it is also possible to plot them in the 2D frames, along with other input features to the model. This provides a route to identify regions of the feature space that have not been sampled, facilitating the generation of new training data or implementation of methods to raise errors when the model moves outside the domain of the training data.
The main lesson learned from this study is that the use of double-encoding features using color or size can assist researchers in identifying interesting patterns. This is not standard in visualizing chemical data, since size and color are typically determined by the element.
5 Conclusion and Future Work
A new visualization tool called ElectroLens has been developed to analyze high-dimensional features derived from 3D datasets. The tool has been developed in the context of atomic and electronic structure data, corresponding to two different representations (irregular and regular grids). ElectroLens is built to be efficient with large datasets easy to use and integrate with existing infrastructure for atomstic simulations. ElectroLens was applied and tested in multiple scenarios, leading to improved ML models for exchange-correlation energies and atomistic force fields (Sect. 4.2). Future work includes implementing other projections into ElectroLens, as well as studying methods to automatically infer informative feature combinations. We also plan to focus on applications of the tool to the increasingly popular NN force-fields for molecular dynamics simulations [BehlerReview2015, Kitchin2018, Botu2019, Ying2017], including improved handling of time-dependent data sets. Further, we expect that ElectroLens may be useful for a wider set of problems in chemistry, physics, and engineering involving spatially-resolved high-dimensional data.