The proliferation of data-driven science creates the need for systematically organized and machine-readable data formats.5 While in the past considerable effort has been dedicated to structuring simulation results, the organization of simulation metadata has only recently gained attention. One major aspect of such an undertaking is the classification of numerous computational models. Early forms48, 33, 52 of the categorization of quantum chemical models were based on only a few distinguishing descriptors, i.e. the treatment of electron correlation and one-particle basis set, as well as the type of Hamiltonian. More recently, projects emerged which not only collect results and metadata from output files of simulation packages but also define database schemas for their storage. For instance, the Novel Materials Discovery (NOMAD)18 repository includes a structured collection of computational model metadata as part of its metainfo component. Another example is QCSchema43 by the Molecular Science Software Institute (MolSSI), which provides software-independent data structures for quantum chemistry geared towards unified and consistent workflows. Organizing the computational models and their results can also be achieved through an ontology, often expressed in the Web Ontology Language (OWL)44. One such example is the OntoCompChem ontology38, which is applied to quantum chemistry calculations as part of the MolHub46 web service. In the domain of materials science, ontologies are more prevalent but often focus on specific subdomains such as nanoparticles53. There are, however, examples of general ontologies, such as the Elementary Multiperspective Material Ontology (EMMO)1 or the Materials Design Ontology (MDO)41. There exist a number of NIST/MGI-led prior efforts where large-scale high-throughput computational approaches have been used to screen thousands of compounds with subsequent web-based dissemination, databasing, and data-mining. 14, 31, 50, 10, 34, 18, 13, 29
Most current data structures for computational models include little information beyond the model's name, relying instead on descriptions in the scientific literature. Such an approach makes it difficult to construct data-driven predictions. We elaborate on the existing approaches by constructing a framework able to utilize previously obtained data (and to support the generation of new data) to categorize the important descriptive features for a set of entities (materials, simulation workflows/models, computational methods) and target properties of interest (electronic, chemical, thermodynamic, structural properties), to construct associative maps, and to organize "actionable" data in this extremely diverse and complex domain.
Our effort follows an object-oriented approach by building a basis of unit models, which are small inseparable sets of equations pertaining to a specific physical description of reality (e.g. Kohn-Sham Density Functional Theory). A given level of theory may further be expressed as a combination of unit models. Such modularity allows us to cover a diverse range of use cases. In addition to the categorization of the computational models, we discuss a semantic layer in the form of an ontology, which not only facilitates a more accurate description of the relationships between models, but also provides the foundation for further applications (e.g. building a knowledge graph, improved search, or AI/ML engines). The design of the proposed data structures is also coupled with their application inside an online software platform22, which allows us to create a very short feedback loop and to improve the resulting implementation based on feedback from thousands of users of the platform. Our standards facilitate the development of artificial intelligence tools that can reduce the dimensionality and complexity of the research work in materials science and chemistry, with the aim of eventually enabling inverse design. Our goal is to build collective intelligence utilizing contributions from a large audience of materials scientists (well beyond the select few experts in the computational field) and chemists in a controllable and high-level fashion.
The following section of this paper will introduce the simulation entities and the principles on which the model categorization scheme is based. The subsequent section illustrates the representation of the entities as data structures using a set of selected examples. In the fourth section, we discuss the relevancy and limitations of the categorization scheme and propose a community-driven approach to extend the categorization scheme. Finally, the principal conclusions are presented together with a perspective on future applications.
2.1 General Approach
On an abstract level, we choose to represent a computational simulation in the form of several key entities (see Fig. 1). First of all, the material entity defines the chemical composition of the system under investigation. Although this entity is termed material, it may also constitute a non-periodic molecular species. The workflow entity organizes the sequence of tasks, for instance, the execution of simulation software or input/output operations. The workflow also includes the specification of the theoretical model, which in turn is represented by the model entity. It should be noted that both the workflow and model entities are composed of reusable units, which may be combined in various ways. Applying the workflow to a material in order to produce one or more properties, i.e. connecting the workflow and material entities, is achieved by the job entity. This entity not only serves as a container for material and workflow entities but also stores more technical information related to high-performance computing and resource allocation. Properties may either be derived from existing entities (gray) or occur as a direct result of the job (black). In the latter case, one can associate a precision entity that is derived from the workflow. An important factor in determining the precision is the practical approach for solving the theoretical model, which is represented by the method entity (M). As part of a model (or unit model), it stores the selection of algorithms, thresholds, and other practical parameters.
For use on the Exabyte.io platform22 and in document-based, NoSQL (not only structured query language) databases in general, these entities are formulated as database schemas. Document-based, NoSQL databases are a convenient choice for such an application due to a few advantageous properties: (a) data that is accessed together is stored together (as opposed to joined from multiple data tables), (b) the organization of data can be as complex as one chooses, thus supporting parent-child hierarchical structures, (c) data structures are not fixed and may be changed in response to new data models. The proposed database schemas have been implemented in the Exabyte Source of Schemas and Examples (ESSE)21 using the JSON Schema notation (Draft-04)32.
2.2 Components and Entities
The ESSE module comprises several main schemas for simulation data, such as workflow, material, property, method, and model. The following sections focus on the latter two entities, while all other schemas are briefly summarized in Sec. 2.2.3.
The function of the model schema is to define a given computational model as accurately as possible and to simultaneously store all of the necessary metadata. We chose to represent a given computational model in terms of one or more reusable, independent components, which we will refer to as unit models in the following. The final model which is applied to a system is a combination of said unit models and is termed a compound model. We define a unit model as the smallest, logically consistent set of equations or operators associated with a central property (e.g. electronic energy). In practice, a unit model may not always be unambiguously defined, i.e. such cases may require further partitioning into a set of unit models that would go beyond the scope of this classification scheme. The classification, therefore, requires a subtle balance of exactness and pragmatism. The classification tree involves three main tiers (Figure 2) in order to presort the models into families, for instance quantum mechanical or classical. Following tier III, the models are further divided into more specific categories, which are also organized hierarchically using type and subtype specifiers. The design of the schemas follows an object-oriented approach whereby schemas share fields through inheritance (by means of the allOf keyword). At each level of the classification tree, a given schema thus includes all categorization specifiers of the preceding levels and may serve as a prototype for the following level. One advantage of this concept is that changes to categories or the introduction of new categories only occur locally and are propagated automatically to the lower levels.
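To make the allOf-style inheritance concrete, the following minimal Python sketch layers categorization specifiers tier by tier, so that each tier's schema carries all specifiers of the preceding levels. The field names and enum values are illustrative assumptions, not the actual ESSE definitions.

```python
# Sketch of allOf-style schema inheritance: each tier's schema reuses the
# previous tier's fields and adds its own specifiers.
# All field names here are illustrative, not the actual ESSE definitions.

def extend(parent_schema, extra_properties):
    """Mimic JSON Schema's allOf by merging a parent's properties
    with tier-specific additions."""
    merged = dict(parent_schema["properties"])
    merged.update(extra_properties)
    return {"properties": merged}

# Tier I: physics-based vs. statistical
tier1 = {"properties": {"categoryTier1": {"enum": ["pb", "st"]}}}

# Tier II inherits tier I and adds the particle-type specifier
tier2 = extend(tier1, {"categoryTier2": {"enum": ["qm", "at", "mes"]}})

# Tier III inherits both preceding tiers
tier3 = extend(tier2, {"categoryTier3": {"enum": ["abin", "dft", "semp"]}})

# A change to tier1 propagates automatically when the chain is rebuilt,
# while tier3 still carries all specifiers of the preceding levels:
assert set(tier3["properties"]) == {"categoryTier1", "categoryTier2",
                                    "categoryTier3"}
```

In an actual JSON Schema file this layering would be expressed declaratively with `allOf` and `$ref` rather than an imperative merge; the sketch only illustrates the propagation behavior.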
In addition to the classification specifiers, each unit model comprises a tags field, which is also passed on through the categorization hierarchy (see Table 1 for examples). The tags describe attributes of the unit model not included in the categorization and indicate whether a modifier or augmentation has been applied to the unit model. We define a modifier as an addition to a model, which expands upon the underlying physical principle without fundamentally changing the working equations (e.g. in a linear fashion). An example of a modifier is the inclusion of an additional external potential (e.g. due to point charges). An augmentation, on the other hand, is defined as an addition to a model, which does not change the underlying physical principles of the model. For instance, augmentations include acceleration techniques such as resolution-of-the-identity or localization schemes (e.g. Edmiston-Ruedenberg localization20). Apart from the specifications above, the tags field may also hold user-defined labels.
At the primary level, unit models are divided into the physics-based (pb) and statistical (st) categories. The latter category pertains to data-driven approaches which employ statistical relations in order to predict a result. Models based on fundamental laws of physics are assigned to the physics-based category even if the unit model heavily relies on statistical elements (see also Sec. 2.3).
Within the physics-based group, the quantum mechanical (qm), atomistic (at), or mesoscopic (mes) categories are used depending on the type of particle represented in the equations of the unit model. For instance, Kohn-Sham density functional theory (KS-DFT) naturally falls into the qm category due to its explicit dependence on electronic coordinates, while a force field such as CHARMM2242 only depends on atomic variables and is thus assigned to at. In the statistical group, a unit model falls into the probabilistic (prob) category if the predicted result incorporates some aspect of random variation, whereas the deterministic category (det) is chosen if it does not.
The quantum-mechanical models are further divided into three categories. The ab initio category (abin) comprises first-principle wavefunction models, such as Hartree-Fock theory, many-body perturbation theory, or coupled cluster theory, which do not require additional information about the system. With the electron density as the central quantum mechanical descriptor, the realizations of density functional theory are collected in the dft category. The semi-empirical category (semp) contains parametrized quantum-mechanical models, which usually only describe valence electrons for computational efficiency. As for the statistical model branch, Figure 2 shows how the deterministic models can be further subdivided using the example of three prominent machine learning model categories. Linear models (lin) span the space of models which assume a linear relationship between the input variables and the dependent variable. The neural network category (nn) contains all models that are based on a neural network architecture, i.e. a network of interconnected processing nodes whereby connections between nodes are represented by weights. The decision tree category (dtr), on the other hand, comprises models which generate a prediction by recursively splitting the dataset into subsets. The outcome of such a procedure is an acyclic graph of decision nodes and 'leaves' (endpoints, which do not split the data any further). The decision tree approach is usually applied in an ensemble (random forest model), which in the CateCom scheme is represented as a compound model. Other examples of deterministic models not explicitly shown in Figure 2 include support vector machines and clustering algorithms, such as k-means.
| Tag | Description | Category path |
|---|---|---|
| relativistic | Inclusion of relativistic effects. | pb |
| user-adjustable | The model contains additional parameters to fine-tune results. | pb |
| scaling-power-n | The model exhibits a formal scaling of n-th power. | pb |
| self-consistent | Non-linearity in the model is solved through self-consistent optimization. | pb/qm |
| temperature | The model describes non-zero temperature effects. | pb/qm |
| excited-states | Access to electronically excited states. | pb/qm |
| spin-orbit coupling | The model accounts for spin-orbit coupling. | pb/qm |
| variational | The model follows the variational principle. | pb/qm |
| single-reference | The wavefunction is based on a single reference determinant. | pb/qm |
| multi-reference | The wavefunction is based on multiple reference determinants. | pb/qm |
| perturbative | The model contains elements of perturbation theory. | pb/qm/abin |
While the unit model pertains to the accuracy of a computational simulation, the method concerns its precision. The CateCom collection, therefore, includes a method schema for parameters concerning the computational methodology, such as convergence thresholds or hyperparameters (machine learning). As methods are closely related to models, method schemas are part of both unit models and compound models. The method schema has three main attributes, associated with the method parameters, method data, and precision. The parameters attribute holds a list of annotated control variables, which apart from the central key-value pair also encompass a categorization keyword and, if applicable, a definition of the value's unit. The method data attribute contains other input variables which may require additional files, such as user-generated pseudopotentials or basis sets.
In principle, the method schema contains all relevant information for the precision of a given choice of model and material. If one is able to formulate suitable scoring functions, the precision parameters can be turned into numeric features for a regression model. In conjunction with other factors, such as simulation time or memory usage, it would be highly desirable to predict an optimal model/method for a given material-property combination.
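As a minimal illustration of how precision parameters might be turned into a numeric score, the sketch below applies a per-parameter scoring function and averages the results. The parameter name, the saturation value, and the scoring rule itself are hypothetical placeholders, not part of CateCom.

```python
# Hypothetical precision-scoring sketch: map precision-related method
# parameters to a single numeric score via per-parameter scoring
# functions. Cutoff names and reference values are illustrative only.

def score_cutoff(value, converged=80.0):
    """Score a plane-wave cutoff on [0, 1], saturating at an assumed
    converged reference value."""
    return min(value / converged, 1.0)

# Registry of parameters we know how to score (assumed, not exhaustive).
scoring_functions = {"ecutwfc": score_cutoff}

def precision_score(parameters):
    """Average the scores of all parameters with a known scoring
    function; return None if nothing is scorable."""
    scores = [scoring_functions[p["key"]](p["value"])
              for p in parameters if p["key"] in scoring_functions]
    return sum(scores) / len(scores) if scores else None

params = [{"key": "ecutwfc", "value": 40.0},
          {"key": "occupations", "value": "smearing"}]
assert precision_score(params) == 0.5
```

Such scores, together with resource-usage features, could serve as inputs to the regression model envisioned above.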
2.2.3 Other Entities
In the following, we briefly outline other notable ESSE data schemas. A more detailed definition can be found in Ref. 8. For materials data to be searchable, traceable, and reproducible, it is crucial to have a concise and informative way to describe materials and their properties. The material schema comprises descriptive properties that uniquely specify a material, such as Bravais lattice vectors and the unit cell basis. Note that the material schema is not limited to periodic systems and will also support molecular descriptors in an upcoming release. A workflow defines the logical composition of simulation tasks that derive from one or several simulation engines or may take other forms such as Python scripts. The workflow as we define it is also hierarchically organized in three consecutive levels (from top to bottom): workflow, subworkflow, and workflow unit. In the workflow schema the logical composition is represented in terms of a directed acyclic graph (DAG), whereby each node is a workflow or subworkflow. A workflow may contain several subworkflows or other workflows. The organization of the simulation results is managed by the properties schema. Apart from the property data, this schema assigns a property group for easier access/findability and includes the unit of the property (if applicable).
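To make the three-level workflow hierarchy concrete, the following Python sketch nests workflow units inside subworkflows inside a workflow and flattens them into execution order. All field names are illustrative assumptions rather than the actual ESSE schema, and the sketch uses a simple linear ordering instead of a full DAG.

```python
# Minimal sketch of the workflow -> subworkflow -> workflow unit
# hierarchy. Field names are illustrative, not the actual ESSE schema;
# pw.x and bands.x are Quantum ESPRESSO executables used as examples.

workflow = {
    "name": "band-structure",
    "subworkflows": [
        {"name": "pw-scf",
         "units": [{"type": "execution", "name": "pw.x"}]},
        {"name": "bands",
         "units": [{"type": "execution", "name": "bands.x"}]},
    ],
}

def list_units(workflow):
    """Flatten the hierarchy into an ordered list of workflow units
    (a stand-in for topologically sorting the real DAG)."""
    return [u for sw in workflow["subworkflows"] for u in sw["units"]]

assert [u["name"] for u in list_units(workflow)] == ["pw.x", "bands.x"]
```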
2.3 Classification Rules
Since the classification and hierarchy of models involve some arbitrariness, we propose a set of classification rules to guide the categorization of hybrid models or edge cases. Although the rules listed below (Table 2) only pertain to a part of the categorization tree, they are intended to demonstrate how specific cases can be distinguished and assigned to a category. Instead of aiming at a final set of categorization rules that would allow each computational model to be uniquely classified (and without discussing whether such an approach is even possible), we suggest the readers consider our approach below as a stepping stone toward a practically applicable implementation.
As an example, let us consider the Quantum Monte Carlo (QMC) model - an approach that finds highly accurate solutions to the quantum many-body problem and is often used to study materials and molecular systems.24, 4 Due to its stochastic foundation, QMC results involve a quantifiable random error. Although stochastic sampling is an important component of the model, the aim of most QMC models is to solve for the ground-state wave function (or density matrix).24 As such, the model is assigned to the ab initio wavefunction model category (abin). This example prompts another guideline for the categorization of computational models: in case of ambiguity, one should categorize a model based on its objective (e.g. solving the Schrödinger equation) rather than its components or derivation.
Another guideline concerns the relationship of categorization tiers and explicit realizations, i.e. instances of unit models. As outlined above, tier I to tier III serve as identifiers for groups of models. To guarantee a consistent usage of the unit model object, it should thus be avoided to equate a unit model instance with one of these three tiers.
| Rule | Category | Description |
|---|---|---|
| 1 | pb | The model is based on physical laws. |
| 1.1 | pb/qm | The model depends on electronic coordinates or involves an electronic or nuclear wavefunction. |
| 1.1.1 | pb/qm/abin | The model is based on first-principle wavefunction approximations. |
| 1.1.2 | pb/qm/dft | The model is based on density functional theory. |
| 1.1.3 | pb/qm/semp | The model only treats valence electrons explicitly and/or involves parametrization of two-electron integrals. |
| 1.2 | pb/at | The model depends on atom (nuclear) coordinates only (without using wavefunctions). |
| 1.3 | pb/mes | The model involves a conflated representation of particles. |
| 2 | st | The model predicts results based on data rather than physical laws. |
| 2.1 | st/prob | The model involves randomness and cannot predict a result with an exact formula. Often the result is characterized by a mean and a distribution. |
| 2.2 | st/det | The model does not include randomness and always gives the same prediction. |
| 2.2.1 | st/det/lin | The model comprises a linear combination of features (or kernels). |
| 2.2.2 | st/det/nn | The model employs a neural network architecture. |
| 2.2.3 | st/det/dtr | The model is based on decision trees. |
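The rules of Table 2 can be read as a decision procedure. The following sketch implements them as a rule-based classifier over boolean descriptive flags; the flag names are our own illustrative shorthand and not part of the CateCom schema.

```python
# Sketch of a rule-based classifier following Table 2. The boolean flag
# names are illustrative shorthand, not CateCom schema fields.

def classify(flags):
    """Return the slash-delimited category path for a unit model
    described by boolean flags (missing flags default to False)."""
    if flags.get("physics_based"):
        path = ["pb"]
        if flags.get("electronic_coordinates"):          # rule 1.1
            path.append("qm")
            if flags.get("first_principles_wavefunction"):
                path.append("abin")                      # rule 1.1.1
            elif flags.get("density_functional"):
                path.append("dft")                       # rule 1.1.2
            else:
                path.append("semp")                      # rule 1.1.3
        elif flags.get("atomic_coordinates_only"):
            path.append("at")                            # rule 1.2
        else:
            path.append("mes")                           # rule 1.3
    else:                                                # rule 2
        path = ["st", "prob" if flags.get("randomness") else "det"]
    return "/".join(path)

# KS-DFT: physics-based, electronic coordinates, density functional
assert classify({"physics_based": True, "electronic_coordinates": True,
                 "density_functional": True}) == "pb/qm/dft"
# A classical force field: atomic coordinates only
assert classify({"physics_based": True,
                 "atomic_coordinates_only": True}) == "pb/at"
```

Note that, per the QMC guideline in the text, the flags should reflect a model's objective rather than its components, which is a judgment made before the classifier is applied.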
2.4 Entity Interoperation
This section describes how the entities presented in Sec. 2.2 interact to facilitate the research process. In particular, we present the relation between workflow, compound model, and properties. Starting with the workflow, a hierarchical organization similar to the model entity is used. A subworkflow, which is associated with one application or software package, contains one or several workflow units. There are various types of workflow units, each serving a different role, for instance, input/output operations or conditional operators (see Ref. 8 for a full list of types). The workflow unit shown in Fig. 3 is of the execution type, which refers to an executable of the simulation software package. An executable may require one or several input files, which are generated from templates. The flavor entity, which is part of the workflow unit, matches templates to an executable for the purpose of obtaining a selected set of properties. In a broader sense, the flavor therefore represents a specific group of simulations, for instance, single-point calculations or geometry optimizations.
The compound model defines the model and simulation parameters for a given subworkflow. As outlined in Sec. 2.2.1, each compound model comprises one or more unit models. Although a unit model is only associated with one workflow unit, the converse does not hold. Allowing one workflow unit to map to several unit models makes this framework flexible and consistent across different simulation packages. Each unit model, as well as the compound model itself, contains a method object, which stores information related to how a model is solved and is used to populate templates. Finally, associating properties with individual workflow units and unit models enables the user to monitor the progress of a given property across the compound model. Furthermore, since some unit models may not give rise to a certain property (cf. Unit model 3 and Property B in Fig. 3), the concept allows for convenient access to the last occurrence of the property (or one selected by other criteria).
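The "last occurrence" lookup described above can be sketched as a reverse scan over the ordered unit models. The unit model identifiers and property names below are illustrative, mirroring the Property B situation of Fig. 3 where one unit model does not produce the property.

```python
# Sketch of retrieving the last occurrence of a property across the
# ordered unit models of a compound model. Identifiers and property
# names are illustrative.

unit_models = [
    {"flowchartId": "um-1", "properties": ["total_energy", "band_gap"]},
    {"flowchartId": "um-2", "properties": ["band_gap"]},
    {"flowchartId": "um-3", "properties": ["total_energy"]},  # no band_gap
]

def last_occurrence(unit_models, prop):
    """Return the flowchartId of the last unit model producing `prop`,
    or None if no unit model produces it."""
    for um in reversed(unit_models):
        if prop in um["properties"]:
            return um["flowchartId"]
    return None

assert last_occurrence(unit_models, "band_gap") == "um-2"
assert last_occurrence(unit_models, "dielectric_tensor") is None
```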
In the present section, we elaborate on the above-introduced data structures for Unit Model, Compound Model, and Method by means of common-use examples. For the sake of brevity, only a few examples are shown, and we refer the reader to the ESSE repository21 for an extensive collection of examples. It should be noted that the recurring key "_id" is not a component of the CateCom data structure but pertains to the document-based database software; it is thus omitted in the discussion below.
3.1 Unit Models
Kohn-Sham Density Functional Theory (KS-DFT)28, 35 is a widely used quantum-mechanical model in materials science as well as molecular science. As the solution of KS-DFT is often used as an input for a subsequent model, for instance perturbation theory26 or as an alternative reference determinant49 in ab initio wavefunction models, it is well suited to be represented as a unit model. Listing 1 shows the unit model data structure for a generalized gradient approximation (GGA) KS-DFT model using the Perdew-Burke-Ernzerhof (PBE)45 exchange-correlation functional. An example of the range-separated hybrid functional HSE0639 is presented in the appendix (see Listing A.1). According to the CateCom approach introduced in Sec. 2.2.1, the three tiers for KS-DFT are physics-based, quantum-mechanical, and density functional theory, respectively. Since there exist multiple realizations of DFT (for instance orbital-free DFT57, 60), the type field further specifies the variation of DFT. In the data structure each tier (as well as type and subtype if applicable) is mapped to a simple object which contains a human-readable descriptor (name) and a machine-readable token (slug). The annotation fields introduced in Sec. 2.2.1 (augmentation, modifier, and tag) give further descriptive information about the unit model and facilitate the search for unit models. In addition, references to the literature can be given using the reference field (left empty in Listing 1 for brevity). Each unit model includes a so-called flowchartId field, which is used to uniquely identify a unit model within a compound model and which serves as a reference for representing the compound model as a directed acyclic graph (DAG).
Apart from the categorization and annotation fields, the CateCom approach also supports additional fields that are exclusive to a certain unit model. For instance, the multitude of density functional approximations (DFA) warrants a separate key (termed "functional"), which captures the different categories of DFAs. The functional object contains identifier fields name and slug, which include the commonly used approximations and acronyms for DFAs. Many exchange-correlation functionals comprise several components (here referred to as unit functionals), for instance, separate approximations for exchange and correlation as well as a fraction of exact exchange (hybrid functionals). The functional data structure lists these contributions under the key components. Each component is characterized by nominal descriptors (name and slug), the type of functional component (vide infra), and the fraction with which the component enters the model. Besides the two unit functional types presented in Listing 1, a unit functional may adopt the type of a non-separable exchange-correlation functional (e.g. GAM61), a kinetic energy functional (e.g. Thomas-Fermi54, 23), or a non-local correlation functional such as VV1056. The method field in this example is left empty as it will be discussed separately in the next section.
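As a hedged illustration of the functional object with unit-functional components, the sketch below encodes a hybrid functional with per-component fractions. The exact keys in ESSE may differ; the fractions shown are the textbook PBE0 values (25% exact exchange), used purely for illustration.

```python
# Hedged sketch of the `functional` object with unit-functional
# components. Keys are illustrative and may differ from ESSE;
# fractions follow the textbook PBE0 composition.

functional = {
    "name": "PBE0", "slug": "pbe0",
    "components": [
        {"name": "PBE exchange", "slug": "pbe_x",
         "type": "exchange", "fraction": 0.75},
        {"name": "exact exchange", "slug": "hf_x",
         "type": "exact_exchange", "fraction": 0.25},
        {"name": "PBE correlation", "slug": "pbe_c",
         "type": "correlation", "fraction": 1.0},
    ],
}

def exchange_fraction(functional):
    """Sum the fractions of all exchange-type components as a simple
    consistency check (should total 1.0 for a conventional hybrid)."""
    return sum(c["fraction"] for c in functional["components"]
               if c["type"].endswith("exchange"))

assert exchange_fraction(functional) == 1.0
```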
As mentioned in Sec. 2.2.2, each unit model (and compound model) comprises method data that pertains to the precision of the computational simulation. For the above example of KS-DFT, this data holds, among others, information about the employed basis (e.g. plane-wave energy cutoff) and integrals (e.g. k-point grid size). The method data structure is organized as follows. First, a simple categorization is given by the two required fields type and subtype. The type/subtype categorization for the methods is a preliminary solution, which, if the need arises, will be replaced by a more elaborate system comparable to the unit model categorization. In the specific example of Listing 2, the given type defines the use of plane waves in combination with pseudopotentials, while the subtype further specifies the use of ultra-soft pseudopotentials (us).
The method data structure also holds a list of non-default input variables in the parameters field. Each parameter is given in the form of an annotated key-value pair containing the name of the input variable (key) as it appears in the input file, its value, the corresponding categories (vide infra) and, if applicable, the unit of the value. As each flavor (cf. Sec. 2.4) is associated with a set of default input variables, the parameters field only needs to store input variables which deviate from the default values. The parameters in Listing 2 stem from a plane-wave DFT calculation employing the Quantum ESPRESSO software package.25 In particular, they define the kinetic energy cutoff for the charge density (ecutrho) and the wavefunction (ecutwfc) as well as the approach for sampling the Brillouin zone (occupations). Parameters labeled with the precision category are considered to influence the precision of the corresponding unit model and thus fulfill a special role. For example, the precision score (cf. Sec. 2.2.2) is calculated based on these parameters. For fast access, the precision field collects the names of these input parameters. The data field stores additional data specific to the method. For example, in the case of the plane-wave pseudopotential method, the data field contains the pseudopotentials themselves (pseudo). In addition, data contains a keyword (searchText) for filtering or searching the data attribute.
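The structure described above can be sketched as follows: annotated parameters with a `precision` category label, and a fast-access `precision` field derived from them. The exact keys are illustrative and loosely follow the description of Listing 2 (ecutwfc, ecutrho, occupations are Quantum ESPRESSO input variables).

```python
# Sketch of the method data structure: annotated parameters plus a
# `precision` field collecting the names of precision-labeled
# parameters. Keys are illustrative, loosely following Listing 2.

parameters = [
    {"key": "ecutwfc", "value": 40.0, "units": "Ry",
     "categories": ["precision"]},
    {"key": "ecutrho", "value": 200.0, "units": "Ry",
     "categories": ["precision"]},
    {"key": "occupations", "value": "smearing", "categories": []},
]

method = {
    "type": "pseudopotential", "subtype": "us",
    "parameters": parameters,
    # fast-access list of precision-relevant parameter names
    "precision": [p["key"] for p in parameters
                  if "precision" in p["categories"]],
}

assert method["precision"] == ["ecutwfc", "ecutrho"]
```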
3.3 Compound Models
Following the definition of the CateCom schema, we illustrate its practical use by an example corresponding to an established model for materials. In addition, the ESSE repository contains an extensive set of examples covering classical mechanics and machine learning models. Due to the variety of unit models, multi-level models, such as the combined quantum mechanical/molecular mechanical (QM/MM) approach51, can also be realized as compound models.
3.3.1 DFT+GW Model
Although DFT is arguably one of the most popular electronic structure models, its deficiencies inhibit an accurate simulation of some experiments, for instance, photoemission spectroscopy.55 The GW approximation provides a way to improve upon the single-particle states obtained from DFT in a perturbative fashion.26 While the GW approximation allows for an accurate description of "charged excitations", i.e. electronic excitations whereby an electron is added or removed from the N-electron system, neutral excitations, which preserve the number of electrons in the system, can be described using the Bethe-Salpeter equation (BSE).9
The starting points for the GW approximation are the eigenfunctions and eigenvalues of a mean-field Hamiltonian (Hartree-Fock or KS-DFT). The dynamically screened Coulomb potential and the single-particle energy levels obtained from GW, in turn, are input quantities for the calculation of optical excitations using the BSE. As such, the cascade of DFT, GW, and BSE is well suited to be expressed in terms of unit models.
Turning to the compound model data structure (Listing 3), a slightly different object composition can be seen. Since most of the information pertaining to the overall model is stored within the unit model data structures, the compound model does not need to repeat the information in its own data structure. Consequently, the compound model holds the arrangement of unit models (modelGraph) and a global method object (method). The modelGraph field contains a list of nodes, each representing a unit model. Each node contains the necessary fields to construct a directed acyclic graph, i.e. a unique identifier (flowchartId), a pointer to the next object (next), and a boolean indicator of the first node (head). In addition, a human-readable (name) and a machine-readable (slug) label are included. Finally, each unit model node also maps to a workflow unit by means of the workflowUnitId key, which does not necessarily have to be unique to each node. For instance, in the above example, three unit model nodes are mapped to two workflow units. A possible scenario for this is calculating the KS-DFT solution with one software package (e.g. Quantum ESPRESSO) and subsequently applying GW and BSE using another (e.g. BerkeleyGW17). Just like the unit model, the compound model data structure also includes a method key, which refers to a global method configuration.
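The node fields described above can be exercised with a small sketch that walks the modelGraph from the head node by following the `next` pointers, reproducing the DFT-GW-BSE cascade. The identifiers and workflow-unit ids are illustrative.

```python
# Sketch of walking the modelGraph using the head/next/flowchartId
# fields described above. Identifiers are illustrative; note the
# deliberate out-of-order list to show that order comes from pointers.

nodes = [
    {"flowchartId": "gw", "next": "bse", "head": False,
     "name": "GW approximation", "workflowUnitId": "wu-2"},
    {"flowchartId": "dft", "next": "gw", "head": True,
     "name": "KS-DFT", "workflowUnitId": "wu-1"},
    {"flowchartId": "bse", "head": False,
     "name": "BSE", "workflowUnitId": "wu-2"},
]

def linearize(nodes):
    """Order the nodes by following `next` pointers from the head."""
    by_id = {n["flowchartId"]: n for n in nodes}
    node = next(n for n in nodes if n["head"])
    order = []
    while node is not None:
        order.append(node["flowchartId"])
        node = by_id.get(node.get("next"))
    return order

assert linearize(nodes) == ["dft", "gw", "bse"]
# Three unit models map onto two workflow units (cf. the text above):
assert len({n["workflowUnitId"] for n in nodes}) == 2
```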
The CateCom approach laid out in previous sections introduces a model data structure combined with a systematic categorization of computational models. The model data structure is designed to be composed of one or more reusable components (unit models), each of which is assigned a model category. The model and other Entities representing the computational Workflow, Properties, and Materials altogether form the research-work-related data and metadata. Our approach is not meant as a final "polished" solution, but rather as a proof-of-concept, yet practically deployable, implementation. Admittedly, it is largely limited to physics-based models in the current implementation and is, perhaps, heavily biased toward atomistic and nanoscale simulations. The uniqueness of categorization - whether to enforce it and how - is another topic that requires further clarification. Without uniqueness, the chosen categorization scheme can be seen as one linear realization of a non-linear "model graph". The rest of this section discusses several important aspects of CateCom.
4.1 The Material-Model-Property Categorization
4.1.1 Material, Model, and Property relationship
For any practical application, the fidelity of the modeling approaches can usually only be established within a certain class of materials and their associated properties. For example, it is well known that conventional Density Functional Theory significantly underestimates the electronic band gap values in semiconductors, while providing adequate predictions for other properties such as lattice constants and/or vibrational spectra. Therefore, a certain fidelity metric predicting the quality of a particular model would only stand with respect to certain material and property types.
4.1.2 Material and Property Categorization
To facilitate data-driven science, a coupled approach is needed in which materials (and chemicals) are categorized along with their derived properties. This way, one can construct associative relationships that assist in identifying the most successful combinations of Materials, Workflow/Model, and Properties. Following the example in the previous section, such an approach should be able to identify, for instance, that for III-V semiconductors a Model/Workflow containing both Density Functional Theory and the GW Approximation provides higher fidelity than Density Functional Theory alone. Although the exact nature of such categorization is a topic for a separate discussion, and the categorizations of the individual entities can be interdependent, we believe that starting with the model categorization provides a viable and practically useful first step.
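Such an associative relationship can be sketched as a lookup from (material category, property) pairs to model fidelity records. The entries and scores below are purely illustrative assumptions, not measured fidelities:

```python
# Hypothetical associative map: (material category, property) -> fidelity records.
# Scores are invented for illustration; in practice they would be derived from
# accumulated Material-Model-Property data.
fidelity_map = {
    ("III-V semiconductor", "band_gap"): [
        {"model": ["ksdft"], "fidelity": 0.4},
        {"model": ["ksdft", "gw"], "fidelity": 0.9},
    ],
    ("III-V semiconductor", "lattice_constant"): [
        {"model": ["ksdft"], "fidelity": 0.85},
    ],
}

def best_model(material_category, property_name):
    """Return the highest-fidelity model combination known for the pair."""
    records = fidelity_map.get((material_category, property_name), [])
    if not records:
        return None  # no data: cannot recommend a model
    return max(records, key=lambda r: r["fidelity"])["model"]
```

Under these assumed scores, the lookup recovers the DFT+GW recommendation for III-V band gaps while still preferring plain DFT for lattice constants.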
4.2 FAIR Principles
There have been many efforts to collect and systematically organize research data into large publicly available datasets.14, 31, 50, 10, 34, 18, 13, 29 Simultaneously, a new paradigm for scientific discovery, data-driven science, has emerged, which aims to detect patterns or anomalies in such datasets.27, 18, 19 In a cooperative effort including representatives from academia, industry, funding agencies, and scholarly publishers, the FAIR guidelines58, 59 (findable, accessible, interoperable, and reusable) were developed to enhance data reusability. The following subsections demonstrate how the CateCom approach ties in with the FAIR principles.
According to the FAIR guiding principles58, findability involves the use of globally unique and persistent identifiers. As described in Sec. 3.1, each unit model in the CateCom scheme carries such a unique identifier in the form of the flowchartId. Additionally, unit models and method parameters are enriched with tags metadata, which facilitates searching and filtering for certain properties. In this way, associations can be made between unit models that are not represented in the CateCom tree. For instance, the scaling-power-3 tag may be used to filter all unit models that formally exhibit cubic scaling, even if they are located in different categorization branches.
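The tag-based filtering can be sketched in a few lines. The unit models below are hypothetical placeholders; only the scaling-power-3 tag follows the example in the text:

```python
# Illustrative unit models with tags metadata; identifiers are invented.
unit_models = [
    {"flowchartId": "ksdft-1", "slug": "ksdft", "tags": ["scaling-power-3"]},
    {"flowchartId": "hf-1",    "slug": "hf",    "tags": ["scaling-power-4"]},
    {"flowchartId": "dftb-1",  "slug": "dftb",  "tags": ["scaling-power-3"]},
]

def filter_by_tag(models, tag):
    """Select unit models carrying a tag, regardless of categorization branch."""
    return [m["flowchartId"] for m in models if tag in m.get("tags", [])]
```

The filter cuts across the categorization tree: the two cubic-scaling models are returned together even though they would sit in different branches.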
Interoperability encompasses integration with other data and cross-functional cooperation with applications. CateCom Unit Models do not depend on specific software implementations of the models, so that a model can, in principle, be associated with any software package that implements it. Furthermore, storing the entities as JSON objects has the advantage that many tools directly accept this format or can convert it to another (see also Sec. 4.4). The software implementation mentioned in the previous section provides additional opportunities for building interoperable systems.
The CateCom scheme also addresses the reusability of data. In particular, the partitioning of models into unit models serves the purpose of reusing the components that make up a model. A good example of this is many-body perturbation theory models, which generally require the solution of an unperturbed Hamiltonian. Consequently, the Unit Model corresponding to the unperturbed Hamiltonian can be combined with different perturbation theory models or, in the case of Hartree-Fock theory, with post-HF wavefunction models such as configuration interaction (CI). Furthermore, the storage of the method data plays a crucial part in recording the provenance of the final property data.
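This reuse can be sketched by chaining one unperturbed-Hamiltonian unit model with different follow-up treatments. The identifiers below are hypothetical, assuming the modelGraph node fields described earlier:

```python
# Hypothetical sketch: one reusable Hartree-Fock unit model combined with
# different post-HF treatments via minimal two-node compound models.
unperturbed = {"slug": "hf", "name": "Hartree-Fock"}

def compose(base, follow_up):
    """Chain two unit models into a minimal two-node compound model graph."""
    return {"modelGraph": [
        {**base, "flowchartId": base["slug"] + "-1", "head": True,
         "next": follow_up["slug"] + "-1"},
        {**follow_up, "flowchartId": follow_up["slug"] + "-1", "head": False,
         "next": None},
    ]}

mp2_model = compose(unperturbed, {"slug": "mp2", "name": "MP2"})
ci_model = compose(unperturbed, {"slug": "ci", "name": "Configuration Interaction"})
```

The same base dictionary appears, unmodified, in both compound models; only the follow-up node differs.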
4.3 Predictive AI/ML
4.3.1 Avoidance of Duplicates
With the ability to quantify and store metadata about digital approaches comes the ability to avoid repetition. When a particular workflow/model/property combination (for example, pseudopotential Density Functional Theory with certain wavefunction and charge density cutoffs) has been applied to a specific material, our proposed data management model provides a way to generate a unique fingerprint. Based on such a fingerprint, further duplicate attempts can be avoided, improving the efficiency of the research work.
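One possible realization of such a fingerprint, assuming the combination is stored as a JSON-serializable object, is to hash a canonical serialization (sorted keys, fixed separators) so that logically identical entries always hash identically. The entry fields below are illustrative, not a fixed schema:

```python
import hashlib
import json

def fingerprint(entry):
    """Deterministic fingerprint of a material/workflow/model/property entry.

    Canonical JSON (sorted keys, fixed separators) makes the hash independent
    of key ordering; SHA-256 is one reasonable choice of digest.
    """
    canonical = json.dumps(entry, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

seen = set()  # registry of previously computed fingerprints

def is_duplicate(entry):
    """Record the entry's fingerprint; report whether it was seen before."""
    fp = fingerprint(entry)
    if fp in seen:
        return True
    seen.add(fp)
    return False
```

Because the serialization is canonical, resubmitting the same combination with its keys in a different order still produces the same fingerprint and is flagged as a duplicate.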
4.3.2 The AI ”Chemist-in-the-cloud”
The ultimate goal of the categorization described here is to enable the creation of an AI-powered digital computational chemist/materials scientist ("brain") able to suggest the best model/method combinations for characterizing materials. The complexity of materials science and chemistry is to a large degree defined by the diversity of the problem sets and the parametric conditions associated with them. Although computational techniques have been around for over half a century, the ability to apply them successfully with high fidelity still has a significant "art" component, requiring very specialized knowledge limited to a select group of scientists. Even this select group, in practical applications, often relies on "intuition" derived through years of experience with specific problems rather than on purely deterministic reasoning. With the help of the proposed categorization scheme, and given sufficient training data based on expert decisions, such intuition can instead be represented by data-driven AI/ML approaches.
4.4 Community, Ecosystem, and Future Outlook
4.4.1 The Global Digital Ecosystem for Materials R&D
The categorization framework and associated ontologies both represent critical steps in the implementation of a global digital ecosystem for materials R&D. Having data standards is a fundamental step in the design and implementation of the data and software infrastructure for such an ecosystem. Object-oriented design for the entities and data structures naturally enables modularity when building the software components of such an ecosystem, and greatly streamlines its implementation and long-term maintenance. A version of the present categorization framework has been previously deployed as part of an online software platform22, with applications demonstrated for multiple use cases, including metallic alloys6, electronic properties of semiconductors16, 15, vibrational properties of materials7, adsorption and catalysis in zeolites12, adhesive strength of composite materials37, and beyond. Materials R&D spans a complex and multi-dimensional landscape and requires an extremely large variety of characterization data at multiple time and length scales. Once obtained, the data must be stored and managed efficiently. As more materials research involves digital handling of data, ontologies and categorization become increasingly important. This will facilitate the availability of ever-increasing amounts of materials data on the web, with contributions from the global community.
4.4.2 Community Contributions
Simulation scientists can resort to a myriad of statistical and physics-based models, and the number of models is constantly growing. This circumstance makes a systematic mapping of the "model landscape" difficult for a small team, since profound knowledge of a model is required in order to systematically arrange its properties and variants. Expert knowledge is also indispensable for identifying reusable components of these models and examining edge cases that may fit more than one category. Thus, a more effective approach is to involve the community of experts directly in the maintenance and expansion of the categorization scheme. To this end, we propose to follow the collaborative strategy typical of code development platforms such as GitHub. These platforms also allow interested contributors to discuss new features (e.g. a new category) and raise issues about existing ones. In practice, a contributor first obtains their own server-side clone ('fork') of the original repository (e.g. ESSE21). The implementation of new features, such as a new unit model, is then carried out in a feature branch located in the cloned repository. Once the new feature is ready, the contributor issues a request for the integration of the new feature ('pull request') to the maintainer of the original repository. At this stage, details of the new feature can be discussed and modified until the maintainer accepts the incoming changes.
4.4.3 Interfacing with other approaches
Of course, the task of developing a global digital ecosystem for materials research and development cannot be accomplished without involving a global community and interfacing with other efforts. Despite recent comprehensive approaches2, 41, it is still common for the materials science community to develop standards tailored to a specific sub-branch of research. Such an "artisanal"47 approach has led to several competing standards with a relatively small impact on the field. Our goal is to provide a common denominator allowing the key contributors to realize their ambitions while at the same time facilitating the level of quality required for practical real-world applications.
CateCom has a natural connection to the concept of an ontology, and the similarities facilitate the interoperation of CateCom with ontologies defining model classes, for instance through ontology-based data access (OBDA).11, 3, 36 Since ontology vocabularies are expressed in different formats (RDF, OWL, etc.), OBDA usually requires an access interface translating queries and responses.3, 36 Ontologies may also be used for semantic annotation, i.e. metadata enrichment.
4.4.4 Example Interfaces with other approaches
The Materials Design Ontology (MDO) defines a Computational Method class which, as of this writing, is limited to density functional theory (DFT) and Hartree-Fock (HF) theory. Nonetheless, the definition of these models shares commonalities with the CateCom scheme. For instance, the DFT class of MDO has an Exchange Correlation Energy Functional property implementing the most common groups of density functional approximations, which are also supported by the KS-DFT unit model in CateCom. Part of the CateCom unit model (in JSON format) can thus be mapped to the RDF format in order to be used with MDO. A similar mapping to RDF using a SPARQL-Generate script40 has been described in Ref. 41. Furthermore, a model class is also defined as part of the Elementary Multiperspective Material Ontology (EMMO).1 Although specific models (e.g. DFT) are not explicitly represented, its subclasses are organized similarly to tiers I and II of the CateCom scheme; for instance, DataBasedModel and PhysicsBasedModel exactly correspond to the st and pb categories (cf. Figure 2). Interoperation of EMMO and CateCom could, for instance, be achieved by annotation or translation as described above.
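A JSON-to-RDF mapping of the kind mentioned above can be sketched as the emission of subject-predicate-object triples. The namespace URI, class, and property names below are placeholders, not the actual MDO vocabulary:

```python
# Sketch of mapping part of a CateCom-style unit model (JSON) to RDF triples.
# The namespace and term names are hypothetical stand-ins for MDO terms.
MDO = "https://example.org/mdo#"  # placeholder namespace, not the real MDO IRI

unit_model = {
    "slug": "ksdft",
    "method": {"functional": "PBE"},
}

def to_triples(model, subject="https://example.org/calc/1"):
    """Emit (subject, predicate, object) triples for a KS-DFT unit model."""
    triples = [(subject, "rdf:type", MDO + "DensityFunctionalTheory")]
    functional = model.get("method", {}).get("functional")
    if functional:
        triples.append((subject, MDO + "hasXCFunctional", functional))
    return triples
```

A real mapping would use the published MDO IRIs and an RDF library for serialization; the point here is only that the JSON fields translate mechanically into triples.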
In practical terms, any interfacing most likely involves a conversion between two database schemas. The Novel Materials Discovery (NOMAD18) metainfo schema defines the structure of material-science-related data. The schema contains a very extensive list of entities and properties, such that the mapping is not limited to CateCom unit models but can in principle extend to other ESSE21 Entities (e.g. Method, Property). Apart from a one-to-one mapping, one could also populate the CateCom data structures based on a descriptive string. For instance, the JARVIS-DFT database13 contains a functional property with enough information to be converted to a ksdft CateCom model object. Such object generation might not be fully complete, e.g. neglecting the method object entirely. Other approaches, such as OPTIMADE2, provide a universal application programming interface (API) to access material data across several databases. The OPTIMADE specification always includes a structure attribute, whereas properties other than structural or chemical information are provider-specific. As a consequence, model-related metadata (and therefore the mapping to CateCom) may or may not be available.
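Populating a model object from a descriptive string can be sketched as a simple lookup. The functional-to-family table and the resulting object fields are illustrative assumptions, not the actual JARVIS-DFT or CateCom vocabularies:

```python
# Hypothetical sketch: convert a descriptive 'functional' string (as found in
# a JARVIS-DFT-like entry) into a partial ksdft-style model object.
FUNCTIONAL_FAMILIES = {  # illustrative mapping, not an exhaustive table
    "PBE": "gga",
    "LDA": "lda",
    "HSE06": "hybrid",
}

def model_from_functional(functional):
    """Build a partial model object from a functional name, or None."""
    family = FUNCTIONAL_FAMILIES.get(functional)
    if family is None:
        return None  # string alone carries too little information
    # The 'method' object is deliberately left empty: a descriptive string
    # cannot recover cutoffs, pseudopotentials, or other method parameters.
    return {"type": "ksdft", "subtype": family,
            "functional": functional, "method": {}}
```

The empty method object makes the incompleteness noted in the text explicit: the generated model is categorized, but its provenance-relevant parameters remain unknown.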
4.4.5 Future Outlook
In our vision, CateCom represents a fundamental building block in facilitating mainstream data-driven research in materials science and chemistry. Our goal is to engage a large community of people possessing specialized knowledge about materials and chemicals in digital work, resulting in the creation of novel AI/ML techniques. Community effort is critical for obtaining the "critical mass" of data and creating the network effects that allow the effort to be sustained in the long term. To understand the future outlook, we draw analogies with the Computer-Aided Design and Electronic Design Automation industries. In both, as research work transitioned from exploratory and science-centric to practically applied and engineering-focused, the number of data representation standards consolidated to 3-5. These consolidated standards emerged behind the software development efforts amassing the largest user communities, such as Autodesk, Synopsys, Dassault, etc. We expect a similar progression of events for data-driven digital materials R&D in the near future.
Apart from this general goal, there are also more technical questions that could be addressed in future work. More specifically, the current approach allows the combination of unit models into compound models without any restrictions. In order to prevent improper combinations, a strategy for interfacing unit models is needed. A potential solution would be to track the input and output quantities of each model, such that combinations can be evaluated in terms of the intersection of input and output quantities. Another point concerns the representation of time-dependent simulations. Future work should thus examine whether unit models may include a "time-propagation" operator or whether a compound model analog (e.g. a "dynamical compound model") would be a viable option.
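The proposed input/output check can be sketched directly: each unit model declares the quantities it consumes and produces (the quantity names below are illustrative), and a combination is admissible when the downstream inputs are covered by the upstream outputs:

```python
# Hypothetical interface check for unit model combinations; quantity names
# are invented for illustration.
ksdft = {"slug": "ksdft", "inputs": {"structure"},
         "outputs": {"ks_orbitals", "charge_density"}}
gw = {"slug": "gw", "inputs": {"ks_orbitals"}, "outputs": {"qp_energies"}}
md = {"slug": "md", "inputs": {"force_field"}, "outputs": {"trajectory"}}

def compatible(upstream, downstream):
    """True if every input of 'downstream' appears among 'upstream' outputs."""
    return downstream["inputs"] <= upstream["outputs"]
```

Under these declarations, KS-DFT followed by GW passes the check, while KS-DFT followed by a force-field MD model is rejected, because the required force_field input is never produced.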
Regarding the categorization, it would be desirable to extend an existing ontology or create a new ontology for the multitude of computational models spanning both physics-based and data-driven models. Such an ontology would be helpful for building a knowledge graph of materials science research containing the semantic relationships between material, model, and property entities.
We introduced an approach for the categorization of computational models in conjunction with database schemas representing the Models and Methods. The proposed data-centric categorization scheme follows an object-oriented design concept, whereby a given model is expressed in terms of reusable, indivisible components (Unit Models). This modular unit model approach allows for a consistent description of model properties across software packages and is able to describe multi-level models, such as QM/MM. The data structures derived from the proposed schemas have been illustrated with examples such as Density Functional Theory (DFT) and the GW Approximation. It has been demonstrated how the CateCom scheme complies with the FAIR guiding principles. In particular, possible mechanisms for the interoperation of CateCom with other approaches of the digital materials science ecosystem have been presented. In order to manage edge cases and guide the addition of new categories, a set of categorization rules has been presented.
In its current state of development, CateCom represents a proof-of-concept with an emphasis on physics-based models. With the aim of leveraging expert knowledge, we discussed a community-driven approach for the extension of the CateCom scheme. Like many other categorization efforts, the CateCom scheme does not claim uniqueness with regard to the chosen categories. The organization of model data as presented herein offers several benefits, such as the transferability of a given model from one problem to another. The model categorization also allows for the generation of unique fingerprints, which facilitate the research process by avoiding duplicates. We share the current implementation of the categorization as part of an open-source online codebase21 and demonstrate some applications of the underlying data infrastructure in the online platform8.
The ideas expressed in the present manuscript build upon the Materials Genome Initiative30 and are designed to facilitate collaboration between materials scientists, chemists, and computer/data scientists to create, deploy, and analyze a set of curated methodologies for rapidly studying materials at multiple time and length scales. In our view, the present data convention can facilitate the next generation of computer-aided design tools and enable advanced R&D capabilities for the development of new kinds of products in critical industries, including semiconductors, photovoltaics, energy storage, oil & gas, specialty chemicals, aerospace, and automotive. It has the potential to transform the materials sector at large.
- Elementary Multiperspective Material Ontology. Date accessed: 2021-9-16.
- OPTIMADE, an API for exchanging materials data. Sci Data 8 (1), pp. 217.
- OntoMongo: ontology-based data access for NoSQL. ONTOBRAS.
- Quantum Monte Carlo and related approaches. Chem. Rev. 112 (1), pp. 263–288.
- Reproducibility: seek out stronger science. Nature 537 (7622), pp. 703–704.
- Large-scale high-throughput computer-aided discovery of advanced materials using cloud computing. Proceedings of the American Physical Society March Meeting 2017.
- Fast and accessible first-principles calculations of vibrational properties of materials. arxiv.org/abs/1808.10011.
- Data-centric online ecosystem for digital materials science.
- The Bethe-Salpeter equation formalism: from physics to chemistry. J. Phys. Chem. Lett. 11 (17), pp. 7371–7382.
- The AFLOW standard for high-throughput materials science calculations. Comput. Mater. Sci. 108, pp. 233–238.
- Ontology-based database access. In SEBD, pp. 324–331.
- Computing RPA adsorption enthalpies by machine learning thermodynamic perturbation theory. Journal of Chemical Theory and Computation 15 (11), pp. 6333–6342.
- The Joint Automated Repository for Various Integrated Simulations (JARVIS) for data-driven materials design. npj Computational Materials 6 (1), pp. 1–13.
- AFLOW: an automatic framework for high-throughput materials discovery. Comput. Mater. Sci. 58, pp. 218–226.
- Electronic properties of binary compounds with high throughput and high fidelity. arxiv.org/abs/1808.05325.
- Accessible computational materials design with high fidelity and high throughput. arxiv.org/abs/1807.05623.
- BerkeleyGW: a massively parallel computer package for the calculation of the quasiparticle and optical properties of materials and nanostructures. Comput. Phys. Commun. 183 (6), pp. 1269–1289.
- NOMAD: the FAIR concept for big data-driven materials science. MRS Bull. 43 (9), pp. 676–682.
- Big-data-driven materials science and its FAIR data infrastructure. Handbook of Materials Modeling, pp. 1–25.
- Localized atomic and molecular orbitals. Rev. Mod. Phys. 35 (3), pp. 457–464.
- (2021) Exabyte source of schemas and examples. Date accessed: 2021-9-16.
- (2015) Exabyte.io. Date accessed: 2021-9-24.
- Eine statistische Methode zur Bestimmung einiger Eigenschaften des Atoms und ihre Anwendung auf die Theorie des periodischen Systems der Elemente. Zeitschrift für Physik 48 (1), pp. 73–79.
- Quantum Monte Carlo simulations of solids. Rev. Mod. Phys. 73 (1), pp. 33–83.
- Quantum ESPRESSO toward the exascale. J. Chem. Phys. 152 (15), pp. 154105.
- The GW compendium: a practical guide to theoretical photoemission spectroscopy. Front. Chem. 7, pp. 377.
- The fourth paradigm: data-intensive scientific discovery. Vol. 1, Microsoft Research, Redmond, WA.
- Inhomogeneous electron gas. Phys. Rev. 136 (3B), pp. B864–B871.
- AiiDA 1.0, a scalable computational infrastructure for automated reproducible workflows and data provenance. Sci Data 7 (1), pp. 300.
- The Materials Genome Initiative. Date accessed: 2021-9-25.
- Commentary: The Materials Project: a materials genome approach to accelerating materials innovation. APL Materials 1 (1), pp. 011002.
- (2017) JSON Schema. Date accessed: 2021-9-16.
- Three-dimensional "Pople diagram". J. Phys. Chem. 94 (14), pp. 5435–5436.
- The Open Quantum Materials Database (OQMD): assessing the accuracy of DFT formation energies. npj Computational Materials 1 (1), pp. 1–15.
- Self-consistent equations including exchange and correlation effects. Phys. Rev. 140 (4A), pp. A1133–A1138.
- Ontology-based approaches to big data analytics. In Hard and Soft Computing for Artificial Intelligence, Multimedia and Security, pp. 355–365.
- Evaluation of the mechanical properties of carbon fiber/polymer resin interfaces by molecular simulation. Advanced Composite Materials 28 (6), pp. 639–652.
- An ontology and semantic web service for quantum chemistry calculations. J. Chem. Inf. Model. 59 (7), pp. 3154–3165.
- Influence of the exchange screening parameter on the performance of screened hybrid functionals. J. Chem. Phys. 125 (22), pp. 224106.
- A SPARQL extension for generating RDF from heterogeneous formats. In The Semantic Web, pp. 35–50.
- An ontology for the materials design domain. In The Semantic Web – ISWC 2020, Lecture Notes in Computer Science, Cham, Switzerland, pp. 212–227.
- All-atom empirical potential for molecular modeling and dynamics studies of proteins. J. Phys. Chem. B 102 (18), pp. 3586–3616.
- MolSSI/qcschema. Date accessed: 2021-9-16.
- (2012) OWL 2 Web Ontology Language document overview (second edition). https://www.w3.org/TR/owl2-overview/. Accessed: 2021-9-16.
- Generalized gradient approximation made simple. Phys. Rev. Lett. 77 (18), pp. 3865–3868.
- The semantics of Chemical Markup Language (CML) for computational chemistry: CompChem. J. Cheminform. 4 (1), pp. 15.
- AiiDA: automated interactive infrastructure and database for computational science. Comput. Mater. Sci. 111, pp. 218–230.
- Two-dimensional chart of quantum chemistry. J. Chem. Phys. 43 (10), pp. S229–S230.
- Third-order Møller-Plesset theory made more useful? The role of density functional theory orbitals. J. Chem. Theory Comput. 16 (12), pp. 7473–7489.
- Materials design and discovery with high-throughput density functional theory: the Open Quantum Materials Database (OQMD). JOM 65 (11), pp. 1501–1509.
- QM/MM methods for biomolecular systems. Angew. Chem. Int. Ed. Engl. 48 (7), pp. 1198–1229.
- Anatomy of relativistic energy corrections in light molecular systems. Mol. Phys. 99 (21), pp. 1769–1794.
- NanoParticle Ontology for cancer nanotechnology research. J. Biomed. Inform. 44 (1), pp. 59–74.
- The calculation of atomic fields. Math. Proc. Cambridge Philos. Soc. 23 (5), pp. 542–548.
- Quasiparticle self-consistent GW theory. Phys. Rev. Lett. 96 (22), pp. 226402.
- Nonlocal van der Waals density functional: the simpler the better. J. Chem. Phys. 133 (24), pp. 244103.
- Recent progress in orbital-free density functional theory. World Scientific.
- The FAIR guiding principles for scientific data management and stewardship. Sci Data 3, pp. 160018.
- A design framework and exemplar metrics for FAIRness. Sci Data 5, pp. 180118.
- Orbital-free density functional theory for materials research. J. Mater. Res. 33 (7), pp. 777–795.
- Nonseparable exchange-correlation functional for molecules, including homogeneous catalysis involving transition metals. Phys. Chem. Chem. Phys. 17 (18), pp. 12146–12160.