EngMeta – Metadata for Computational Engineering

05/04/2020
by Björn Schembera, et al.
HLRS

Computational engineering generates knowledge through the analysis and interpretation of research data, which is produced by computer simulation. Supercomputers produce huge amounts of research data, and to address a research question, many simulations are run over a large parameter space. Handling this data and keeping an overview therefore becomes a challenge. Data documentation is mostly handled by file and folder names in inflexible file systems, making it almost impossible for data to be findable, accessible, interoperable and hence reusable. To enable and improve a structured documentation of research data from computational engineering, we developed EngMeta as a metadata model. We built this model by incorporating existing standards for general descriptive and technical information and adding metadata fields for discipline-specific information, like the components and parameters of the simulated target system, and information about the research process, like the methods, software and computational environment used. In practical use, EngMeta functions as the descriptive core for an institutional repository. In order to reduce the burden of description on scientists, we have developed an approach for automatically extracting metadata from the output and log files of computer simulations. Through a qualitative analysis, we show that EngMeta fulfills the criteria of a good metadata model. Through a quantitative survey, we show that it meets the needs of engineering scientists.


1 Introduction

Figure 1: Scientific workflow in computational engineering.

The aim of computational engineering is the analysis of engineering problems with the help of numerical simulations. In molecular dynamics (one field of computational engineering), for example, a model of the trajectories of molecular systems can be used for the simulation of nanotubes, which have various application areas, e.g. medical applications. These simulations are performed on large computing systems such as supercomputers or clusters.

Typically, the scientific workflow in computational engineering is as depicted in figure 1: In the first phase of data production, the simulation run is defined in a job file, submitted to a scheduler and then executed on the compute nodes. During the runtime of the simulation, lots of data and additional output files (such as log files) are written to the parallel file system. After the production phase, the raw data produced in one or several simulations is prepared and analyzed in the data evaluation phase. When the analysis of the data is completed, the researchers interpret the results by visualizing them and drawing conclusions from them. In the last phase, the publication phase, the results are disseminated through scientific papers. Not every simulation results directly in a scientific publication. The massive increase in computational power has led to additional accuracy but also additional complexity of the simulations. Many more parameters have to be tested, and a lot of simulations are run merely to test the computational set-up. Therefore, there is an increasing demand to manage the associated data.

During the whole workflow, the state of research data management is quite poor (Schembera and Bönisch, 2017). In particular, good data documentation is missing that would allow the produced research data to be made FAIR (Wilkinson et al., 2016), which is the overall goal of research data management. Even though a lot of metadata models exist in general, none of them is suitable for the use case of computational engineering, which is why we developed EngMeta (Schembera and Iglezakis, 2019) as a tailored model of description for this area. The development was a joint effort of the University Library of Stuttgart and the High Performance Computing Center Stuttgart together with the Institute of Thermodynamics and Thermal Process Engineering and the Institute of Aerodynamics and Gas Dynamics of the University of Stuttgart. Since the first version of the metadata model, many refinements have been made. The model is implemented in a data repository and forms the structured target of an automated metadata extraction from simulation files. These three topics – an extended view on EngMeta, the automated metadata extraction and the integration in a repository – are covered in this paper, together with a qualitative and quantitative evaluation.

2 Requirements and Related Work for a Metadata Model for Engineering Applications

A relevant data description needs to include the features that allow the data to be findable, understandable and replicable in a discipline-specific context.

For finding the data, metadata has to include information beyond the classic standard descriptive metadata. Discipline-specific search criteria need to be included, such as information about the target system of the simulation, the variables, parameters and methods used, as well as the spatial or temporal resolution.

Understanding and hence reusing the data is a matter of information on the used software, the computational environment, the encoding and the format. Only with this information included in the data documentation can a researcher fully grasp what has been done and how the results of the research have been produced. This is an important requirement in terms of making science reproducible, and therefore making it transparent.

For the replication of the data, data provenance is an important facet of the metadata. Information about every processing step has to be included and technical metadata has to be added.

There are existing metadata models and standards for data in general, like DataCite (DataCite, 2017), the schema underlying the metadata for DOIs, or vocabularies like the W3C recommendation DCAT (Erickson and Maali, 2014). These standards provide description categories for citation data (title, author, publisher, dates), for subject indexing (subject, keywords) and for usage information (rights, licence, data type). However, they do not address the specifics of the engineering domain.

Discipline-specific models for computational engineering are hard to find. The Chemical Markup Language CML (https://www.xml-cml.org/, last checked June 5th, 2019) and especially its extension CMLComp (http://homepages.see.leeds.ac.uk/~earawa/CMLComp/index.html, last checked June 6th, 2019) offer one approach for computational chemistry and the simulation of molecules at the atomic scale. Even though some relevant elements, such as the computational environment, are captured by the model, it is far too specific to computational chemistry. The Molecular Simulation Markup Language (MSML) (Grunzke et al., 2014) builds on CML, but with an extended focus on molecular simulations. This metadata description is embedded into the MoSGrid system and serves as a workflow definition language for running the simulation and describing the outputs. Hence, a lot of manual work is involved, and the description is specific both to the workflow system and to molecular simulation.

What is missing in all these schemes are discipline-specific descriptions of the observed system and parameters of the observation itself as well as a possibility to track the provenance of the data with all relevant methods, utilities and parameters.

There are disciplines with elaborated and accepted metadata standards, like the DDI standard for the social sciences (Green and Humphrey, 2013) and the CERA-2.5 scheme (Lautenschlager et al., 1998) as a data-centric metadata model for climate research. The CERA model dates from 1998 and is characteristic of the climate sciences' early commitment to research data management and data description. The model combines discipline-specific metadata, such as the coverage of a climate phenomenon, with descriptive and process information.

3 The Metadata Model as a Core

Our process of deriving the metadata model started with considerations on the requirements of engineering researchers (Iglezakis and Schembera, 2018): What information is important when trying to find, understand and replicate engineering data? In the next step, we built an object model to represent this information.

3.1 Object Model

Figure 2: Model of the Objects underlying EngMeta

The object model, depicted in figure 2, is the very first step of developing a metadata model. It builds a common ground of understanding for all the relevant objects to incorporate into the metadata model. We developed this object model with the researchers from computational engineering by analyzing their scientific workflow. Even though the researchers came from distinct subject areas, both use computer simulations to investigate their research questions. Certain entities are relevant for both subjects and are representative of computational engineering research as a whole.

In simulation science, a data set is created after the simulation run has finished and marks the result data. The data set represents the simulated target system, or observed system, as an entity in the object model. The observed system is usually characterized by controlled and measured variables and parameters, consists of components and is defined by boundary conditions. In thermodynamics, for example, the components are molecules with force fields acting as relevant entities in the description of research data. The observation or simulation itself has a temporal and spatial resolution. Moreover, the simulation method is important for the researchers' understanding of the data. These entities form the discipline-specific metadata.

The data set has been generated or processed in a processing step. A processing step represents, for example, a simulation run, a post processing step or an analysis. The step is done within a computational environment. This environment entity may hold information on the used hardware. Moreover, the software used for the processing step builds its own entity, since it is critical for understanding the conducted research. The information from these entities together with the actors of the steps constitute the process metadata.

The data set may consist of multiple files with file attributes like name and type and may be equipped with a PID and a checksum per file. These entities compose the technical metadata.

Moreover, additional entities describe the data set from a descriptive point of view. These include related publications, relations to other data objects representing the context, a funding reference, the related project, a license, a title, a date, keywords, a description, related persons as well as the subject area of the research. In addition, a worked entity is needed to document failed simulation runs. These entities form the descriptive metadata.

We decided to build a data-centric metadata model, so the data set marks the central entity in the object model. The rationale behind this is that the interpretation of the data set is at the center of scientific reasoning in simulation science. When the simulation is done, the data set is analyzed with different methods, resulting in new data sets. The processing step is needed as an entity to model the various simulation runs or analysis steps within one simulation project. Although we focused on data from simulations, EngMeta is also suitable for the description of experimental data.

3.2 The Metadata Core

Figure 3: Components and Fields of the EngMeta Metadata Model

Once we were clear about the objects to be incorporated into a relevant metadata model, we checked whether existing metadata models fit our needs. To the best of our knowledge, none of the existing metadata standards includes all parts of our object model. Existing metadata standards such as PREMIS or DataCite were too general; CodeMeta (software) and ExptML (experiments) fit only part of the information. Discipline-specific metadata models like CML apply only to the chemical part of thermodynamics, but not to the engineering information. This is why we decided to build a model from scratch, building on the standards DataCite, CodeMeta, ExptML and PREMIS. We use or recommend standardized vocabularies where they exist, mainly for the general descriptive and technical metadata fields.

CodeMeta is the foundation for most of the metadata fields of the software entity in our model, which describes the software, such as the simulation codes used to create the research data. These fields are name, contributor, softwareVersion, programmingLanguage, operatingSystem, url, softwareSourceCode, softwareApplication, codeRepository, citation and referencePublication. Only one element is not derived from CodeMeta.

PREMIS provides the only non-CodeMeta element within the software entity, the license element, which is formed by the pm:licenseInformationComplexType data type. Moreover, PREMIS is used directly inside the main data set entity for storage (data type pm:storageComplexType), format (data type pm:formatComplexType) and rightsStatement (data type pm:rightsStatementComplexType).

ExptML is used only for the instrument element within the processingStep entity, to describe the experimental instruments used. For this, the ex:instrumentType data type from ExptML is used.

DataCite is used throughout different metadata entities in our model. In the context entity, which describes the related work of the research data, it provides the relatedIdentifierType (with the data type dtc:relatedIdentifierType) and the relationType (with the data type dtc:relationType). Within the description entity for general descriptive information on the data object, DataCite delivers the descriptionType element with dtc:descriptionType. Within the resourceType entity, it handles the resourceTypeGeneral element with dtc:resourceType. The personOrOrganization entity, representing general information on involved stakeholders, includes the role element as the DataCite type dtc:contributorType. The fundingReference entity includes the funderIdentifierType element, using the dtc:funderIdentifierType data type. The title entity uses the dtc:titleType data type for the titleType element. In the same way, the date entity uses the dtc:dateType data type for the dateType element.
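
To illustrate how these borrowed types plug into the schema, the following is a minimal XSD sketch for the context entity. The element and type names follow the text above, but the namespace URI, schemaLocation and exact structure of the published schema may differ:

  <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
             xmlns:dtc="http://datacite.org/schema/kernel-4">
    <!-- import the DataCite kernel so its types can be reused -->
    <xs:import namespace="http://datacite.org/schema/kernel-4"
               schemaLocation="datacite-metadata.xsd"/>
    <!-- context: a piece of related work, typed with DataCite vocabulary types -->
    <xs:complexType name="contextType">
      <xs:sequence>
        <xs:element name="relatedIdentifier" type="xs:string"/>
        <xs:element name="relatedIdentifierType" type="dtc:relatedIdentifierType"/>
        <xs:element name="relationType" type="dtc:relationType"/>
      </xs:sequence>
    </xs:complexType>
  </xs:schema>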

Figure 4: Metadata model with the inflated central entities dataset, processingStep and system and their relation to existing metadata standards.

Figure 4 shows the metadata model with the inflated central entities dataset, processingStep and system and their relation to existing metadata standards. Since the model is data-centric, the dataset entity is the central entity, where all other metadata entities converge to one extensive description of a piece of research data. However, processingStep is also an important entity, since all the work that converges in a dataset is done in one or several processing steps, with their parts underneath, such as the observed system or the software used.

The whole model is implemented as an XML schema (the full XSD file as well as an example can be found online: https://bit.ly/2WQTWv3, last checked on May 24th, 2019), which has the advantage of a strict structure that data can be verified against.

Let’s assume that we want to describe the results of a thermodynamical simulation of the binding energies of two big molecules, run on an HPC platform with the help of the open-source software Gromacs and post-processed and analyzed with Python scripts. The components of the observed system would be the names and SMILES codes of the molecules and the solvent, with the used force field (with names and parameters as attributes) as sub-elements. The measured variable is the distance between the molecules; the controlled variables are the number of molecules, temperature and pressure. The temporal resolution would be described through the number of time steps with their interval in between.

A lot of data files are produced in three processing steps. The first processing step, of type “data generation”, describes the simulation itself. It links the input files as input and the resulting trajectory files as output, and optionally documents the researcher as actor and the end date of the simulation as creation date. Gromacs would be the software used, described with name and version and optionally with further description, like a link to the source code or a describing publication. The method would be “thermodynamical simulation with umbrella sampling” with the parameters “integrator”, “thermostat” and “barostat”. The computational environment could contain the name of the cluster, the number of nodes and cores used and optionally the compiler with its parameters.

The second processing step, of type “post processing”, has the trajectory files as input, the cleaned data files as output and the Python script as software, here defined by a link to the script file.

The third processing step is of type “analysis”, with the cleaned output file as input, the tabular data with the summarized results as output and the statistical method as method. The error method denotes “standard error from decorrelation” as the information on uncertainty.

All relevant data files from the processing steps would be recorded with filename, link or PID, and checksum.
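
Expressed as an EngMeta XML instance, this example could look roughly as follows. The sketch is illustrative only: the element names are paraphrased from the model description above, and all concrete values (molecule, files, versions, cluster) are invented:

  <dataset>
    <system>
      <component name="molecule A" smilesCode="C1=CC=CC=C1">
        <forceField name="exampleFF"/>
      </component>
      <variable type="measured" name="distance between molecules" unit="nm"/>
      <variable type="controlled" name="temperature" value="300" unit="K"/>
      <temporalResolution timesteps="1000000" interval="2 fs"/>
    </system>
    <processingStep type="data generation" date="2019-05-20">
      <actor>Jane Doe</actor>
      <input file="topol.tpr"/>
      <output file="traj.xtc"/>
      <method name="thermodynamical simulation with umbrella sampling">
        <parameter name="integrator" value="md"/>
        <parameter name="thermostat" value="v-rescale"/>
      </method>
      <software name="Gromacs" softwareVersion="2018.4"/>
      <environment cluster="cluster01" nodes="64" cores="1536"/>
    </processingStep>
    <file name="traj.xtc" pid="..." checksum="..."/>
  </dataset>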

3.3 Crosswalk to PROV

PROV (Belhajjame et al., 2013a) is a W3C standard to capture provenance information in a structured way. It comes with a data model (Belhajjame et al., 2013b) and with implementations in the form of an ontology and an XML schema as well as a human-readable notation. In its base model, PROV connects activities with agents and entities through relations. Entities are used or generated by activities and attributed to agents. Activities are associated with agents. The provenance information of EngMeta is a list of processingSteps. Each processingStep defines the stage in the research process (data generation, post processing, analysis, visualization), the date and actor of the step, and, optionally, input and output files, used (error) methods, software, instruments, computing environment and execution command. To convert EngMeta to PROV, each processing step becomes an activity, each actor becomes an agent and every other piece of information about a processing step becomes an entity. The activity is connected via the used relation with the entities for input files, methods, instruments, software, computing environment and execution command, and gets the date property of the processing step to indicate the sequence of the activities. The output files are connected to the activity with the wasGeneratedBy relation. Figure 5 visualizes the conversion of a processing step of EngMeta into PROV.

Figure 5: Conversion of a processing step of EngMeta into PROV
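
As a sketch, the PROV-XML serialization of one converted processing step might look like this (the identifiers are invented for illustration, and only the used, wasGeneratedBy and wasAssociatedWith relations from the text are shown):

  <prov:document xmlns:prov="http://www.w3.org/ns/prov#"
                 xmlns:ex="http://example.org/">
    <prov:activity prov:id="ex:step1"/>       <!-- the processing step -->
    <prov:agent prov:id="ex:researcher1"/>    <!-- the actor -->
    <prov:entity prov:id="ex:inputFile1"/>
    <prov:entity prov:id="ex:software1"/>
    <prov:entity prov:id="ex:outputFile1"/>
    <prov:used>
      <prov:activity prov:ref="ex:step1"/>
      <prov:entity prov:ref="ex:inputFile1"/>
    </prov:used>
    <prov:used>
      <prov:activity prov:ref="ex:step1"/>
      <prov:entity prov:ref="ex:software1"/>
    </prov:used>
    <prov:wasGeneratedBy>
      <prov:entity prov:ref="ex:outputFile1"/>
      <prov:activity prov:ref="ex:step1"/>
    </prov:wasGeneratedBy>
    <prov:wasAssociatedWith>
      <prov:activity prov:ref="ex:step1"/>
      <prov:agent prov:ref="ex:researcher1"/>
    </prov:wasAssociatedWith>
  </prov:document>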

4 Usage of EngMeta

EngMeta provides the opportunity to describe a dataset and its corresponding research process with a lot of information. But the metadata schema will only be used if there are tools that help with the generation and management of the metadata. We embedded EngMeta in a data repository based on Dataverse and implemented a tool for the extraction of metadata that is already available in log and input files.

4.1 Automated Metadata Extraction from Simulation Files

The tagging of metadata is a burden to researchers. On the one hand, it is necessary for good research data management; on the other, it is a time-consuming activity, and researchers prefer to invest their time in their scientific endeavour. This is why we developed an approach for automated metadata extraction. In computational engineering, a lot of metadata is already available through the input, output and log files of the simulation codes, in structured or semi-structured form. Typically, technical metadata such as filesystem attributes, process metadata such as the computational environment, and discipline-specific metadata like controlled variables already exist in some files. For the GROMACS simulation code, for example, which is used in thermodynamics, information on the computational environment is scattered through various output and log files of the code. An automated extraction of metadata then has the task of collecting this metadata from the different sources. This data has to be parsed and transferred into the EngMeta scheme. The extraction is generic in the sense that the parsing of the files is directed by a configuration file. This means that the extraction is applicable to and configurable for different simulation codes and outputs, and even user-generated readme files. The configuration file contains all the information necessary for parsing: the metadata key in terms of the EngMeta specification, the location where to search for it, the search key (how to find it), the delimiter that separates the key from the value, and other information for the semantics of the results (i.e. which keys belong together). With this approach, metadata in any textual output file of the simulation codes that follows a consistent syntax can be parsed.
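
The concrete configuration syntax of the tool is not reproduced here; as a hypothetical sketch, one entry of such a configuration file could carry exactly the pieces of information listed above (the key names and the GROMACS log line are invented for illustration):

  <configuration>
    <entry>
      <!-- where the extracted value goes in the EngMeta scheme -->
      <metadataKey>processingStep/environment/nodes</metadataKey>
      <!-- which file to scan -->
      <location>md.log</location>
      <!-- how to find the line, and how to split key from value -->
      <searchKey>Running on</searchKey>
      <delimiter>:</delimiter>
      <!-- grouping hint: this value belongs to the environment block -->
      <group>environment</group>
    </entry>
  </configuration>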

Since most supercomputer and cluster systems are based on some kind of Linux, the automated metadata extraction was implemented in Java for the sake of interoperability. With Java, Windows operating systems are also supported, which is crucial since experimental systems often run on this operating system. We have developed two versions of the automated metadata extraction: a native version, where all parsing is done linearly with the Java Scanner API, and a parallel version that uses the Apache Spark data analytics framework (https://spark.apache.org/, accessed June 7th, 2019). The rationale behind these two versions is that the native version ensures compatibility on all computer systems that support Java 1.8, while the parallel version may be needed when large output files have to be analyzed and parallelization is an advantage. The drawback of the latter, however, is that the Spark framework needs to be installed on the cluster.

4.2 DaRUS Repository

To test the metadata scheme in practice, we implemented it in DaRUS, the data repository of the University of Stuttgart, which is based on Dataverse. Dataverse is a repository software for research data, developed by the IQSS at Harvard together with an international community of developers. A Dataverse repository is hierarchically organized in collections, named dataverses. Each dataverse has its own user management and metadata configuration. While Dataverse is developed mainly for the publication of research data, we additionally use it for the internal management and sharing of hot data. DaRUS therefore also acts as a metadata store.

To implement the scheme in DaRUS, we had to map the metadata fields of EngMeta onto the metadata configuration of Dataverse. Dataverse comes out of the box with a set of metadata blocks: citation metadata for the general description of data and a set of discipline-specific metadata blocks for the geosciences, social sciences, astronomy and astrophysics, and the life sciences. Each metadata block consists of either simple fields (key-value pairs) or compound fields consisting of simple fields. New metadata schemes can be added by a super admin for the whole data repository. For each dataverse (collection) within DaRUS, the local admin can configure the visible, optional and required fields.

As EngMeta is a hierarchical, multi-layered XML schema, the challenge was to flatten the EngMeta keys and break them into suitable metadata blocks. First, we divided the fields of EngMeta into general descriptive metadata fields, process metadata, and discipline-specific metadata.

The citation metadata block of Dataverse covers most of the general descriptive metadata in EngMeta. We only added to this block the possibility to mark negative results, with a success field and a success note.

The discipline-specific metadata includes the information about the observed system (variables, parameters, components) and information about the observation itself (spatial and temporal resolution). As the important parameters vary strongly both between and within engineering disciplines, all parameters can be added with a name and a value. So, for example, instead of an extra metadata field for the Reynolds number, the researcher can add a parameter with the attribute name = “Reynolds Number” and the corresponding value. This procedure reduces the number of metadata fields and increases the freedom of researchers. However, it also increases the risk of typos and inconsistent names and complicates the search and filter options on this field.
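
In the EngMeta XML serialization, such a generic parameter might be recorded as follows (a minimal sketch with an invented value):

  <parameter>
    <name>Reynolds Number</name>
    <value>64000</value>
  </parameter>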

Especially the process metadata part of EngMeta, which assigns the methods used with their parameters, software and hardware to individual processing steps, could not be mapped 1:1 onto Dataverse. As DaRUS is mainly a search index helping to find the data, we decided to extract the information from EngMeta that is most likely to be searched for (software, methods, hardware and parameters used) without mapping it to a processing step. To maintain the information about the process and its chronological order, the process part of EngMeta can be transformed into a PROV file (see section 3.3) and uploaded into the repository together with the data.

5 Evaluation

Evaluating metadata quality is not easy, since, just “[l]ike pornography, metadata quality is difficult to define” (Bruce and Hillmann, 2004). In this section, however, we present a qualitative evaluation based on the two frameworks proposed in Bruce and Hillmann (2004) and NISO (2007). Moreover, we conducted a survey among researchers in which they assessed the relevance of the EngMeta fields, to check whether the model fits their needs.

5.1 Qualitative Evaluation

Here, we check EngMeta against the recommendations for good metadata as they were proposed in Bruce and Hillmann (2004) and NISO (2007). In the framework of guidance for building good digital collections, six principles for metadata quality are addressed.

The first principle for good metadata in the framework of NISO (2007) refers to existing standards: existing metadata standards should be used if possible, and self-built metadata schemas should be avoided. Any approach should be preceded by a requirement analysis. EngMeta was designed according to this principle, even though it is a “homegrown” scheme. We incorporated standards wherever we could, e.g. DataCite, PREMIS, ProvOne, CodeMeta and ExptML. The origin of each metadata field can be seen in figure 4. Moreover, we designed the first version in a joint effort with two engineering institutes of the University of Stuttgart, preceded by a requirement analysis published in Iglezakis and Schembera (2018).

The second principle addresses the interoperability of the metadata scheme and means that the metadata information should be technically interoperable and understandable without knowing the context. Within EngMeta, technical or syntactical interoperability is achieved by the usage of XML as a machine-readable and system-independent format for information representation, and of XSD for a clear definition of the scheme. To ensure semantic understandability, the metadata model offers a wide range of attributes, categorized into technical, descriptive, process-specific and discipline-specific metadata. With this information, a dataset can be understood independently of its creators, machines and workflows.

The third principle relates to controlled vocabularies. EngMeta addresses this by using the controlled vocabularies of the incorporated metadata standards. Moreover, some of the values for metadata entities are pre-defined by enumerations in the schema.

The fourth principle demands a clear statement of the terms of use. In EngMeta, this is accomplished by the license metadata field, which is derived from the PREMIS metadata standard.

The fifth principle seeks the inclusion of preservation metadata, which EngMeta clearly provides: it supports PREMIS metadata for the long-term curation of the data, such as checksums, file sizes and file formats. Moreover, different processing steps can be defined for a dataset.

The sixth principle claims that good metadata needs to include meta-metadata, that is, a description of how the metadata is structured and can be understood. Since EngMeta is available as an XML schema, an explanation is implicit. Additionally, the documentation tags of the XML schema were used to give supplementary information for each metadata entity, and comments were used to give further explanation.

With respect to the quality measures proposed in Bruce and Hillmann (2004), the first relates to completeness. This means that a metadata model should be complete in the sense that the target system is described with all the needed information. Moreover, it means that most, or in the best case all, elements are used later. The first part is fulfilled because EngMeta was designed with researchers from neighbouring but distinct fields of computational engineering, so a basic common ground was determined. All entities that are included have relevance in computational engineering. This is supported by the survey we present in section 5.3. The second part implies a quantitative evaluation. Because the repository is not yet in production, we are not yet able to present such an analysis, but hope to complete it in the future.

The second criterion for a good metadata model is accuracy, meaning that it should be unambiguous and the information should be correct. This is fulfilled because EngMeta uses an XML schema with a strict definition of values, ranges, etc. Moreover, controlled vocabularies are used wherever possible.

The third criterion aims at the inclusion of provenance information, and this provenance information should also be available for the metadata itself, i.e. who created the metadata. This holds for EngMeta, since it offers possibilities to include provenance information through the processingStep entity. Multiple processing steps can be defined, one of which can also be used to describe the provenance of the data creation process.

Fourth, the metadata model should conform to the expectations of its users. This is true for EngMeta, because it was developed together with the computational engineering community. The survey presented in section 5.3 supports this argument.

Fifth, metadata has to be logically consistent and coherent, meaning that the elements should be defined according to standards and standard methods should be used for metadata handling, such as crosswalks. EngMeta is a combination of the existing metadata standards DataCite, PREMIS, ExptML and CodeMeta, with a lot of additional fields that were not part of any existing model. Moreover, we implemented a crosswalk to PROV (see section 3.3).

The sixth metadata quality measure is timeliness with respect to the link between the metadata information and the described object. In EngMeta, this is fulfilled due to the possibility to store a PID. In our approach, in combination with the Dataverse repository, we include a DOI when uploading data and metadata to the repository. This is also the rationale for including DataCite as a metadata standard.

The last quality measure refers to accessibility. This means that technical, organizational, economic and trade-related barriers should be avoided. The metadata model itself is openly accessible and usable. It is understandable, since the XSD offers a lot of information on how to read the metadata model. Both metadata and data described with the model can be published in the Dataverse repository, if the creator so decides.

5.2 Experiences

During a test phase of the data repository, first pilot users from different institutes of the University of Stuttgart (aerodynamics, thermodynamics, aircraft construction, mechanics, hydraulic engineering) tested the applicability of EngMeta for their research data. The first results allow only initial insights and no quantitative evidence: the most frequently used fields so far are measured variables (like density or velocity) and controlled variables (like pressure and temperature), system components and parameters (like temperature coupling, Reynolds or Mach number) and the temporal resolution of the simulation or observation. The controlled variables and system parameters and components are also information the researchers want to search and filter for. Dataverse builds on a Solr index and offers a full-text search for textual information and search facets for discrete information, but no search interface for range queries on numerical values. The generic definition of variables and parameters through name and value further complicates such a numerical search.

5.3 Survey

As the practical test of EngMeta in DaRUS only gives qualitative hints, we conducted a survey on the applicability and relevance of EngMeta for the description of research data from different engineering disciplines. The survey took place in the form of an online questionnaire in May/June 2019 at the University of Stuttgart. Five researchers took part in a pretest to determine the completion time and find any errors. The actual survey was announced at all engineering faculties of the university, together with a general note in the newsletter for all employees of the university. In total, 96 researchers participated in the survey, of which 11 persons came from a non-engineering discipline and were therefore excluded from the analysis, resulting in 85 participants. Most of the participants () denoted themselves as researchers, as institute directors, as group leaders and one participant as a technical employee. Table 1 provides an overview of the disciplines of the survey participants.

Discipline Percentage of Participants
aerospace engineering
mechanical engineering
electrical engineering
civil engineering
process engineering
materials science
environment engineering
mechatronics
industrial engineering
Table 1: Disciplines of the survey participants

of the participants use theoretical analysis, simulations and experiments as their scientific approach. Most of the participants () have no experience with research data management. The data types generated during research are mainly tabular data (), followed by models (), image files (), binary raw data (), text files (), software (), video files (), workflows (), physical objects or samples () and audio files ().

The participants estimated the relevance of the individual metadata fields of EngMeta for the description of their data on a 5-level Likert scale. Alternatively, participants could indicate that they were unsure how to understand a metadata field. For the discipline-specific part of EngMeta, we also asked for technical terms to name the individual fields and for example values of the fields. In addition to EngMeta's metadata fields, we also asked participants about discipline-specific metadata categories from other areas: geo-data to specify a location or area, and information on sampling.

Figure 6: Mean Relevance Estimation of the Metadata Fields of EngMeta

Figure 6 visualizes the mean relevance estimation of the metadata fields and categories through a grey scale. The more relevant a field is for the participants, the darker the background of its box in the drawing.

The results fit well with the experiences made within DaRUS. Most relevant for the researchers are the discipline-specific fields: the components (, ), boundary conditions (, ), parameters (, ) and description (, ) of the system, the measured (, ) and controlled variables (, ) and the temporal resolution (, ) of the observation.

From the descriptive metadata, the most relevant for the engineers were title with a mean relevance of (), description (, ) and related publication (, ), followed by the data type (, ), the date (, ) and the possibilities to mark negative results (, ) and version the data (, ).

For describing the research process, the participants rated highest the relevance of the used methods (, ) (with name and parameters), input and output files (, ), the classification of the processing step (, ), and the used software (, ) (specified by name and version).

From the technical metadata, the file name (, ) and file type (, ) were the most relevant.

There were some fields whose meaning was unclear to some of the participants. Nearly a third () of the participants had no notion of a persistent identifier, a checksum or an embargo. Interestingly, for of respondents it was also unclear what was meant by controlled variables.

The least relevant were, as expected, the metadata categories from other disciplines: geo-data with a mean relevance of () and sampling with a mean relevance of (). From the EngMeta fields, the least relevant information categories are funding (, ), other collaborators apart from the authors (, ), and technical information like the checksum (, ).

All other metadata fields were at least mildly relevant for the engineers with a mean relevance .

Some of the fields were evaluated differently depending on the discipline, mostly non-scientific information like PID, licence and embargo. The spatial resolution was also much more relevant for researchers from aerodynamics (), civil engineering (), environment engineering (), mechatronics () and mechanical engineering () than for researchers from industrial engineering (), materials science (), process engineering () and electrical engineering (). Information about sampling, important in the social sciences, was highly relevant for materials science (), mildly relevant for process engineers (), electrical engineers () and civil engineers (), and mostly irrelevant for scientists from aerodynamics (), mechatronics (), mechanical engineering () and environmental engineering (). Due to the sometimes very small number of participants in some disciplines, however, these differences can only be interpreted as vague indications.

All in all, EngMeta seems to match the information relevant and important for researchers to describe their data. But, as the survey results imply, researchers should be relieved of the burden of dealing with information from outside their field, such as PIDs, checksums and legal issues, whether through automated recording or simple guidelines.

6 Conclusions and Future Work

EngMeta has now undergone iterative improvements and serves as the core for several efforts. For the automated metadata extraction, it defines the keys to which the extracted information can be mapped. For the DaRUS research data repository, it serves as the center for data description and management. The EngMeta keys define a data object inside the repository.

EngMeta is a first attempt at a description scheme for engineering data, developed mainly with researchers from aerodynamics and thermodynamics. While the evaluation suggests that EngMeta is going in the right direction, we plan to discuss the applicability and concrete structure of the fields with broader circles, both among scientists and in the field of research data management. The newly founded interest group for research data management in engineering of the Research Data Alliance (https://rd-alliance.org/groups/research-data-management-engineering-ig, last checked June 6th, 2019) is a good starting point on the international level, whereas the consortium for engineering in the context of the national research data infrastructure operates at the national level. In both communities, EngMeta has already been introduced. As soon as the data repository DaRUS is live, we will publish EngMeta properly and register the schema in the metadata schema catalog of the RDA (https://rdamsc.bath.ac.uk/, last checked June 6th, 2019).

Among the most important metadata fields are variables and parameters, which can be freely specified in EngMeta with name, value and unit (and optionally with an uncertainty). This gives the scientists freedom and simplifies the scheme, but reduces the standardization and machine-actionability of the content. Adding controlled vocabularies for the names and units of these fields is the next step to enhance the interoperability of the content. We are currently in dialogue with the team of the SmartCom project (https://www.ptb.de/empir2018/smartcom/project/overview/, last checked June 6th, 2019), who work on such vocabularies for the engineering field.

Regarding the automated metadata extraction, the parsing as proposed in this paper works fine for basic log and output files. In the future, we will extend the metadata extraction to parse files with a syntax different from the plain style.

As further forthcoming work, we intend to conduct a quantitative evaluation with the metrics proposed in Gavrilis et al. (2015). The DaRUS repository will go into production in mid-2019, so we will expand our quantitative basis during the year, gaining more and more information about the real usage of the metadata fields.

Acknowledgement

The DIPL-ING project is funded by the Federal Ministry of Education and Research under Grant No. FDM-008.

References

  • Apache Spark. Apache Spark - lightning-fast unified analytics engine. https://spark.apache.org/. accessed Dec 13, 2019.
  • Belhajjame et al. (2013a) Belhajjame, K., Deus, H., Garijo, D., Klyne, G., Missier, P., Soiland-Reyes, S., and Zednik, S. (2013a). PROV Model Primer. https://www.w3.org/TR/2013/NOTE-prov-primer-20130430/. Technical report, W3C.
  • Belhajjame et al. (2013b) Belhajjame, K., Reza, B., Cheney, J., Coppens, S., Cresswell, S., Gil, Y., Groth, P., Klyne, G., Lebo, T., McCusker, J., Miles, S., Myers, J., Sahoo, S., and Tilmes, C. (2013b). PROV-DM: The PROV Data Model. http://www.w3.org/TR/2013/REC-prov-dm-20130430/. accessed Nov 25th, 2019, W3C Recommendation.
  • Bruce and Hillmann (2004) Bruce, T. R. and Hillmann, D. I. (2004). The continuum of metadata quality: defining, expressing, exploiting. Metadata in Practice, pages 238–256.
  • CMLC (2012) CMLC (2012). Chemical Markup Language - CML. http://www.xml-cml.org/. accessed Nov 25, 2019.
  • DataCite (2017) DataCite (2017). DataCite Metadata Schema for the Publication and Citation of Research Data. Version 4.1. doi: 10.5438/0015.
  • Erickson and Maali (2014) Erickson, J. and Maali, F. (2014). Data catalog vocabulary (DCAT). W3C recommendation, W3C. http://www.w3.org/TR/2014/REC-vocab-dcat-20140116/. accessed Jan 7th, 2020.
  • Gavrilis et al. (2015) Gavrilis, D., Makri, D.-N., Papachristopoulos, L., Angelis, S., Kravvaritis, K., Papatheodorou, C., and Constantopoulos, P. (2015). Measuring quality in metadata repositories. In Kapidakis, S., Mazurek, C., and Werla, M., editors, Research and Advanced Technology for Digital Libraries, pages 56–67, Cham. Springer International Publishing. doi: 10.1007/978-3-319-24592-8_5.
  • Green and Humphrey (2013) Green, A. and Humphrey, C. (2013). Building the DDI. IASSIST Quarterly, 37:36–44.
  • Grunzke et al. (2014) Grunzke, R., Breuers, S., Gesing, S., Herres-Pawlis, S., Kruse, M., Blunk, D., de la Garza, L., Packschies, L., Schäfer, P., Schärfe, C., Schlemmer, T., Steinke, T., Schuller, B., Müller-Pfefferkorn, R., Jäkel, R., Nagel, W. E., Atkinson, M., and Krüger, J. (2014). Standards-based metadata management for molecular simulations. Concurrency and Computation: Practice and Experience, 26(10):1744–1759. doi: 10.1002/cpe.3116.
  • Hillmann (2008) Hillmann, D. I. (2008). Metadata quality: From evaluation to augmentation. Cataloging & Classification Quarterly, 46(1):65–80.
  • Iglezakis (2019) Iglezakis, D. (2019). Relevance of Different Metadata Fields for the Description of Research Data from the Engineering Sciences. https://doi.org/10.18419/darus-501.
  • Iglezakis and Schembera (2018) Iglezakis, D. and Schembera, B. (2018). Anforderungen der Ingenieurwissenschaften an das Forschungsdatenmanagement der Universität Stuttgart - Ergebnisse der Bedarfsanalyse des Projektes DIPL-ING. o-bib. Das offene Bibliotheksjournal, 3. doi: 10.5282/o-bib/2018H3S46-60.
  • Iglezakis and Schembera (2019) Iglezakis, D. and Schembera, B. (2019). EngMeta - a Metadata Scheme for the Engineering Sciences. https://doi.org/10.18419/darus-500.
  • Lautenschlager et al. (1998) Lautenschlager, M., Toussaint, F., Thiemann, H., and Reinke, M. (1998). The CERA-2 data model. https://www.pik-potsdam.de/cera/Descriptions/Publications/Papers/9807_DKRZ_TechRep15/cera2.pdf. accessed Jan 7, 2020.
  • Murray-Rust and Rzepa (2011) Murray-Rust, P. and Rzepa, H. S. (2011). CML: Evolution and design. J. Cheminformatics, 3:44.
  • NFDI4ING. Metadata4Ing. https://nfdi4ing.de/projects/metadata4ing/. accessed Dec 13, 2019.
  • NISO (2007) NISO (2007). A framework of guidance for building good digital collections. https://www.niso.org/sites/default/files/2017-08/framework3.pdf. accessed Jan 7, 2020.
  • Park (2009) Park, J.-R. (2009). Metadata quality in digital repositories: A survey of the current state of the art. Cataloging & Classification Quarterly, 47(3-4):213–228.
  • Park and Tosaka (2010) Park, J.-R. and Tosaka, Y. (2010). Metadata quality control in digital repositories and collections: Criteria, semantics, and mechanisms. Cataloging & Classification Quarterly, 48(8):696–715.
  • Research Data Alliance (2019) Research Data Alliance (2019). Research Metadata Schemas WG. https://www.rd-alliance.org/groups/research-metadata-schemas-wg. accessed Nov 25, 2019.
  • Research Data Alliance IG Engineering. Research Data Management in Engineering IG. https://www.rd-alliance.org/groups/research-data-management-engineering-ig. accessed Dec 13, 2019.
  • Research Data Alliance Metadata. Metadata standards catalog. https://rdamsc.bath.ac.uk/. accessed Dec 13, 2019.
  • Rousidis et al. (2014a) Rousidis, D., Garoufallou, E., Balatsoukas, P., and Sicilia, M.-n. (2014a). Metadata for big data: A preliminary investigation of metadata quality issues in research data repositories. Inf. Services and Use, 34(3-4):279–286. doi: 10.3233/ISU-140746.
  • Rousidis et al. (2014b) Rousidis, D., Sicilia, M.-n., Garoufallou, E., and Balatsoukas, P. (2014b). Data quality issues and content analysis for research data repositories : The case of dryad. In Polydoratou, P. and Dobreva, M., editors, ELPUB, pages 49–58. IOS Press. doi: 10.3233/978-1-61499-409-1-49.
  • Schema.org. Dataset. http://schema.org/Dataset. accessed Dec 13, 2019.
  • Schembera and Bönisch (2017) Schembera, B. and Bönisch, T. (2017). Challenges of Research Data Management for High Performance Computing. In Proceedings of the International Conference on Theory and Practice of Digital Libraries, pages 140–151. Springer, Cham. doi: 10.1007/978-3-319-67008-9_12.
  • Schembera and Iglezakis (2019) Schembera, B. and Iglezakis, D. (2019). The Genesis of EngMeta - A Metadata Model for Research Data in Computational Engineering. In Garoufallou, E., Sartori, F., Siatri, R., and Zervas, M., editors, Metadata and Semantic Research, pages 127–132, Cham. Springer International Publishing. doi: 10.1007/978-3-030-14401-2_12.
  • Dataverse Team. About The Project. https://dataverse.org/about. accessed Dec 09, 2019.
  • Vardigan et al. (2008) Vardigan, M., Heus, P., and Thomas, W. (2008). Data documentation initiative: Toward a standard for the social sciences. IJDC, 3(1):107–113. doi: 10.2218/ijdc.v3i1.45.
  • Walker (2012) Walker, A. (2012). CMLComp - eMinerals and Materials Grid resources. http://homepages.see.leeds.ac.uk/~earawa/CMLComp/index.html. accessed Nov 25, 2019.
  • Wilkinson et al. (2016) Wilkinson, M. D., Dumontier, M., Aalbersberg, I. J., Appleton, G., Axton, M., Baak, A., Blomberg, N., Boiten, J.-W., da Silva Santos, L. B., Bourne, P. E., et al. (2016). The fair guiding principles for scientific data management and stewardship. Scientific data, 3.