Scientific progress in biodiversity research, a field dealing with the diversity of life on Earth - the variety of species, genetic diversity, the diversity of functions, interactions, and ecosystems - is increasingly achieved by the integration and analysis of heterogeneous datasets [25, 12]. Finding suitable data for synthesis is therefore a key challenge in daily research practice. Datasets can differ in format and size, and data of interest are often scattered across various repositories focusing on different domains. In a survey conducted by the Research Data Alliance (RDA) Data Discovery Group, 35% of the 98 participating repositories stated that they host data from the Life Sciences and 34% indicated that they cover the Earth Sciences. All of these are potentially of interest to biodiversity researchers.
However, the search services offered by public data providers do not seem to support scholars effectively. A study by Kacprzak et al. reports that 40% of the users who had sent data search requests to two open data portals said that they could not find the data they were interested in and thus requested the data directly from the repository manager. In several studies, ecologists report on the difficulties they had when looking for suitable datasets to reuse. Scholars from research projects we are involved in also complain that data discovery is a time-consuming task: they have to search in a variety of data repositories with several different search terms to find data about species, habitats, or processes. Thus, there is a high demand for new techniques and methods that better support scholars in finding relevant data.
In this study, we explore what hampers dataset retrieval in biodiversity research. We analyze two building blocks of retrieval systems: information needs (user queries) and the underlying data. We want to find out how large the gap is between scholarly search interests and the data provided. To identify scholarly search interests, we analyzed user questions. In contrast to user queries, which are usually formulated as a few keywords, questions represent a search context, i.e., a more comprehensive information need. Characteristic terms or phrases in these textual resources can be labeled and classified to identify biological entities [42, 55]. Scientific data are not easily accessible by classical text retrieval mechanisms, as these were mainly developed for unstructured textual resources. Thus, effective data retrieval heavily relies on the availability of proper metadata (structured information about the data) that describe available datasets in a way that enables their Findability, one of the principles ensuring FAIR data. A survey conducted by the Research Data Alliance (RDA) Data Discovery Group points out that of the 98 participating data repositories, 58% index all metadata, 52% index partial metadata, and only 33% integrate data dictionaries or variables.
We argue that Findability at least partially depends on how well metadata reflect scholarly information needs. Therefore, we propose the following layered approach:
(A) At first, we identified main entity types (categories) that are important in biodiversity research. We collected questions provided by 73 scholars of three large and very diverse biodiversity projects in Germany, namely AquaDiva , GFBio - The German Federation for Biological Data  and iDiv - The German Research Center for Integrative Biodiversity Research . Two authors of this publication labeled and grouped all noun entities into categories (entity types), which were identified in several discussion rounds. Finally, all proposed categories were evaluated with biodiversity scholars in an online survey. The scholars assigned the proposed categories to important phrases and terms in the questions (Section “A - Information Needs in the Biodiversity Domain”).
(B) Most data providers use keyword-based search engines that return datasets exactly matching keywords entered by a user. In dataset search, the main source is metadata, which contain structured entries on measurements, data parameters, or species observed rather than textual descriptions. How sparse or rich a description turns out to be, and which facets are provided for filtering, depends on the metadata schema used. Therefore, we inspected common metadata standards in the Life Sciences and analyzed to what extent their metadata schemes cover the identified information categories (Section “B - Metadata Standards in the Life Sciences”).
(C) There are several data repositories that accept and archive scientific data for biodiversity research. According to Nature’s list of recommended data repositories, repositories such as Dryad, Zenodo, or Figshare are generalist repositories and can handle different types of data. Data repositories such as Pangaea (environmental data) or GBIF (taxonomic data) are domain-specific and only accept data of a specific format. We harvested and parsed all publicly available metadata from these repositories and analyzed whether they utilize metadata schemes with elements reflecting search interests. For GBIF, we concentrated on datasets only, as individual occurrence records are not available in the metadata API. We explored how many fields of the respective schemas are actually used and filled (Section “C - Metadata Usage in Selected Data Repositories”).
(D) Finally, we discuss the results and outline how to consider and address user interests in metadata (Section “D - Discussion”).
In order to foster reproducibility, questions, scripts, results, and the parsed metadata are publicly available:
The structure of the paper is as follows: The first part “Definitions” focuses on the clarification of various terms. This is followed by sections that explain basics in Information Retrieval (“Background”) and “Related Work”. The fourth section “Objectives” gives an overview of our research idea. The following four sections contain the individual research contributions described above. Each of these sections describes the respective methodology and results. Finally, section “Conclusion” summarizes our findings.
Since dataset retrieval is still a largely unexplored research field, few definitions exist describing what it comprises and how it can be characterized. Here, we briefly introduce an existing definition and add our own definition from the Life Sciences’ perspective.
Chapman et al. define a dataset as “A collection of related observations organized and formatted for a particular purpose”. They further characterize dataset search as an application that “involves the discovery, exploration, and return of datasets to an end user.” They distinguish between two types: (a) a basic search to retrieve individual datasets in data portals and (b) a constructive search where scholars create a new dataset out of various input datasets in order to analyze relationships and different influences for a specific purpose.
From our perspective, this definition of a dataset is too restrictive. All kinds of scientific data, such as experimental data, observations, environmental and genome data, simulations, and computations, can be considered datasets. We therefore extend the definition of Chapman et al. as follows:
A dataset is a collection of scientific data including primary data and metadata organized and formatted for a particular purpose.
We agree with Chapman et al.’s definition of dataset search. We use Dataset Search and Dataset Retrieval synonymously and define it as follows:
Dataset Retrieval comprises the search process, the ranking and return of scientific datasets.
Unger et al. introduced three dimensions to take into account in Question Answering, namely the User and Data perspectives as well as the Complexity of a task. We argue that these dimensions can also be applied to dataset retrieval.
In conventional retrieval systems users’ search interests are represented as a few keywords that are sent to the system as a search query. Keywords are usually embedded in a search context that can be expressed in a full sentence or a question.
In order to understand what users are looking for, a semantic analysis is needed. Information Extraction is a text mining technique that identifies main topics (also called entity types) occurring in unstructured text. Noun entities are extracted and categorized based on rules. Common, domain-independent entity types are, for instance, Person, Location, and Time. When it comes to specific domains, additional entity types corresponding to core user interests need to be taken into consideration. In bio-medicine, the main topics are data type, disease type, biological process, and organism. In newer research fields such as biodiversity research, these main entity types still need to be identified in order to gain insights into users’ information needs and to be able to later adapt systems to user requirements.
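As a toy illustration of such rule-based categorization, the following sketch tags question terms using small hand-made gazetteers. The entity types and term lists are invented for illustration; they are not the categories derived later in this paper.

```python
# Toy gazetteers for illustration only; a real system would rely on
# ontologies or trained named entity recognition models.
GAZETTEERS = {
    "ORGANISM": {"earthworm", "beech", "daphnia"},
    "ENVIRONMENT": {"groundwater", "soil", "forest"},
    "PROCESS": {"infiltration", "decomposition", "photosynthesis"},
}

def tag_entities(question):
    """Assign an entity type to every token found in a gazetteer."""
    tags = []
    for token in question.lower().split():
        token = token.strip(".,?!")
        for entity_type, terms in GAZETTEERS.items():
            if token in terms:
                tags.append((token, entity_type))
    return tags

print(tag_entities("Does earthworm activity influence infiltration in soil?"))
# [('earthworm', 'ORGANISM'), ('infiltration', 'PROCESS'), ('soil', 'ENVIRONMENT')]
```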
From the data perspective, a dataset search can be classified into two types based on the source of data: primary data and metadata.
Primary data are scientific raw data. They are the result of scientific experiments, observations, or simulations and vary in type, format, and size.
Metadata are structured, descriptive information about primary data and answer the W-questions: What has been measured, by whom, when, where, and why? Metadata are created for different purposes such as search, classification, or knowledge derivation.
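As a simple illustration, a minimal metadata record answering these W-questions could be sketched as a key-value structure; the field names and values below are hypothetical and follow no particular metadata standard.

```python
# Illustrative metadata record covering the W-questions; field names and
# values are invented and do not follow a specific metadata standard.
metadata = {
    "what": "Electrical conductivity and pH of groundwater samples",  # measured variables
    "who": "Jane Doe, Example University",                            # creator/contact
    "when": "2016-04-01/2016-10-31",                                  # temporal coverage
    "where": "Hainich National Park, Germany",                        # spatial coverage
    "why": "Monitoring seasonal dynamics in karst aquifers",          # purpose
}

# A search application can index such structured fields individually,
# e.g., to offer faceted filtering by location or time.
for field, value in metadata.items():
    print(f"{field}: {value}")
```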
Dataset retrieval approaches focusing on primary data as source data have to deal with different data formats such as tabular data, images, sound files, or genome data. This requires specific query languages such as QUIS to overcome the ensuing heterogeneity and is beyond the scope of this paper. Here, we solely focus on dataset retrieval approaches that use metadata as input for search. A variety of metadata standards in the Life Sciences are introduced in Section “B - Metadata Standards in the Life Sciences”.
Scholarly search interests are as heterogeneous as data are. Information needs can range from specific questions, where users expect datasets to contain the complete answer, to broader questions that are only partially answered by datasets. Furthermore, users construct new datasets out of various input datasets. Unger et al. characterize the complexity of retrieval tasks along four dimensions: Semantic complexity describes how complex, vague, and ambiguous the formulation of a question is and whether heterogeneous data have to be retrieved. Answer locality denotes whether the answer is completely contained in one dataset, whether parts of various datasets need to be composed, or whether no data can be found to answer the question. Derivability describes whether the answer contains explicit or implicit information; the same applies to the question. If broad or vague terms appear in the question or answer, additional sources have to be integrated to enrich question and/or answer. Semantic tractability denotes whether the natural language question can be transformed into a formal query.
In this work, we do not further explore the complexity of questions. We focus solely on the analysis of user interests and metadata.
This section provides background information on which parts are involved in a search process, how the system returns a result based on a user’s query and what evaluation methods and metrics exist in Information Retrieval.
The Retrieval Process
A retrieval system consists of a collection of documents (a corpus) and a user’s information need, described with a few keywords (the query). The main aim of the retrieval process is to return a ranked list of documents that match a user’s query. The architecture of a retrieval system is depicted in Figure (1): If the document corpus is not given, an optional Crawling Process has to be run beforehand to retrieve and collect documents. The Indexing Process comprises pre-processing steps such as stopword removal, stemming, and spell checks, which clean documents of unnecessary information so that only those terms that truly represent the content of a document are analyzed. Afterwards, the system counts word frequencies within a document and across all documents. The result is an inverted index. Similar to a book index, this is a list of terms together with the number of occurrences of each term in each document and across all documents. These statistics, generated regularly in background processes, form the basis for fast access to the documents at search time. The actual search takes place in the Retrieval and Ranking Process whenever a user sends a query to the system, resulting in a ranked result set being returned to the user.
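The indexing step can be made concrete with a small sketch that builds an inverted index mapping each term to its per-document occurrence counts; the stopword list and sample documents are, of course, only illustrative.

```python
from collections import defaultdict

STOPWORDS = {"the", "a", "of", "in", "and"}  # tiny illustrative stopword list

def build_inverted_index(docs):
    """Map each term to a dict of {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            token = token.strip(".,!?")
            if token and token not in STOPWORDS:
                index[token][doc_id] = index[token].get(doc_id, 0) + 1
    return index

docs = {
    1: "Species diversity in the groundwater of the Jena region",
    2: "Groundwater samples and diversity of microbial species",
}
index = build_inverted_index(docs)
print(index["diversity"])  # {1: 1, 2: 1} - term frequency per document
```

At search time, the query terms are looked up in this index and the per-document statistics feed the ranking function, instead of scanning every document.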
Based on the underlying Retrieval Model, different ranking functions have been developed to produce a score for the documents with respect to the query. Top-scored documents are returned first. In larger corpora, paging functions allow the subsequent retrieval of further documents. Classical retrieval models are, for instance, the Boolean Model, where only documents that exactly match a query are returned; since all documents in the retrieved set are equally relevant, it is not considered a ranking algorithm and is often used in search engines in combination with further retrieval models such as the Vector Space Model. In the Vector Space Model, documents are represented by vectors that consist of term weights; the similarity of documents and queries is determined by computing the distance between the vectors. Probabilistic Models are based on computing the probability of a document belonging to the relevant set. For languages where word boundaries are not given, e.g., East Asian languages, Language Models have to be applied to obtain a mathematical representation of the documents: the system analyzes the text documents by means of character-based sliding windows (n-grams) to determine word boundaries and compute statistics. All these classical retrieval models are keyword-based; thus, retrieval systems only return documents that exactly match the user query.
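A minimal sketch of the Vector Space Model: documents become TF-IDF weight vectors and are ranked by cosine similarity to the query. The smoothed IDF variant and the sample documents are our own illustrative choices, not a reference implementation.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """TF-IDF weight vectors with a smoothed IDF: tf * (log(N / df) + 1)."""
    N = len(docs)
    tokenized = {d: text.lower().split() for d, text in docs.items()}
    df = Counter()  # document frequency per term
    for tokens in tokenized.values():
        df.update(set(tokens))
    return {
        d: {t: tf * (math.log(N / df[t]) + 1.0) for t, tf in Counter(tokens).items()}
        for d, tokens in tokenized.items()
    }

def cosine(u, v):
    """Cosine similarity between two sparse vectors given as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

docs = {
    "d1": "earthworm burrows affect water infiltration",
    "d2": "groundwater microbial diversity in karst aquifers",
    "d3": "infiltration and transport processes during rainfall",
}
vectors = tfidf_vectors(docs)
query = dict(Counter("water infiltration during rainfall".split()))  # raw term counts
ranked = sorted(docs, key=lambda d: cosine(query, vectors[d]), reverse=True)
print(ranked)  # d2 shares no query term and is ranked last
```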
Evaluation in Information Retrieval
When setting up a retrieval system, various design decisions influencing different parts of the system have to be made. Examples of such decisions are whether to stem terms in the pre-processing phase or which terms to include in the stopword list.
Numerous evaluation measures have been developed to determine the effectiveness of such systems, i.e., the accuracy of the result returned by a given retrieval algorithm. For this purpose, a test collection is required that consists of three components: (1) a corpus of documents, (2) representative information needs expressed as queries, and (3) a set of relevance judgments provided by human judges containing assessments of the relevance of a document for given queries. If judgments are available for the entire corpus, they serve as a baseline (“gold standard”) and can be used to determine how many relevant documents a search system finds for a specific topic.
User queries should be representative of the target domain. Queries are either obtained from query logs of a similar application, or domain users are asked to provide example queries. The number of example questions influences the evaluation result. TREC (Text REtrieval Conference) is a long-running, very influential annual Information Retrieval competition that considers different retrieval issues in a number of Tracks, e.g., the Genomics Track or the Medical Track (https://trec.nist.gov/). Various TREC experiments have shown that the number of queries used for the evaluation matters more than the number of documents judged per query. Therefore, TREC experiments usually consist of around 150 queries (so-called “topics”) per track.
Common evaluation metrics with respect to effectiveness are Precision and Recall (PR), the F-Measure, and Mean Average Precision (MAP). Precision denotes which fraction of the documents in the result set is relevant for a query, whereas Recall describes which fraction of all relevant documents was successfully retrieved. Both metrics are based on binary judgments, i.e., raters can only determine whether a document is relevant or non-relevant. The F-Measure is the harmonic mean of Precision and Recall.
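Under binary judgments, these three measures reduce to simple set arithmetic over the sets of retrieved and relevant documents; the document IDs in the sketch below are invented.

```python
def precision_recall_f1(retrieved, relevant):
    """Compute Precision, Recall, and F-Measure from binary judgments."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents actually retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# 4 documents retrieved, 3 relevant overall, 2 of them found:
p, r, f = precision_recall_f1(retrieved=["d1", "d2", "d3", "d4"],
                              relevant=["d2", "d4", "d7"])
print(p, r, f)  # 0.5, 2/3, 4/7
```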
Precision and Recall can only be used when a gold standard is provided that contains the total number of documents in a corpus that are relevant for a query. However, in applied domains where corpora are established specifically for a particular research field, gold standards are usually not available. Therefore, with recall being unknown, MAP only requires ratings for the top-N-ranked documents to compute an average precision. The assumption here is that users are only interested in the first entries of a search result and usually do not navigate to the last page. The top-ranked documents receive higher scores than the lower-ranked ones. Another metric, proposed by Järvelin and Kekäläinen, is the Discounted Cumulated Gain (DCG), which uses a Likert scale as rating scheme and thus allows non-binary ratings; all entries of the scheme should be equally distributed. However, DCG does not penalize wrong results but only increases the scores of top-ranked documents. Other evaluation criteria concentrate on efficiency (e.g., the time, memory, and disk space required by the algorithm to produce the ranking), user satisfaction with the provided result set, and visualization.
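Both measures are compact to state in code. The sketch below uses one common top-N variant of average precision (averaging the precision values at the ranks where relevant documents occur) and the widespread log2(rank + 1) discount for DCG, which differs slightly from Järvelin and Kekäläinen's original formulation; rankings and gain values are invented.

```python
import math

def average_precision(ranking, relevant):
    """Mean of the precision values at each rank where a relevant document appears."""
    hits, precisions = 0, []
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def dcg(gains):
    """Discounted cumulated gain: graded gains discounted by log2(rank + 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

ap = average_precision(["d3", "d1", "d7", "d2"], relevant={"d1", "d2"})
print(ap)  # (1/2 + 2/4) / 2 = 0.5

# A ranking that places high-graded documents first scores better:
print(dcg([3, 2, 1]) > dcg([1, 2, 3]))  # True
```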
The missing aspect in evaluation approaches in Information Retrieval is the analysis of the underlying documents. The data source in classical Information Retrieval systems is unstructured text whereas Dataset Retrieval is based on structured metadata files. Hence, retrieval success depends on the metadata format used, the experience of the curator, and the willingness of the individual scholar to describe data properly and thoroughly. This structured information could be used in the search index. For instance, if researchers provide information such as taxon, length, location or experimental method in the metadata explicitly, a search application could offer a search over a specific metadata field. Thus, we argue that there is a need to analyze the given metadata and to quantify the gap between a scholar’s actual search interests and the metadata primarily used in search applications.
This section focuses on approaches that analyze, characterize and enhance dataset search. We discuss studies identifying users’ information needs and introduce existing question corpora. In a second part, we describe approaches that aim at improving dataset search.
In order to understand user behavior in search, query logs and question corpora are valid sources. Kacprzak et al. provide a comprehensive log analysis of three government open data portals from the United Kingdom (UK), Canada, and Australia and one open data portal with national statistics from the UK. They analyzed 2.2 million queries from logs provided by the data portals (internal queries) and 1.1 million queries issued to external web search engines (external queries). Two authors manually inspected a sample set of 665 questions and determined the main query topics. Most queries were assigned to Business and Economy (20% of internal queries, 10% of external queries) and Society (14.7% of internal queries, 18% of external queries). Besides query logs, Kacprzak et al. also analyzed explicit requests for data submitted by users via a form on the website. Here, users provided a title and description, which allowed the authors to perform a deeper thematic analysis on 200 manually selected data requests. It revealed that geospatial (77.5%) and temporal (44%) information occurred most often, frequently together with a specific granularity (24.5%), e.g., “hourly weather and solar data set” or “prescription data per hospital”. Users were also asked why they had requested data explicitly, and more than 40% indicated that they were not able to find relevant data via the provided search.
In the Life Sciences, Dogan et al. inspected one month of log data with more than 58 million user queries from PubMed, a platform providing biomedical literature. They randomly selected 10,000 queries for a semantic analysis. Seven annotators categorized the queries along 16 given categories, distinguishing between bibliographic queries (44%) containing information such as journal name, author name, or article title, and non-bibliographic queries with domain-specific categories. The most frequent category over all questions was “Author Name” (36%), followed by “Disorder” (20%), comprising diseases, abnormalities, dysfunctions, etc., and “Gene/Protein” (19%). Further main topics were abbreviations (mostly of genes/proteins) and chemicals/drugs.
A large study on user needs in biodiversity research was conducted in the GBIF community in 2009 [20, 5]. The aim was to determine what GBIF users need in terms of primary data and to identify data gaps in the data landscape at that time. More than 700 participants from 77 countries took part in the survey. It revealed that scholars used retrieved primary data for analyzing species diversity, taxonomy, and life histories/phenology. That mainly required “taxon names, occurrence data and descriptive data about the species”. As biodiversity is a rapidly changing research field, the authors recommend repeating content need assessments at frequent intervals.
Apart from query logs, question corpora are another source for identifying search interests. Usually, questions are collected from experts of a particular research field, and important terms representing main information needs are labeled with categories or so-called entity types. These manually generated annotations help in understanding what information users are interested in and in developing tools and services that either automatically extract these interests from text (Text Mining), retrieve relevant data (Information Retrieval), or provide an exact answer for an information need (Question Answering).
In the Life Sciences, question corpora for text retrieval have been mainly established in the medical and biomedical domains. One of the largest corpora in medicine is the Consumer Health Corpus , a collection of email requests (67%) received by the U.S. National Library of Medicine (NLM) customer service and search query logs (33%) of MedlinePlus, a consumer-oriented NLM website for health information. The final corpus consists of 2614 questions and has been integrated into the Medical Question Answering Task at TREC 2017 LiveQA . Six trained domain experts were involved in the annotation tasks to manually label information. The experts had to indicate named entities, e.g., problem, anatomy or measurement and labeled question topics such as the cause of a disease or complications (longer term effects of a disease).
A common question corpus in biomedicine is provided by the Genomics Track at the TREC conferences. The topics of its retrieval tasks are formulated as natural language questions and contain pre-labeled main categories, e.g., What [GENES] are involved in insect segmentation?. A further large question corpus in biomedicine was created for the BioASQ challenge, an annual challenge for researchers working on text mining, machine learning, information retrieval, and question answering. The tasks are split into three parts: (1) the extraction of main entities and their linkage with ontological concepts (semantic annotation), (2) the translation of natural language queries into RDF triples, and (3) the retrieval of the exact answer to a natural language query. The question corpus was created and annotated by a team of experts, selected with the goal of covering different ages and complementary expertise in the fields of medicine, biology, and bioinformatics. Each expert was asked to formulate questions in English that reflect “real-life information needs”. However, the type of questions to be formulated was restricted; e.g., the experts were instructed to provide questions of certain types typically considered in question answering systems (yes/no, factoid, etc.). These restrictions are justified to a certain degree, since they ensure the applicability of the resulting corpus for evaluating question answering approaches. However, they have an impact on which questions are formulated and how, which will likely lead to a bias in the question corpus.
Another question corpus in the biomedical domain is the benchmark developed for the 2016 bioCADDIE Dataset Retrieval Challenge. This benchmark was explicitly created for the retrieval of datasets based on metadata and includes 137 questions, 794,992 datasets gathered from different data portals in XML structure, and relevance judgments for 15 of the questions. Similar to the BioASQ challenge, domain experts were instructed on how to create questions. Based on templates, the question constructors formulated questions using the most desired entity types, namely data type, disease type, biological process, and organism.
At present, to the best of our knowledge, there is neither a public log analysis nor a question corpus available for biodiversity research. In order to understand genuine user interests and to improve current dataset retrieval systems, unfiltered information needs are crucial. Therefore, collecting current search interests from scholars is the first step in our top-down approach presented in Section “Objectives”.
A study by the RDA Data Discovery Group points out that most data repositories offer search applications based on metadata and utilize one of the existing and widely used search engines for data access, e.g., Apache Solr (http://lucene.apache.org/solr/) or elasticsearch (https://www.elastic.co/products/elasticsearch). Large data repositories such as GBIF, PANGAEA, or Zenodo also use elasticsearch and offer public search services. Apache Solr and elasticsearch are both keyword-based and return datasets that exactly match a user’s entered query terms. If the desired information need is not explicitly mentioned in the metadata, the search will fail.
In recent years, a variety of approaches have emerged to improve dataset search. A common approach is to annotate metadata with entities from schema.org (https://schema.org). Favored by Google and the RDA Discovery Task Group, the idea is to add descriptive information to structured data such as XML or HTML in order to increase findability and interoperability. These additional attributes help search engines to better disambiguate terms occurring in text. For example, Jaguar could be a car, an animal, or an operating system; by means of schema.org entities, data providers can define the context explicitly. Numerous extensions for specific domains have been developed or are still in development, e.g., bioschemas.org for the Life Sciences. Since Google launched the beta version of its dataset search in fall 2018 (https://toolbox.google.com/datasetsearch), schema.org entities have received more and more attention. Hence, data centers such as PANGAEA or Figshare are increasingly incorporating schema.org entities in their dataset search.
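As an illustration, a schema.org Dataset description embedded as JSON-LD in a dataset landing page might look as follows; the property names (name, description, keywords, temporalCoverage, spatialCoverage) are genuine schema.org terms, while all values are invented.

```python
import json

# Minimal schema.org Dataset description in JSON-LD; values are illustrative.
dataset_jsonld = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Groundwater invertebrate occurrences, example study site",
    "description": "Occurrence records of invertebrates sampled from groundwater wells.",
    "keywords": ["biodiversity", "groundwater", "invertebrates"],
    "temporalCoverage": "2015-01-01/2016-12-31",
    "spatialCoverage": {"@type": "Place", "name": "Hainich, Germany"},
}

# Data providers embed this block in a <script type="application/ld+json">
# tag so that crawlers can identify the page as a dataset description.
print(json.dumps(dataset_jsonld, indent=2))
```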
Other approaches favor an improved metadata schema. Pfaff et al. introduce the Essential Annotation Schema for Ecology (EASE). The schema was primarily developed in workshops and intensive discussions with scholars and aims to support scientists in search tasks. The MIBBI project (now known as the BioSharing or FAIRsharing portal - https://fairsharing.org/) also recognized that only improved metadata allow information seekers to retrieve relevant experimental data. They propose a harmonization of minimum information checklists in order to facilitate data reuse and to enhance data discovery across different domains. Checklist developers are advised to consider “‘cross-domain’ integrative activities” when creating and maintaining checklists. In addition, standards are supposed to contain information on the formats (syntax), vocabularies, and ontologies used.
The latter points to an increasing interest in semantic techniques that have emerged over the past decade. Vocabularies such as the Data Catalog Vocabulary (DCAT)  or the Vocabulary of Interlinked Datasets (VoID)  aim to describe datasets semantically in RDF  or OWL  format based on subject, predicate, and object triples. Fully semantic approaches such as BioFED  offer a single-point-of-access to 130 SPARQL endpoints in the Life Sciences. They integrate a variety of heterogeneous biomedical ontologies and knowledge bases. Each data source is described by VoID descriptors that facilitate federated SPARQL query processing. The user interface permits simple and complex SPARQL queries and provides support in creating federated SPARQL queries. The result set contains provenance information, i.e., where the answer has been found, “the number of triples returned and the retrieval time”. However, improvements in the user interface still remain necessary. As BioFED is mainly focused on a linked data approach, it requires all data sources to be stored in semantic formats and users to have at least basic SPARQL knowledge. In contrast, Kunze and Auer  consider the search process in their search over RDF datasets as an exploratory task based on semantic facets. Instead of SPARQL queries or keyword-based user interfaces, they provide parameters for filtering. This allows an unambiguous search and returns relevant datasets that match the provided filter parameters.
Other federated approaches outside semantic techniques attempt to align heterogeneous data sources in one search index. This allows the use of conventional search engines and keyword-based user interfaces: DataONE is a project aiming to provide access to earth and environmental data held by multiple member repositories. Participating groups can provide data in different metadata formats such as EML, DataCite, or FGDC. DataONE is currently working on quantifying FAIR: their findability check determines whether specific metadata items such as title, abstract, or publication date are present; for title and abstract, they additionally check length and content. Based on these criteria, they evaluated their data and found that, concerning Findability, around 75% of the available metadata fulfilled the self-created criteria. The German Federation for Biological Data (GFBio) is a national infrastructure for research data management in the green Life Sciences and provides a search over more than six million heterogeneous datasets from environmental archives and collection data centers. It was extended to a semantic search that allows searching over scientific names, common names, or other synonyms. These related terms are obtained from GFBio’s Terminology Service and are added to a user’s query in the background.
As described above, numerous approaches have been proposed and developed to improve dataset search. However, what is lacking is a comprehensive analysis on what exactly needs to be improved and how large the actual gap is between user requirements and given metadata.
Current retrieval evaluation methods are primarily focused on improving retrieval algorithms and ranking. Question corpora and documents are therefore taken as given and are not questioned. However, if the underlying data do not contain the information users are looking for, even the best retrieval algorithm will fail. We argue that, in dataset search, metadata - the basic source for dataset applications - need to be adapted to match users’ information needs.
We want to find out how large the gap in biodiversity research is between actual user needs and provided metadata and how to overcome this obstacle. Thus, the following analysis aims to explore:
What are genuine user interests in biodiversity research?
Do existing metadata standards reflect information needs of biodiversity scholars?
Are metadata standards utilized by data repositories useful for data discovery? How many metadata fields are filled?
Do common metadata fields contain useful information?
We take a top-down approach starting from scholars’ search interests, then looking at metadata standards and finally inspecting the metadata provided in selected data repositories.
(A) First, we generate an annotated question corpus for the biodiversity domain: We gather questions from scholars, explore the questions and identify information categories. In an online evaluation, domain experts assign these categories to terms and phrases of the questions (Section “A - Information Needs in the Biodiversity Domain”).
(B) We inspect different metadata standards in the Life Sciences and compare the metadata elements to the identified search categories from (A) (Section “B - Metadata Standards in the Life Sciences”).
(C) We analyze the application programming interfaces (APIs) of selected data repositories to figure out what metadata standards are used and how many elements of a metadata schema are utilized for data description (Section “C - Metadata Usage in Selected Data Repositories”).
(D) We discuss how to bridge the gap between users’ search interests and metadata. We propose an approach to overcome the current obstacles in dataset search (Section “D - Discussion”).
A - Information Needs in the Biodiversity Domain
Question corpora are common sources for getting an impression of what users are interested in within a particular domain. Therefore, we asked biodiversity scholars to provide questions that are specific to their research. We analyzed the questions and identified search topics that represent scholarly information needs in this domain.
The following subsection describes the methodology in detail, divided into four paragraphs.
We gathered questions in three large biodiversity projects, namely CRC AquaDiva, GFBio and iDiv. We explicitly requested fully expressed questions to capture the keywords in their search context. These projects vary widely in their overall setting, the scientists and disciplines involved and their main research focus. Together, they provide a good and rather broad sample of current biodiversity research topics. In total, 73 scholars with various research backgrounds in biology (e.g., ecology, bio-geochemistry, zoology and botany) and related fields (e.g., hydro-geology) provided 184 questions. This number is comparable to related question corpora in Information Retrieval (e.g., bioCADDIE), which typically consist of around 100–150 questions. The scholars were asked to provide up to five questions from their research background. Questions varied with respect to granularity. The corpus contains specific questions, such as List all datasets with organisms in water samples! as well as questions with a broader scope, e.g., Does agriculture influence the groundwater?. We published the questionnaires that were handed out in AquaDiva and iDiv as supplementary material in our repository. In the GFBio project, questions were gathered via email and from an internal search evaluation. All questions were inspected by the authors with respect to comprehensibility. We discarded questions that were not fully understandable (e.g., missing verb, misleading grammatical structures) but left clear phrases in the corpus that were not fully expressed as a question. If scholars provided several questions, they were treated individually, even if terms referred to previous questions, e.g., Do have earthworm burrows (biopores) an impact on infiltration and transport processes during rainfall events? and Are the surface properties influencing those processes?. In these cases, no further adaptation towards comprehensibility was made.
The questions were also not corrected with respect to grammar and spelling, since changing the grammar could lead to an altered statement. We did not want to lose the original question statement. In some questions, abbreviations occurred without explanations. In these cases, we left the questions as they were and did not provide full terms, since these abbreviations can have various meanings in different biological fields. It was up to the domain experts to either look them up or to leave the term out. After the cleaning, the final corpus consists of 169 questions and is publicly available: https://github.com/fusion-jena/QuestionsMetadataBiodiv/tree/master/questions.
Boundaries of semantic categories are domain-dependent and fuzzy. However, in search, categories support users in finding relevant information more easily and should be valid across various research backgrounds. In a first round, two authors of this work analyzed the collected questions manually. Both have a research background in computer science and strong knowledge in scientific data management, in particular for biodiversity research. The corpus was split up, and each of them inspected around 50% of it and assigned broad categories independently of the other. Afterwards, this first classification was discussed in several sessions, which resulted in 13 categories. The naming was adapted to domain-specific denotations and ontologies. Furthermore, the categories were compared to EASE, a metadata schema that was primarily developed for an improved dataset retrieval in the field of ecology. This comparison revealed that there is an overlap with EASE but that we discovered further relevant categories. The final categories are:
ORGANISM comprises all individual life forms including plants, fungi, bacteria, animals and microorganisms.
All species live in certain local and global ENVIRONMENTS such as habitats, ecosystems (e.g., below 4000 m, ground water, city) and
have certain characteristics (traits, phenotypes) that are summarized with QUALITY & PHENOTYPE, e.g., length, growth rate, reproduction rate, traits.
Biological, chemical and physical PROCESSES are re-occurring and transform materials or organisms due to chemical reactions or other influencing factors.
EVENTS are processes that appear only once at a specific time, such as environmental disasters, e.g., Deepwater Horizon oil spill, Tree of the Year 2016.
Chemical compounds, rocks, sand and sediments can be grouped as MATERIALS & SUBSTANCES.
ANATOMY comprises the structure of organisms, e.g., body or plant parts, organs, cells, and genes.
METHOD describes all operations and experiments that have to be conducted to lead to a certain result, e.g., lidar measurements, observation, remote sensing.
Outcomes of research methods are delivered in DATA TYPE, e.g., DNA data or sequence data is the result of genome sequencing, lidar data is the result of lidar measurements (active remote sensing).
All kinds of geographic information is summarized with LOCATION, e.g., Germany, Hainich, Atlantic Ocean, and
temporal data including date, date times, and geological eras are described by TIME, e.g., current, over time, triassic.
PERSON & ORGANIZATION are either projects or authors of data.
As reflected in the search questions, scholars in biodiversity are highly interested in HUMAN INTERVENTION on landscape and environment, e.g., fishery, agriculture, and land use.
For the evaluation with domain experts, we added two more categories, namely OTHER and NONE. The former permits defining an own category if none of the given ones is appropriate. NONE applies if the term is not relevant, if the domain expert does not know the term, or if the phrase is too fuzzy and cannot be classified into one category.
An annotation process usually has two steps: (1) the identification of terms based on annotation rules and (2) the assignment of an appropriate category in a given context. Usually, an annotator, i.e., a domain expert trained in the annotation guidelines, carries out both tasks. However, we argue that training is somewhat biased and influences annotators in their classification decisions. This is an obstacle in search, where intuitive feedback on category assignment is required. Hence, we split up the annotation process: two scholars who collected the questions and are familiar with the guidelines conducted the identification, whereas domain experts only received short instructions and assigned categories. Our annotation guidelines, needed to identify the phrases and terms (artifacts) to label, are available as supplementary material in our repository.
Annotators and Annotation Process:
Nine domain experts (8 Postdocs, 1 Project Manager) with expertise in various biological and environmental sciences participated in the classification task. All of them have experience in ecology but in addition, each of them has individual research competence in fields such as bio-geography, zoology, evolutionary biology, botany, medicine, physiology, or biochemistry.
For the category assignment, all scholars received a link to an online survey with explanations of the categories (including examples) and short instructions on how to classify the artifacts. A screenshot of the survey is presented in Figure 2. The purpose of this evaluation (the improvement of dataset retrieval systems) was also explained to them. Multi-labeling was not allowed; only one category was permitted per artifact. Should there be no proper category, they were advised to select OTHER and, if possible, to provide an alternative category. If they did not know a term or phrase, they could decide either to look it up or to omit it. The latter also applied if they considered a phrase or term to be not relevant or too complicated and fuzzy. As we wanted to obtain intuitive feedback, the experts were told not to spend too much time on the classification decision but to determine categories according to their knowledge and research perspective. The annotators also had the opportunity to skip an artifact; in this case, the category NONE was applied. For each question, annotators had the opportunity to provide a comment.
We decided to use a combination of csv files, Python scripts and Limesurvey to support the annotation process. Details on this process can be found in the supplementary material in our repository.
We analyzed the user responses to determine whether the identified information categories are comprehensive and representative for biodiversity research. We computed the inter-rater agreement per artifact to determine the category that best describes an artifact.
Representativeness of the Categories
In order to verify completeness we determined the fraction of artifacts assigned to the category OTHER, i.e., if the experts deemed none of the given categories as appropriate. Figure 3 depicts the frequency of information categories and how often they were selected by the domain experts. As it turned out, the category OTHER was selected by at least expert per artifact for of the phrases and terms and by at least experts for . The fraction of phrases for which at least experts selected the category OTHER was . If at least two domain experts agree that there is no proper category for a given phrase, it is a strong indicator for a missing category or a misinterpretation. This is the case for out of all annotated artifacts. Hence, the coverage of the identified information categories is still high.
However, there might be various reasons why none of the given categories fit: (1) The phrase or term to be annotated was unknown to the annotator such as shed precipitation. (2) Frequently, phrases that refer to data attributes (e.g., soil moisture, oxygen uptake rate or amount of rain) and which were supposed to be covered by the category QUALITY, were classified as OTHER. As alternative category, the annotators proposed “Parameter” or “Variable”. When adding these ratings to the QUALITY category, the results for the OTHER category decreased to //. That strongly indicates that renaming the QUALITY category or adding synonyms would increase comprehensibility significantly. (3) The category OTHER was often chosen for terms used in questions with a broader scope in order to express expected results. However, since this is often vague, scholars tend to use generic terms such as signal, pattern, properties, structure, distribution, driver or diversity. Hence, further discussions in the biodiversity research community are needed to define and classify these terms.
In addition, we wanted to know whether there were categories that were rarely or never used by the annotators, which would indicate a low relevance for biodiversity research. As depicted in Figure 3, the categories ENVIRONMENT, ORGANISM, MATERIAL & SUBSTANCES, QUALITY, PROCESS, LOCATION and DATA TYPE were selected most frequently (assigned to more than of the phrases). Information related to these categories seems to be essential for biodiversity research. Although there were categories that were rarely chosen (PERSON & ORGANIZATION and TIME), no category remained completely unused.
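The completeness check described above, i.e., the fraction of artifacts for which at least a given number of experts selected OTHER, can be sketched as follows; the per-artifact category counts are hypothetical:

```python
def fraction_with_other(annotations, min_raters):
    """Fraction of artifacts for which at least `min_raters`
    annotators selected the category OTHER."""
    hits = sum(
        1 for counts in annotations
        if counts.get("OTHER", 0) >= min_raters
    )
    return hits / len(annotations)

# Hypothetical per-artifact category counts from 9 annotators:
annotations = [
    {"ORGANISM": 9},
    {"OTHER": 3, "QUALITY": 6},
    {"OTHER": 1, "LOCATION": 8},
    {"ENVIRONMENT": 5, "NONE": 4},
]
print(fraction_with_other(annotations, 2))  # → 0.25
```

Raising `min_raters` from one to two filters out artifacts where a single annotator’s choice of OTHER may just reflect an unknown term rather than a missing category.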
Consensus of the Categories
In statistics, the consensus describes how much homogeneity exists in ratings among domain experts. We determined the inter-rater agreement and inter-rater reliability using Fleiss’ Kappa (κ statistics) and GWET’s AC. In general, the inter-rater reliability computes the observed agreement among raters “and then adjusts the result by determining how much agreement could be expected from random chance”. κ values vary between -1 and 1, where values less than 0 denote poorer than chance agreement and values greater than 0 denote better than chance agreement. As suggested by Landis and Koch, values below 0.40 indicate fair agreement beyond chance, values between 0.40 and 0.60 moderate agreement, values between 0.60 and 0.80 substantial agreement, and values higher than 0.80 almost perfect agreement. However, κ statistics can lead to a paradox: when the distribution of the raters’ scores is unbalanced, the correction for the chance agreement can result in negative values even if the observed agreement is very high. Since this is the opposite of what is expected, a new and more robust statistic has emerged, GWET’s AC. GWET’s AC considers the response categories in the agreement by chance, and its values can also range from -1 to 1.
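As a minimal sketch of how such an agreement statistic is computed (Fleiss’ Kappa only; GWET’s AC differs in how the chance-agreement term is estimated), with hypothetical rating counts:

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of per-item category counts.

    `ratings` is a list of dicts mapping category -> number of raters
    who chose it; every item must have the same total number of raters.
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0].values())
    categories = {c for item in ratings for c in item}

    # P_i: proportion of agreeing rater pairs per item
    p_items = [
        (sum(v * v for v in item.values()) - n_raters)
        / (n_raters * (n_raters - 1))
        for item in ratings
    ]
    p_bar = sum(p_items) / n_items

    # P_e: chance agreement from the marginal category proportions
    p_cat = {
        c: sum(item.get(c, 0) for item in ratings) / (n_items * n_raters)
        for c in categories
    }
    p_e = sum(p * p for p in p_cat.values())

    return (p_bar - p_e) / (1 - p_e)

# Three artifacts, each labeled by 9 annotators (hypothetical counts):
items = [
    {"ORGANISM": 9},                      # perfect agreement
    {"LOCATION": 7, "ENVIRONMENT": 2},
    {"QUALITY": 4, "OTHER": 3, "NONE": 2},
]
print(round(fleiss_kappa(items), 3))  # → 0.523
```

The paradox mentioned above arises from the P_e term: with a very skewed marginal distribution, P_e can approach P̄, driving κ down even when raters agree on almost every item.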
With a Fleiss’ Kappa of and GWET’s AC of the agreement of the annotators over all categories was moderate. Considering the QUALITY correction, the values increase slightly to for Fleiss’ Kappa and to GWET’s AC. Figure 6a reveals a more detailed picture. It shows the Fleiss’ Kappa for the individual information categories with QUALITY correction. The agreement among the experts was excellent for the categories TIME and ORGANISM and intermediate to good for the categories PERSON & ORGANIZATION, LOCATION, PROCESS, MATERIALS & SUBSTANCES and ENVIRONMENT. The experts’ agreement for the categories EVENT, HUMAN INTERVENTION, ANATOMY, DATA TYPE, METHOD and QUALITY was fair. This lack of agreement can either point to a different understanding of the categories or might indicate that the categorization of the phrase itself was difficult since some phrases, in particular longer ones with nested entities, were fuzzy and difficult to classify in one category. In the latter case, the annotators were advised not to choose a category for that phrase. Our results show, that for of the phrases at least annotators did not provide a category. The fraction of phrases where or more annotators did not choose a category was below . This points out that annotators in fact interpreted the categories with poor agreement differently. This correlates with our results regarding the category QUALITY. For the categories EVENT, HUMAN INTERVENTION, ANATOMY, DATA TYPE, METHOD there is no such evidence. This should be discussed and reconsidered with biodiversity experts.
Comparison of short and long artifacts
We also analyzed the influence of longer artifacts on the result. Table 1 presents both agreement statistics for artifacts with one term, two terms, and three and more terms, including the quality correction. As assumed, the longer an artifact is, the more difficult it is to assign an unambiguous category.
| Overall | One Term | Two Terms | Three Terms |
Figure 6b depicts a more detailed picture of the individual categories for artifacts with one and two terms. Since artifacts with three and more terms resulted in low coefficients, we left them out in this analysis. One-term artifacts achieved an excellent agreement for the categories ORGANISM, TIME and LOCATION and a moderate agreement for ENVIRONMENT, MATERIAL, PROCESS and DATA TYPE. It is striking that PERSON results in a negative value with a poor agreement. Since full person names usually contain two terms, there were no artifacts with one term that could be assigned to PERSON & ORGANIZATION. However, looking at the results for two terms per artifact, the PERSON category reaches an excellent agreement, as does ORGANISM. Surprisingly, PROCESS achieved a substantial agreement for two terms, pointing out that biological and chemical processes are apparently mainly denoted by two terms. The same effect, a larger agreement for two terms than for one, can also be observed for the categories EVENT and HUMAN INTERVENTION. DATA TYPE achieved a moderate agreement for both one and two terms.
All provided categories were used by the annotators to label the artifacts in the questions. However, what stands out is the high frequency of the category OTHER: for 45% of the 592 annotations, at least one domain expert did not assign one of the given categories but selected OTHER. That points to missing interests that are not represented by the given classes. In terms of consensus, seven information categories reached a moderate agreement (κ ≥ 0.4), and five of these seven were also mentioned very often, namely ENVIRONMENT (e.g., habitats, climate zone, soil, weather conditions), MATERIAL (e.g., chemicals, geological information), ORGANISM (species, taxonomy), PROCESS (biological and chemical processes) and LOCATION (coordinates, altitude, geographic description) (Figure 7). We conclude that these classes are important search interests for biodiversity research.
In comparison to the outcome of the content assessment analysis conducted in the GBIF community  in 2009, the assumption that user interests change over time has been confirmed. Species are still an important category scholars are interested in, however, further important topics for the acquisition and description of ecosystem services are emerging.
We are aware that this result is not complete and leaves room for improvement. Some category names were misleading and confused the annotators. That is reflected in the fair or poor agreement for some categories such as QUALITY (data parameters measured) or DATA TYPE (nature or genre of the primary data). Here, the research community should discuss how these categories could be further considered in search, e.g., by renaming or merging them. Since the backgrounds of the annotators were quite diverse and no training took place, we did not expect completeness or perfect agreement. We wanted to get a genuine, unbiased first picture of biodiversity scholars’ comprehension when looking for scientific data. In biology and biodiversity research, scholars use a specialized language with diverse content and imprecise and inconsistent naming [69, 3]. Hence, labeling and extracting biological entities remain a challenge. Therefore, our thresholds for agreement (κ ≥ 0.4) and frequency are not as high as in similar studies in bio-medicine.
Concerning the shortened methodology for the evaluation, our assumptions have been confirmed. It saved a lot of time that only a few people identified the artifacts to be labeled and that the domain experts only assigned categories. On average, the domain experts spent between two and three hours labeling the 169 questions. We conclude that our shortened annotation approach is suitable for opening up new domains and getting insights into what scholars are interested in. If the aim is to achieve a higher agreement per annotation, we recommend training sessions and trial rounds. However, it should be considered that in this case the unbiased feedback gets lost.
For further reuse of the annotated question corpus, our analysis script also produces an XML file with all questions and annotations above a certain agreement threshold that can be set as a parameter. By default, all annotations per question with an agreement above this threshold are returned.
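A minimal sketch of such an export step follows; the XML structure, element names and sample values are illustrative assumptions, not our script’s exact output format:

```python
import xml.etree.ElementTree as ET

def export_annotations(questions, threshold=0.6):
    """Serialize all annotations whose inter-rater agreement meets
    `threshold` to an XML tree (structure is illustrative)."""
    root = ET.Element("questions")
    for q in questions:
        q_el = ET.SubElement(root, "question", id=str(q["id"]))
        ET.SubElement(q_el, "text").text = q["text"]
        for ann in q["annotations"]:
            if ann["agreement"] >= threshold:
                # keep only annotations above the agreement threshold
                ET.SubElement(
                    q_el, "annotation",
                    category=ann["category"],
                    agreement=str(ann["agreement"]),
                ).text = ann["phrase"]
    return ET.tostring(root, encoding="unicode")

questions = [{
    "id": 1,
    "text": "Does agriculture influence the groundwater?",
    "annotations": [
        {"phrase": "agriculture", "category": "HUMAN INTERVENTION", "agreement": 0.78},
        {"phrase": "groundwater", "category": "ENVIRONMENT", "agreement": 0.55},
    ],
}]
xml = export_annotations(questions, threshold=0.6)
print(xml)
```

With the threshold set to 0.6, only the HUMAN INTERVENTION annotation survives in this example; lowering the parameter would also include the ENVIRONMENT annotation.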
B - Metadata Standards in the Life Sciences
In this section, we describe a selection of existing metadata standards and investigate whether their elements reflect the identified information categories.
Metadata describe scientific primary data such as experiments, tabular data, images, sound and acoustic files in a structured format, e.g., XML or JSON.
Metadata follow a schema, typically stored in an XSD file, which outlines which elements and attributes exist and which of them are mandatory and/or repeatable. Many schemes employ vocabularies or ontologies to ensure that the metadata use the same names for concepts. In order to become a metadata standard, a schema needs to be formally adopted by a standards organization such as the International Organization for Standardization, https://www.iso.org.
There are a variety of metadata standards for different research fields. Table 2 presents a list of 13 metadata standards used in data repositories for the Life Sciences. All metadata standards were obtained from re3data. We filtered for “Life Sciences” and retrieved a list of 25 standards. The categories Other and Repository-Developed Metadata Schema were left out. The MIBBI standard is outdated and has been integrated into ISA-Tab, so we left it out, too. All other standards that were used in at least five repositories have been selected.
We compared them along the focused Domain, the Number of Elements in the standard and the Mandatory Fields (Table 3). The standards are ranked by the number of data repositories supporting them. The number of elements and required fields was either stated on the standard’s website or obtained from the schema. We also examined whether there is support for the Semantic Web, namely whether the standard supports RDF or OWL formats. According to the FAIR principles, community standards, semantic formats, and ontologies ensure interoperability and data reuse. The last two columns denote whether the standard is still maintained and provide some examples of data repositories that support the respective standard.
Dublin Core (http://dublincore.org/documents/dces/): a widely used generic metadata standard offering basic fields for the description of research data.
DDI (https://www.ddialliance.org/): the DDI (Document, Discover, Interoperate) standard addresses metadata from questionnaires and surveys in the social, behavioral, economic and health sciences.
DataCite (https://schema.datacite.org): relates to generic research data and comprises a set of mandatory, recommended and optional properties.
ISO19115 (https://www.iso.org/standard/53798.html): includes the identification, the extent, the quality, the spatial and temporal schema, spatial reference, and distribution of digital geographic data.
DIF (https://gcmd.nasa.gov/DocumentBuilder/defaultDif10/guide): the Directory Interchange Format is the US predecessor of ISO19115 and focuses on the description of geospatial metadata.
FGDC/CSDGM (https://www.fgdc.gov/metadata/csdgm-standard): the Federal Geographic Data Committee Content Standard for Digital Geospatial Metadata is a legacy national standard for geospatial data developed in the United States. FGDC now encourages its research community to use the international ISO standards.
EML (https://knb.ecoinformatics.org/): the Ecological Metadata Language is a series of XML document types that can be used in a modular and extensible manner to document ecological data.
Darwin Core (https://dwc.tdwg.org/): provides metadata fields for sharing biodiversity data. It is primarily based on taxa, their occurrence in nature and related information.
RDF Data Cube (https://www.w3.org/TR/vocab-data-cube/): a vocabulary that aims to describe statistical data. The model is compatible with the cube model that underlies the Statistical Data and Metadata eXchange standard (SDMX, https://sdmx.org/), an ISO standard for exchanging and sharing statistical data and metadata among organizations.
ISA-Tab (https://isa-specs.readthedocs.io): the ISA specification is not a standard but a metadata framework that addresses the description and management of biological experiments. It comprises three core entities to capture experimental metadata: Investigation (the project context), Study (a unit of research) and Assay (analytical measurements).
ABCD (https://github.com/tdwg/abcd): the Access to Biological Collection Data standard aims to share biological collection data. It offers a variety of metadata fields to describe specimens and observations, and it is compatible with numerous existing standards.
CF (http://cfconventions.org/): the Conventions for Climate and Forecast Metadata comprise geophysical quantities to describe climate and forecast data.
DCAT: the Data Catalog Vocabulary facilitates interoperability between data catalogs on the web and allows dataset search across sites.
| Standard Name | Domain | Elements | Mandatory Elements | Semantic Support | Maintenance | Examples |
| Dublin Core | general | 15 | No | Yes (RDFS) | Yes | Pangaea, Dryad, GBIF, Zenodo, Figshare |
| DDI | questionnaires and surveys in the social, behavioral, economic, and health sciences | 1154 | 7 | No | Yes | Dataverse |
| DataCite | general research data | 19 (57) | 5 | No | Yes | Pangaea, Zenodo, Figshare, Radar |
| ISO19115 | geospatial data | N/A | 7 | No | Yes | Pangaea, NSF Arctic Data Center, coastMap |
| FGDC/CSDGM | geographic information | 342 | 74 | No | No (1998, last update: 2002) | Dataverse, NSF Arctic Data Center |
| EML | ecological data | N/A | N/A | No | Yes | GBIF, GFBio, SNSB, Senckenberg, WORMS, NSF Arctic Data Center |
| Darwin Core | biodiversity data | 184 | No | Yes (RDF) | Yes | GFBio, GBIF, VertNet, Atlas of Living Australia, WORMS |
| RDF Data Cube | statistical data | 36 | N/A | Yes | Yes | Dryad (only RDF with DublinCore) |
| ISA-Tab | biological experiments | N/A | Yes (11 blocks) | Yes | Yes | Data Inra, GigaDB |
| DIF | geospatial metadata | 34 (219) | 8 | No | Yes | Pangaea, Australian Antarctic Data Center, Marine Environmental Data Section |
| CF | climate and forecast | 4798 / 54 / 70 (lines in the standard table) | No | No | Yes | WORMS, NSF Arctic Data Center, coastMap |
| ABCD | biological collection data | 1418 | 20 | No | Yes (ABCD 3.0) | GBIF, BioCase Network |
| DCAT | data catalogs, data sets | 16 | N/A | Yes | Yes | Data.gov.au, European Data Portal |
N/A denotes that the information was not available.
The standard supported by most repositories is Dublin Core, a general metadata standard based on 15 fields, such as contributor, coverage, creator, date, description, format, and identifier. In addition, data repositories utilize further domain-specific standards with richer vocabulary and structure such as ISO19115 for geospatial data or EML for ecological data. The RDF Data Cube Vocabulary is not used by any of the data centers.
We suppose that the abbreviation RDF DC might lead to a misunderstanding (DublinCore instead of RDF Data Cube).
All standards provide elements that can be described along the questions Who? What? Where? When? Why? and How?. In particular, contact person, collection or publication date, and location are covered by one or several metadata fields in all standards. In order to describe the main scope of the primary data, all standards offer numerous metadata fields but differ in their granularity. While simple ones such as Dublin Core only offer fields such as title, description, format, and type, standards with more elements such as EML or ABCD also offer fields for scientific names, methods and measured data attributes. EML additionally allows scholars to define the purpose of the study, making it the only standard that supports the Why question.
Data reuse and citation also play an important role. As demanded by the Joint Declaration of Data Citation Principles and practical guidelines for data repositories, all standards provide several elements for digital identifiers, license information and citation. In addition, some standards provide elements for data quality checks. For instance, ISO19115 offers a container for data quality including lineage information, and EML supports quality checks with the qualityControl element. Surprisingly, 52 repositories stated that they use self-developed metadata schemes, which indicates that a considerable number of data repositories are not satisfied with the existing metadata landscape and have therefore started developing their own schema.
For our further analysis, we selected 12 out of the 13 standards shown in Table 3. Since DDI is a standard that was mainly developed for questionnaires and surveys, we decided not to use it.
In our second analysis, we compared the information categories with the elements of the metadata schemes to determine whether search interests can be explicitly described with metadata elements.
Our results are presented in Table 4. For the sake of completeness, we explored all categories from the previous analysis but marked those with only a fair agreement (κ < 0.4) with an asterisk. The categories are sorted by frequency from left to right. A red cell denotes that no element is available in the standard to express the category, orange indicates that only a generic field could be used to describe the category, and light orange implies that one or more elements are available in the standard for this search interest.
| Not provided | Unspecific (generic element) | Available (one or more elements) |
There is no schema that covers all categories. Since the interests are obtained from scholars with various and heterogeneous research backgrounds, this was also not to be expected. Some standards such as ABCD or DarwinCore are discipline-specific and therefore, mainly provide elements that support the respective domain (e.g., collection data).
Apart from HUMAN INTERVENTION, all categories are covered by different metadata schemes. In particular, ISA-Tab followed by ABCD, DarwinCore and EML are frameworks and metadata schemes with elements that cover most of the search interests of biodiversity researchers. EML provides numerous fields to describe ecological data including elements for environmental information (studyAreaDescription), species (taxonomicCoverage) and research methods used (methods). However, important search preferences such as materials (including chemicals) and biological and chemical processes are only explicitly supported by ISA-Tab. Widely used general standards such as DublinCore or DataCite offer at least a general field (dc:subject, subject) that could be used to describe the identified search categories. In DublinCore, at least one metadata field each is provided to describe geographic information, e.g., where the data have been collected (LOCATION), the type of the data (DATA TYPE), the creator and contributor (PERSON & ORGANIZATION) and when it was collected or published (TIME). However, one field is often not enough to distinguish if the provided field is for instance a collection date or publication date, or if the creator of the dataset is also the same person that collected the data. In contrast, DataCite provides individual fields for publication year and the date field can be used with dateType="Collected" to specify a collection date. The metadata field contributor can also be extended with a type to indicate whether the contact details belong to the data collector or the project leader. Bounding box elements are also provided to enter geographic coordinates (LOCATION).
The question that still remains to be answered is whether these detailed metadata standards are actually used by data repositories.
C - Metadata Usage in Selected Data Repositories
In the following analysis, we examine what metadata standards are used in selected data repositories and how many schema elements are actually filled. In a second part, we explore, if descriptive fields of selected files contain data that might be relevant for information seekers.
Scholarly publishers increasingly demand that scientific data be submitted along with publications. Since publishers usually do not host the data themselves, they ask scholars to upload the data to one of the repositories for their research domain. Based on Nature’s list of recommended data repositories, we selected five archives for our further analysis: three generalist ones (Dryad, Figshare and Zenodo) and two domain-specific ones (PANGAEA - environmental data, GBIF - taxonomic data). In the biodiversity projects we are involved in, scholars also often mention these repositories as the ones they mainly use.
The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a client/server architecture primarily developed for providing and consuming metadata. Data repositories are required to expose their metadata at least in the Dublin Core (oai_dc) format and may also support further metadata formats. Metadata consumers, e.g., other institutions or data portals, harvest that data via the provided services on the OAI-PMH server in order to integrate or reuse it in their own services. The OAI-PMH protocol comprises a set of six services (verbs) that are accessible via HTTP. Requests for metadata can be based on a date stamp range or can be restricted to named sets defined by the provider.
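As an illustration, such OAI-PMH requests can be composed in a few lines of Python; the endpoint URL and the date range below are merely examples (Zenodo's public OAI-PMH endpoint), not part of the study's harvesting setup:

```python
from urllib.parse import urlencode

# The six OAI-PMH verbs defined by the protocol.
OAI_VERBS = {"Identify", "ListMetadataFormats", "ListSets",
             "ListIdentifiers", "ListRecords", "GetRecord"}

def build_oai_request(base_url, verb, **params):
    """Build an OAI-PMH request URL for one of the six protocol verbs."""
    if verb not in OAI_VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    return base_url + "?" + urlencode({"verb": verb, **params})

# Harvest Dublin Core records within a date stamp range:
url = build_oai_request(
    "https://zenodo.org/oai2d",       # example endpoint
    "ListRecords",
    metadataPrefix="oai_dc",          # the mandatory Dublin Core format
    **{"from": "2019-05-01"},         # 'from' is a reserved word in Python
    until="2019-05-31",
)
```

The same builder covers selective harvesting by set (`set="..."`) and paging via `resumptionToken`, both defined by the protocol.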
We parsed all available metadata from Figshare, Dryad, GBIF, PANGAEA and Zenodo in May 2019 via their respective OAI-PMH interfaces. GBIF only offers the metadata of its datasets in the OAI-PMH interface; the individual occurrence records, which are provided in the Darwin Core metadata schema and belong to a dataset, are only available through the search. Hence, we only analyzed the metadata of the datasets.
Our script parses the metadata fields of all public records per metadata schema for each of the selected data repositories (Table 5). Apart from the metadata standards introduced in the previous section, a few more standards appear in this list. OAI-DC denotes DublinCore as exposed via OAI-PMH, a mandatory standard in OAI-PMH interfaces. QDC means qualified DublinCore and denotes an extended schema whose elements extend or refine the core elements. ORE (The Open Archives Initiative Object Reuse and Exchange, OAI-ORE) is a standard for exchanging aggregations of web resources. It can be used together with other semantic standards such as RDF to group individual web resources. We also considered Pan-MD, a metadata schema developed by PANGAEA. It extends basic metadata with more fine-grained geographic information such as bounding boxes and adds information on data collection, ranging from projects, parameters, methods and sensors to taxonomy and habitats.
MARC21, MARCXML and METS are metadata standards that are mainly used for bibliographic data in digital libraries. Hence, we left them out of our further explorations. We also did not consider the Common European Research Information Format (CERIF) and ORE as they are not focused on describing primary data but on research entities and their relationships and on grouping web resources, respectively. However, we decided to include all available repository-developed schemes for the Life Sciences, such as Pan-MD, in order to get an impression of how repositories extend metadata descriptions.
Per metadata file, we inspected which elements of the metadata standards are used, and we saved their presence (1) or non-presence (0). The result is a csv file per metadata schema that contains dataset IDs and the metadata elements used. All generated files are stored in separate folders per repository and metadata format. Each request to a repository returns an XML body that includes several metadata files as records. Each record is split into two sections, a header and a metadata section. The header section comprises general information such as the ID of the record and a date stamp. The metadata section contains elements of the metadata schema, e.g., the names of the contributors, the abstract and the publication year. Unused metadata fields are not included in the response. We saved a boolean value encoding whether a metadata field was used or not. The source code and documentation on how to use it are available in our repository.
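A condensed sketch of this presence/absence extraction is shown below; the sample record and the selection of Dublin Core fields are illustrative, not our original script:

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"
DC_FIELDS = ["title", "creator", "subject", "description", "date"]

# A minimal OAI-PMH response with one record (header + metadata section).
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
 <ListRecords><record>
  <header>
   <identifier>oai:example.org:ds1</identifier>
   <datestamp>2019-05-02</datestamp>
  </header>
  <metadata>
   <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
              xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>Soil samples</dc:title>
    <dc:subject>biodiversity</dc:subject>
   </oai_dc:dc>
  </metadata>
 </record></ListRecords>
</OAI-PMH>"""

def presence_rows(xml_text):
    """One csv-ready row per record: dataset ID plus 1/0 per element."""
    root = ET.fromstring(xml_text)
    rows = []
    for record in root.iter(OAI + "record"):
        rec_id = record.findtext(f"{OAI}header/{OAI}identifier")
        used = {el.tag for el in record.iter() if el.tag.startswith(DC)}
        rows.append([rec_id] + [int(DC + f in used) for f in DC_FIELDS])
    return rows
```

Unused fields simply never appear in the response, so absence of a tag is recorded as 0.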
|Standard|Field|Reference|Description|
|---|---|---|---|
|OAI-DC|dc:date|http://www.dublincore.org/specifications/dublin-core/dces/|“A point or period of time associated with an event in the lifecycle of the resource.”|
|EML|pubDate|https://knb.ecoinformatics.org/external//emlparser/docs/eml-2.1.1/eml-resource.html#pubDate|“The ’pubDate’ field represents the date that the resource was published.”|
|DataCite|publicationYear|https://support.datacite.org/docs/schema-40|“The year when the data was or will be made publicly available.”|
|DIF| |https://gcmd.gsfc.nasa.gov/DocumentBuilder/defaultDif10/guide/metadata_dates.html|“refers to the date the data was created”|
|ISO19139/ISO19139.iodp|gco:DateTime|https://geo-ide.noaa.gov/wiki/index.php?title=ISO_Dates|publication date (CI-DataTypeCode=publication)|
|Pan-MD|md:dateTime|http://ws.pangaea.de/schemas/pangaea/MetaData.xsd|publication date (contact to data repository)|
| | | |“Date of first broadcast/publication”|
For our further considerations, we wanted to obtain a publication date for each downloaded dataset to inspect how many datasets have been published over the years and in which format per data repository. Unfortunately, a publication date is not provided in all metadata schemes. Therefore, we looked up each date-related field in the schema and used the one that is, based on its description, closest to a publication date. Table 6 depicts all date stamps utilized and their descriptions. If the respective date stamp was not found in a dataset or was empty, we left the dataset out of the following analysis.
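This lookup amounts to a simple mapping from schema to its closest-to-publication date field; a sketch, with field names following Table 6 (the schema keys are illustrative):

```python
# Closest-to-publication date field per schema (cf. Table 6);
# schema keys are illustrative identifiers, not official names.
PUB_DATE_FIELD = {
    "oai_dc": "dc:date",
    "eml": "pubDate",
    "datacite": "publicationYear",
    "pan_md": "md:dateTime",
}

def publication_year(schema, metadata):
    """Return the four-digit publication year, or None if the relevant
    field is missing or empty (such datasets are skipped)."""
    value = metadata.get(PUB_DATE_FIELD.get(schema, ""), "")
    return value[:4] if value else None
```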
General, descriptive metadata fields such as ‘title’, ‘description’ or ‘abstract’, and ‘subject’ might contain relevant data that is interesting for information seekers. Using conventional retrieval techniques, this data is only accessible via a full-text search and only if the entered query terms exactly match a term in the dataset. Hence, we aim to explore what information is available in general, descriptive metadata fields.
In a first step, we downloaded descriptive metadata fields, namely, dc:title, dc:description and dc:subject in OAI-DC format from all repositories in October and November 2019. Parallel to the download, we collected the keywords used in the subject field and counted their presence in a separate csv file.
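The keyword counting is a frequency table over dc:subject values; a minimal sketch, assuming a simple dict-per-record structure (and, like the repositories themselves, doing no case folding):

```python
import csv
from collections import Counter

def count_keywords(records, out_path=None):
    """Count dc:subject keywords over all records; optionally write a
    'keyword,count' csv. No case folding is applied, so 'Temperature'
    and 'temperature' end up as separate entries (cf. Table 9)."""
    counts = Counter(kw for rec in records
                     for kw in rec.get("dc:subject", []))
    if out_path:
        with open(out_path, "w", newline="") as fh:
            csv.writer(fh).writerows(counts.most_common())
    return counts
```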
In order to further inspect the content with Natural Language Processing (NLP) tools, we selected a subset of representative datasets. We limited the amount to 10,000 datasets per repository as the processing of textual resources is time-consuming and resource-intensive. A variety of applications have been developed to determine named entities (NE) such as geographic locations, persons and dates. Thessen et al. explored the suitability of existing NLP applications for biodiversity research. Their outcome reveals that current text mining systems, which were mainly developed for the biomedical domain, are able to discover biological entities such as species, genes, proteins and enzymes. Further relevant entity types such as habitats, data parameters or processes are currently not supported by existing taggers. Thus, we concentrated on the extraction of entity types that (a) correspond to the identified search interests and for which (b) text mining pipelines are available. We used the text mining framework GATE and its ANNIE pipeline as well as the OrganismTagger to extract geographic locations, persons, organizations and organisms.
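GATE, ANNIE and the OrganismTagger are Java tools; as a purely illustrative stand-in for organism detection, a naive pattern matcher for Latin binomials can be sketched in a few lines. This is in no way a replacement for a trained tagger and produces obvious false positives:

```python
import re

# Naive Latin-binomial pattern ('Daphnia magna'): a capitalised genus
# followed by a lowercase epithet. Deliberately simplistic; e.g. a
# capitalised sentence start followed by a lowercase word also matches.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+ [a-z]{3,})\b")

def organism_candidates(text):
    return BINOMIAL.findall(text)
```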
The overall statistics are presented in Table 7. At first, we inspected the fields concerning a publication date in a valid format. We could not use all harvested datasets, as publication dates were not available for some metadata files. Dryad also had a large number of datasets with the status “Item is not available”, which we left out, too. The number in brackets denotes the amount of datasets we used for the following considerations. What stands out is that most repositories provide general standards; only PANGAEA and GBIF utilize discipline-specific metadata schemes. Dryad and Figshare already provide metadata in semantic formats such as RDF. In addition, Figshare offers Qualified Dublin Core (QDC), an extended Dublin Core that allows the description of relations to other data sources.
Based on the given publication dates, we computed timelines (Figure 8) for the introduction of the various standards over time per data repository. The code and all charts are available in the repository. As Dryad provides several dc:date elements in the metadata, we used the first available date entry as publication date for the timeline chart.
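The per-year counts behind such timeline charts boil down to a grouping step; a minimal sketch over ISO date strings (not the plotting code itself):

```python
from collections import Counter

def timeline(date_stamps):
    """Number of published datasets per year from ISO date strings;
    empty entries (missing publication dates) are skipped."""
    years = Counter(d[:4] for d in date_stamps if d)
    return sorted(years.items())
```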
Per repository, the timelines for the different metadata formats are almost identical. Obviously, when introducing a new metadata format, publication dates were adopted from existing metadata formats. Only Figshare uses new date stamps when a new metadata format is provided. For instance, Figshare’s timeline shows that QDC and RDF were launched in 2015. The result for RDF was too large to process together with the other metadata formats; hence, we produced the timeline for RDF separately. The timelines across all repositories reveal a steadily increasing number of datasets being published at GBIF, Dryad, Zenodo and Figshare. For PANGAEA, the timeline points to a constant number of around 10,000 published datasets per year, apart from an initial release phase between 2003 and 2007.
|Format|Dryad|PANGAEA|GBIF|Zenodo|Figshare|
|---|---|---|---|---|---|
|OAI-DC|186951 (142329)|383899 (383899)|44718 (42444)|255000 (255000)|3128798 (3128798)|
|RDF|186955 (142989)| | | |3157347 (3157347)|
|DATACITE3| |383906 (383906)| |1268232 (1268232)| |
|OAI-DATACITE| | | |1266522 (1266522)|3134958 (3134958)|
Metadata Field Usage
Figure 9 presents how many metadata elements of the best matching standard were filled. The individual results per data archive are available in our repository as supplementary material. Dryad used 9 out of 15 available metadata fields from OAI-DC very often (> 80%), including important fields such as dc:title, dc:description and dc:subject. dc:publisher and dc:contributor were provided in less than 20% of the files. For GBIF, the EML standard does not define a fixed number of core elements; hence, we analyzed the 129 available fields. Most of them (89 elements) were not filled, e.g., fields describing taxonomic information. Data about author, title and description were provided in more than 80% of the files. The generic field eml:keyword was used in around 20%. Out of 124 used fields in PANGAEA’s Pan-MD format, 43 fields were filled in more than 80% of the harvested metadata files, including information on the author, project name, coordinates, data parameters and used devices. Fields that were filled less often are supplementary fields, for instance for citations, e.g., volume and pages. For Zenodo, all fields required by DataCite (identifier, creator, title, publisher, publication year) were always filled. In addition, title, rights and descriptions as well as resource type were provided in more than 99% of the analyzed metadata files. However, keywords (subject) were present in only 45% of the metadata files. Figshare used only 12 out of 17 available fields of QDC, but these fields were always filled.
Category - Field - Match
Per data repository and metadata format, we computed charts that visualize which field was filled at what percentage rate and if they correspond to the categories introduced in Section “A - Information Needs in the Biodiversity Domain”. Table 8 presents a summary of all data repositories and their best matching standard. The individual results per repository and the concrete field-to-category mapping are available in our repository.
Temporal expressions (TIME) and information about author and/or creator (PERSON) were mostly provided in all repositories. Apart from PANGAEA, repositories mainly provided the publication date and only partially added information about when the data was collected. Information about data types and formats was also contained in all metadata files apart from GBIF. The identified search categories were partially covered by two repositories: GBIF with EML reflects most of the categories, but fields that correspond to ENVIRONMENT, ORGANISM, DATA TYPE and METHOD were rarely filled. Metadata files in PANGAEA's repository-developed standard Pan-MD always contained information on data parameters (QUALITY) and geographic locations (LOCATION). In most cases, research methods and devices used were also given. Dryad provided geographic information (LOCATION) in its dc:coverage field in at least 60% of the files.
|GBIF (EML)||(3%)||(11%)||(35%)||(8%)||(18%)||(publication Date - 100%, collection Date - 10%)||(90%)|
|PANGAEA (Pan-MD)||(90%)||(100%)||(100%)||(Devices used - 90%, research methods - 65%)||(publication Date - 100%, collection Date - 80%)||(100%)|
|Unspecific (generic element)||Available (one or more elements)|
|Amount in brackets denotes the percentage the element is filled.|
Table 9 presents the top five keywords in the metadata field dc:subject for all repositories, sorted by their frequencies. The full keyword lists are available in our repository.
For GBIF datasets, an empty dc:subject field was returned in 81% of the cases. Zenodo’s metadata provided keywords in 52% of the inspected cases. None of the repositories seems to consider upper and lower case; for several terms, different spellings resulted in separate entries. Numerous keywords in PANGAEA and Dryad reveal that both repositories host marine data. PANGAEA’s list mainly contains measured data parameters and used devices. In contrast, Dryad’s list indicates that terrestrial data are also provided; for instance, the lower ranked terms contain entries such as Insects (1296) (insects (180)) or pollination (471) (Pollination (170)). Geographic information, e.g., California (9817), also occurred in Dryad’s dc:subject field. Zenodo’s and Figshare’s keyword lists contain numerous terms related to collection data. We checked the term ‘Biodiversity’ in the search interfaces on both repositories’ websites. It turned out that the Meise Botanic Garden (https://www.plantentuinmeise.be) provided large collection data in Zenodo. Hence, each occurrence record counted as a search hit and got the label ‘Biodiversity’. We also discovered that Figshare harvests Zenodo data, which also resulted in high numbers for Figshare and the keyword ‘Biodiversity’ (219022).
|PANGAEA|GBIF|Dryad|Zenodo|Figshare|
|---|---|---|---|---|
|water (201102)|Occurrence (6510), occurrence (46)|Temperature (16652), temperature (15916)|Taxonomy (459877), taxonomy (105)|Medicine (1057684), medicine (240)|
|DEPTH (198349), Depth (71916)|Specimen (3046), specimen (22)|Integrated Ocean Observing System (16373)|Biodiversity (458336), biodiversity (8593)|Biochemistry (1015906), biochemistry (92)|
|Spectral irradiance (175373)|Observation (2425), observation (24)|IOOS (16373)|Herbarium (270110), herbarium (91)|Biological Sciences not elsewhere classified (983829)|
|DATE/TIME (128917)|Checklist (589), checklist (43)|Oceanographic Sensor Data (15015)|Terrestrial (269900), terrestrial (177)|Chemical Sciences not elsewhere classified (842865)|
|Temperature (118522), temperature (50)|Plantas (368), plantas (42)|continental shelf (15015)|Animalia (205242), animalia (261)|Biotechnology (792223), biotechnology (23978)|
In a second analysis, we investigated which kinds of entities occur in descriptive metadata fields. As the processing of textual resources with NLP tools is time-consuming and resource-intensive, we selected a subset of 10,000 datasets per repository. Table 10 presents the filter strategies. For PANGAEA and GBIF, we randomly selected 10,000 datasets as they are domain-specific repositories for which all data are potentially relevant for biodiversity research. For Dryad, the filter consists of a group of relevant keywords, and for Zenodo and Figshare we used the keyword ‘Biodiversity’. Due to the large amount of collection data with the keyword ‘Biodiversity’, we are aware that this filter strategy might have introduced a certain bias in the selected data.
| |PANGAEA|GBIF|Dryad|Zenodo|Figshare|
|---|---|---|---|---|---|
|filter strategy|10000 randomly selected (388254)|10000 randomly selected (46954)|10000 randomly selected with keywords: biodiversity, climate change, ecology, insects, species richness, invasive species, herbivory, pollination, endangered species, ecosystem functioning, birds (149672)|10000 randomly selected with keyword: Biodiversity (1467958)|10000 randomly selected with keyword: Biodiversity (3602808)|
Per data repository, we processed the selected 10,000 files with two open-source taggers of the text mining framework GATE. Named entities such as Person, Organization and Location were obtained with the ANNIE pipeline, and organisms were obtained with the OrganismTagger. The results are presented in Table 11. Unfortunately, the OrganismTagger pipeline aborted for PANGAEA and Zenodo, but ‘Organism’ annotations were created in around 12% of the GBIF files, 36% of the Dryad files and 85% of the Figshare files. The number of ‘Organism’ annotations in Figshare files is probably that high due to the mentioned bias towards collection data. The number of ‘Organism’ annotations in GBIF files is low since the datasets mostly describe the overall study and do not contain concrete species names but rather broader taxonomic terms such as ‘Family’ or ‘Order’. A large number of ‘Location’ annotations was extracted from PANGAEA (91%) and Figshare (~100%) files. ‘Person’ and ‘Organization’ annotations were largely present in PANGAEA (~51%) and GBIF (~74%) files.
The text mining pipelines were originally developed and evaluated with text corpora and not sparse datasets. Hence, the results might contain wrong (false positive) annotations. However, the results indicate that NLP tools can support the identification of biological entities. That could be an approach for generalist repositories to additionally enrich metadata. All scripts and the final results are available in our repository.
| |PANGAEA|GBIF|Dryad|Zenodo|Figshare|
|---|---|---|---|---|---|
|Organism|N/A (pipeline aborted)|1183|3603|N/A (pipeline aborted)|8542|
|Person & Organization|5048|7355|657|192|1645|
D - Discussion
In this study, we explored what hampers dataset retrieval in biodiversity research. The following section summarizes our findings and outlines a proposal on how to bridge the gap between search interests in biodiversity and given metadata. We also highlight challenges that are not fully resolved yet.
Scholarly Search Interests in Biodiversity Research
In order to understand what biodiversity scholars are interested in, we gathered 169 questions, identified biological entities and classified the entities into 13 information categories. In the subsequent evaluation with domain experts, five categories were verified and can be considered important information needs in biodiversity research. They include information about habitats, ecosystems and vegetation (ENVIRONMENT), chemical compounds, sediments and rocks (MATERIAL), species (ORGANISM), and biological and chemical processes (PROCESS). Further categories mentioned very often are information about data parameters (QUALITY) and the nature or type of data resources (DATA TYPE). Usually, the latter is an outcome of a certain research method. However, the naming should be discussed in the research community as the comprehensibility of these categories was only rated as fair.
Comparison of Metadata Standards and User Interests
We selected 13 metadata standards used in the Life Sciences from re3data, and we analyzed whether the elements of the metadata schemes reflect the identified information categories.
Elements of general standards cover the categories only to some extent. LOCATION and DATA TYPE are the sole information needs that can be explicitly described with metadata fields of general standards such as DublinCore or DataCite. Further elements are focused on information less relevant for search, such as data creator and contributor (PERSON), collection or publication date (TIME), and license information. All this information is important for data reuse and data citation and needs to be part of the metadata. However, if the dataset is not findable, it can not be cited. As a general standard, DataCite provides many more fields and attributes to describe personal data, time and geographic information. Therefore, it should be provided in addition to DublinCore.
There are numerous discipline-specific standards that describe search interests quite well. For instance, ABCD, DarwinCore and EML provide elements to describe environmental information, species, methods, and data parameters. ISA-Tab, a framework for genome data and biological experiments, covers all important search categories. The only drawback is that it takes time for scholars and/or data curators to fill in all these fields; in some standards, more than 1000 elements are available. With our work, we aim to provide insights on what scholars are actually interested in when looking for scientific data. We believe that our results could serve as a good starting point for discussions in the respective research communities to define core elements of discipline-specific standards that are not only focused on data citation but also broaden the view on search interests.
Metadata Analysis of Selected Data Repositories
We selected 5 repositories from Nature’s list of recommended data archives and analyzed the metadata provided in their OAI-PMH interfaces. We wanted to know what metadata standards are used in common data repositories in the Life Sciences and how many elements of the standard are actually filled.
We figured that generalist repositories such as Dryad, Zenodo and Figshare tend to use only general standards such as DublinCore and DataCite. Even when using simple standards, the repositories did not make use of all provided elements. Furthermore, the elements utilized are not always filled. That hampers successful data retrieval. Most repositories seem to be aware of that problem and enhance metadata with numerous keywords in generic fields such as dc:subject. Discipline-specific repositories, e.g., PANGAEA and GBIF, are more likely to provide domain-related standards such as EML or Pan-MD. That supports improved filtering in search; however, it does not guarantee that the fields are always filled. In GBIF’s case, we are aware that we could not provide a full picture as we did not analyze the occurrence records. Here, only a deeper analysis of the fields provided in the search index would deliver more answers. However, that would require support from technical staff as access to search indices is limited.
Suggestions to Bridge the Gap
In this subsection, we outline approaches to overcome the current obstacles in dataset search applications based on our findings from the preceding sections. Table 12 presents checklists for data repositories and scholars that in the following are discussed in detail.
For Data Repositories
Adherence to the FAIR principles, long-term data preservation and the creation of citable and reusable data are main targets of all data repositories. Therefore, a strong focus of data archives is on generating unique identifiers and linking the metadata to their primary data and publications. Less considered is the perspective of dataset seekers. Hence, we propose the following improvements to enhance dataset retrieval.
Keep metadata diversity: Scientific data are very heterogeneous. This diversity can not be reflected in one generic metadata standard. Thus, it is highly recommended to use different domain-specific standards considering the requirements from various research disciplines.
Use proper metadata fields: If search interests are explicitly mentioned in metadata, conventional search techniques are able to retrieve relevant datasets. Providing possible search terms in generic keyword fields supports dataset retrieval in a full-text search but does not allow proper category-based facet creation. Therefore, using proper metadata fields covering potential search interests greatly enhances dataset retrieval and filtering.
In addition, metadata need to have a unique identifier and should answer the W-questions including information on how the data can be re-used. That comprises information on data owner, contact details and citation information.
Extend standards: Metadata standards are developed and adopted by large organizations or research communities for a specific purpose or research field. These bodies also discuss extensions with new fields or changes to existing elements. If the given fields are not sufficient for particular requirements, the preferred way is to get in touch with the standardization organization and to propose new fields or attributes. However, since these processes usually take a long time, it is sometimes unavoidable to extend a schema or to develop a new one. In these cases, it would be good scientific practice to give feedback to the standardization organization on why and how a schema has been changed or extended. That might influence the further development of standards and would counteract the creation of numerous repository-developed schemes.
Use controlled vocabularies: The questions that still remain and that have not been considered so far are how metadata fields are filled - by the data submitter, the data repository or by the system - and whether a controlled vocabulary is used for the keywords and the other metadata elements. When describing scientific data it is highly recommended to use controlled vocabularies or terminologies, in particular for important fields in search. If possible, Linked Open Data  vocabularies should be utilized to better link datasets, publications, authors and other resources. That supports data transparency, data findability and finally data reuse. In the Life Sciences, there are a variety of terminology providers. We provide a list of ontology and semantic service providers in our repository.
Utilize schema.org: Driven by Google and the Research Data Alliance (RDA) Data Discovery Group, the enrichment of HTML with schema.org (https://schema.org) entities became very popular in recent years. The enrichment helps to identify unique identifiers, persons, locations or time information in the HTML file. That supports external search engines or data providers to crawl the landing pages of search applications provided by the data repositories per dataset.
As the current schema.org entities do not fully reflect scientific search interests, more attention should be paid to initiatives such as bioschemas.org (https://bioschemas.org/), which aims to expand schema.org with biological entities such as species and genes. That confirms and complements our recommendations for explicit metadata fields tailored to search interests. At the time of writing this paper, bioschemas.org is still in draft mode. However, in the future, efforts like this will improve dataset retrieval significantly.
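To illustrate, a dataset landing page can embed a schema.org Dataset description as JSON-LD; all values below are invented for the example (hypothetical title, DOI and coordinates):

```python
import json

# Minimal schema.org Dataset markup for a landing page; every value is
# illustrative, including the DOI.
dataset_jsonld = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Soil samples (example dataset)",
    "description": "Geochemical measurements from groundwater wells.",
    "identifier": "https://doi.org/10.xxxx/example",  # hypothetical DOI
    "keywords": ["biodiversity", "soil", "groundwater"],
    "spatialCoverage": {
        "@type": "Place",
        "geo": {"@type": "GeoCoordinates",
                "latitude": 51.08, "longitude": 10.43},
    },
    "temporalCoverage": "2015-01-01/2016-12-31",
}

# Embedded in HTML so crawlers can pick it up:
html_snippet = ('<script type="application/ld+json">'
                + json.dumps(dataset_jsonld)
                + "</script>")
```

Crawlers then recognize identifiers, persons, locations and time information directly from the page, which is exactly the mechanism the RDA Data Discovery Group promotes.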
Extract implicit information: Apart from short information such as contact details, data type or location, metadata usually contain longer textual resources such as title, description and abstract. Most of them contain useful information for search and mention species observed or describe environments where data has been gathered. These resources could be used to extract implicit information and to automatically identify further relevant data.
For Scholars
Documenting scientific data is a disliked task that also takes time. Therefore, scholars attempt to minimize the effort of describing their data and are pleased when data repositories do not offer too many fields to fill in on data submission. However, scholars are responsible for properly documenting their data so that other researchers are able to find and reuse it. Hence, each scholar should describe the produced research data carefully and thoroughly. Based on our findings, we summarize what should be considered when submitting scientific data to a data repository.
Prefer domain-specific repositories: As generalist repositories tend to offer only general metadata standards for data description, preference should be given to domain-specific data archives. This is also recommended by highly influential journals such as Nature. Another advantage is that repositories familiar with the research domain might give more qualified feedback on the submitted data descriptions.
Use domain-specific metadata standards: Even selecting a domain-specific data repository does not guarantee that the archive uses proper metadata standards. Scholars are advised to know at least a few appropriate standards for their research field and, if not stated anywhere, to ask the repository whether one of these standards is supported.
Fill in all relevant fields with controlled vocabularies: All relevant metadata fields should be filled in. That enhances the chance that datasets are retrieved. When describing the data, scholars should attempt to use controlled vocabularies. As this is a new procedure in data submission, it is currently not supported by all data repositories. However, if it is available, it is recommended to use the terminologies given and not to describe the data with one’s own words.
Search for your data: Once the data is available in the repository’s search application, scholars are advised to check whether they can find their data with various search terms. They should also review whether the data are accessible and all displayed information is correct. It is also recommended to repeat this check from time to time as repositories might update or extend data presentations and/or the metadata schemes used.
Get in touch with the repository: If scholars notice anything wrong concerning their data, they should contact the archive. The staff at the repositories are probably grateful when attentive scholars give feedback on their submitted data or detect issues that hamper dataset retrieval.
As stated in our summary of the question analysis, the outcomes in Section “A - Information Needs in the Biodiversity Domain” are not a complete picture of search interests but only serve as a start for discussions with biodiversity researchers to further identify possible search categories.
Controlled vocabularies can only be used if appropriate terminologies exist, which is not the case for all topics. While there are numerous vocabularies for species, to the best of our knowledge there is no vocabulary that allows the description of research methods and results. Scientific data types are also less considered in existing terminologies.
Another challenge lies in the automatic identification of relevant search topics in metadata. The text mining community has already developed various taggers and pipelines to extract organisms, chemistry items or genes from text. These annotations can support automatic facet or category creation. However, taggers are still missing for important categories such as habitats, data parameters, biological and chemical processes or research methods. In order to increase the semantic linkage of datasets with other resources such as publications, authors and locations, it would be of great benefit if the annotations also contained URIs pointing to resources in controlled vocabularies. Then, dataset retrieval could be expanded to semantically related terms such as synonyms or more specific or broader terms.
An important point, however, concerns neither standards, systems nor vocabularies, but scholars themselves. Scholars need to be aware that thorough data descriptions are part of good scientific practice. In order to preserve all kinds of scientific data, independently of whether they have been used in publications or not, proper metadata in appropriate schemes are the key to successful dataset retrieval and thus to data citation and data reuse. Data repositories could offer data curation services to support scholars in describing research data and to encourage them to describe their data thoroughly. We are aware that introducing more domain-specific metadata schemes at generalist repositories would require considerable effort; however, it would enhance dataset retrieval.
Computer science research can contribute to improvements for dataset search by developing methods and software tools that facilitate standard-compliant metadata provision ideally at the time of data collection, thus ensuring metadata standards to be actually used by data providers.
Scholarly search interests are as diverse as the data themselves, ranging from specific information needs, such as searches for soil samples collected in a certain environment, to broader research questions inspecting relationships among species. Our findings reveal that these search interests are not entirely reflected in existing metadata. One problem is that general standards are simple and mainly contain information that supports data citation. Actual search interests can only be represented if keywords and suitable search terms are provided in the generic, non-specific fields offered by most standards, e.g., dc:subject in Dublin Core. Most data repositories utilize these fields to enrich metadata with suitable search terms. However, if search interests are not explicitly given, facet creation, e.g., filtering over species or habitats, becomes difficult, and full-text searches only return data whose keywords match the query terms. On the other hand, even when scholars submit their data to a domain-specific repository that uses discipline-specific metadata standards, there is no guarantee that all search-relevant fields will be filled.
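The dependence of both full-text search and faceting on explicit keywords can be illustrated with two hypothetical metadata records, one with populated dc:subject entries and one without (record contents and the "habitat: " prefix convention are assumptions for this sketch):

```python
# Two hypothetical metadata records: only the first carries keywords.
records = [
    {"id": "ds1", "title": "Beetle traps 2016",
     "subject": ["Coleoptera", "habitat: grassland"]},
    {"id": "ds2", "title": "Trap counts, plot A", "subject": []},
]

def full_text_search(records, query):
    """Return ids of records whose title or subject terms contain the query."""
    q = query.lower()
    return [r["id"] for r in records
            if q in r["title"].lower()
            or any(q in s.lower() for s in r["subject"])]

def facet_values(records, prefix="habitat: "):
    """Collect facet values, e.g. habitats, from dc:subject entries."""
    return {s[len(prefix):] for r in records
            for s in r["subject"] if s.startswith(prefix)}

# ds2 also describes beetle data, but without keywords it is not found.
found = full_text_search(records, "coleoptera")  # → ["ds1"]
habitats = facet_values(records)                 # → {"grassland"}
```

The second record is invisible to both the query and the habitat facet, which is exactly the gap between search interests and provided metadata described above.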
Data findability, one of the four FAIR principles, at least partially relies on rich metadata descriptions that reflect scholarly information needs. If the information scholars are interested in is not available in metadata, the primary data cannot be retrieved, reused and cited. In order to close this gap, we propose checklists for data archives and scholars to overcome the current obstacles, and we highlight remaining challenges. In our future work, we would like to focus on machine-supported extraction of relevant search categories in metadata as well as automatic filling of metadata fields from primary data. This will streamline the metadata creation process and support scholars and data repositories in producing proper, rich metadata with semantic enrichment.
We acknowledge the Collaborative Research Centre AquaDiva (CRC 1076 AquaDiva) of the Friedrich Schiller University Jena and the GFBio project (KO2209/13-2), both funded by the Deutsche Forschungsgemeinschaft (DFG). The authors would also like to thank the annotators and reviewers for their time and valuable comments.
- (2017) Overview of the medical question answering task at TREC 2017 LiveQA. Technical report, TREC LiveQA 2017.
- (2011) Describing Linked Datasets with the VoID Vocabulary. https://www.w3.org/TR/void/, accessed on 24.01.2019.
- Introduction: named entity recognition in biomedicine. Journal of Biomedical Informatics 37 (6), pp. 393–395.
- (2020) CRC AquaDiva. http://www.aquadiva.uni-jena.de/, accessed on 12.01.2020.
- (2013) Assessment of user needs of primary biodiversity data: analysis, concerns, and challenges. Biodiversity Informatics 8 (2).
- (2008) Modern information retrieval: the concepts and technology behind search. 2nd edition, Addison-Wesley Publishing Company, USA.
- (2014) RDF Schema 1.1. https://www.w3.org/TR/rdf-schema/, accessed on 30.11.2019.
- (2017) QUIS: in-situ heterogeneous data source querying. Proc. VLDB Endow. 10 (12), pp. 1877–1880.
- (2019) Dataset search: a survey. The VLDB Journal.
- (2012) DataONE: a distributed environmental and earth science data network supporting the full data life cycle. In EGU General Assembly 2012, Vienna, Austria, p. 11863.
- (2009) Search engines: information retrieval in practice. 1st edition, Addison-Wesley Publishing Company, USA.
- (2018) Navigating the unfolding open data landscape in ecology and evolution. Nature Ecology & Evolution 2 (3), pp. 420–426.
- (2011) Text Processing with GATE (Version 6). University of Sheffield, Dept. of Computer Science.
- (2002) GATE: a framework and graphical development environment for robust NLP tools and applications. In Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02).
- (2019) Quantifying FAIR: metadata improvement and guidance in the DataONE repository network. https://www.dataone.org/webinars/quantifying-fair-metadata-improvement-and-guidance-dataone-repository-network.
- (2019) Indexer documentation. https://github.com/DataONEorg/indexer_documentation, accessed on 20.11.2019.
- (2014) Towards an integrated biodiversity and ecological research data management and archiving platform: GFBio. In Informatik 2014.
- (2019) Dryad. https://datadryad.org/, accessed on 16.05.2019.
- (2013) Bridging the biodiversity data gaps: recommendations to meet users' data needs. Biodiversity Informatics 8 (2).
- (2019) A data citation roadmap for scholarly data repositories. Scientific Data 6 (1), pp. 28.
- (2019) figshare. https://figshare.com/, accessed on 16.05.2019.
- (1971) Measuring nominal scale agreement among many raters. Psychological Bulletin 76 (5), pp. 378–382.
- (2013) Content assessment of the primary biodiversity data published through the GBIF network: status, challenges and potentials. Biodiversity Informatics 8 (2).
- (2018) GBIF Science Review 2018. Technical report, https://doi.org/10.15468/VA9B-3048, accessed on 20.02.2019.
- (2019) Search. http://api.gbif.org/v1/occurrence/search, accessed on 30.11.2019.
- (2020) Global Biodiversity Information Facility. https://www.gbif.org/, accessed on 12.01.2020.
- (2020) The German Federation for Biological Data. https://www.gfbio.org, accessed on 12.01.2020.
- (2019) https://developers.google.com/search/docs/guides/intro-structured-data, accessed on 20.02.2019.
- Computing inter-rater reliability and its variance in the presence of high agreement. British Journal of Mathematical and Statistical Psychology 61 (1), pp. 29–48.
- (2017) BioFed: federated query processing over life sciences linked open data. Journal of Biomedical Semantics 8 (1), pp. 13.
- (2011) Modern information retrieval. R. Baeza-Yates and B. Ribeiro-Neto (Eds.), pp. 257–340.
- (2011) Linked data: evolving the web into a global data space. Synthesis Lectures on the Semantic Web: Theory and Technology 1 (1), pp. 1–136.
- (2009) TREC Genomics special issue overview. Information Retrieval 12 (1), pp. 1–15.
- (2019) German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig. https://www.idiv.de, accessed on 11.04.2019.
- (2009) Understanding PubMed® user search behavior through log analysis. Database 2009.
- (2002) Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst. 20 (4), pp. 422–446.
- (2000) Speech and language processing: an introduction to natural language processing, computational linguistics, and speech recognition. 1st edition, Prentice Hall PTR, Upper Saddle River, NJ, USA.
- (2008) Speech and language processing: an introduction to natural language processing, computational linguistics, and speech recognition. 2nd edition, Prentice Hall PTR, Upper Saddle River, NJ, USA.
- (2018) Characterising dataset search—an analysis of search logs and data requests. Journal of Web Semantics.
- (2016) A terminology service supporting semantic annotation, integration, discovery and analysis of interdisciplinary research data. Datenbank-Spektrum 16 (3), pp. 195–205.
- (2018) Semantic annotation of consumer health questions. BMC Bioinformatics 19 (1), pp. 34.
- (2013) Dataset retrieval. In Proceedings of the 2013 IEEE Seventh International Conference on Semantic Computing, ICSC '13, Washington, DC, USA, pp. 1–8.
- (1977) The measurement of observer agreement for categorical data. Biometrics 33 (1), pp. 159–174.
- (2017) Honey bee versus Apis mellifera: a semantic search for biological data. In The Semantic Web: ESWC 2017 Satellite Events, Portorož, Slovenia, May 28–June 1, 2017, Revised Selected Papers, E. Blomqvist, K. Hose, H. Paulheim, A. Ławrynowicz, F. Ciravegna, and O. Hartig (Eds.), pp. 98–103.
- (2017) What do biodiversity scholars search for? Identifying high-level entities for biological metadata. In Proceedings of the 2nd Semantics for Biodiversity Workshop held in conjunction with ISWC 2017, A. Algergawy, N. Karam, F. Klan, and C. Jonquet (Eds.), Vienna, Austria.
- (2014) Data Citation Synthesis Group: Joint Declaration of Data Citation Principles. San Diego, CA: FORCE11. https://doi.org/10.25490/a97f-egyk.
- (2014) Data Catalog Vocabulary (DCAT). https://www.w3.org/TR/vocab-dcat/, accessed on 24.01.2019.
- (2008) Introduction to information retrieval. Cambridge University Press, New York, NY, USA.
- (2005) Identifying gene and protein mentions in text using conditional random fields. BMC Bioinformatics 6 (1), pp. S6.
- (2018) Bioschemas & schema.org: a lightweight semantic layer for life sciences websites. Biodiversity Information Science and Standards 2, pp. e25836.
- (2011) OrganismTagger: detection, normalization and grounding of organism entities in biomedical documents. Bioinformatics 27 (19), pp. 2721–2729.
- (2018) Scientific Data, recommended data repositories. https://www.nature.com/sdata/policies/repositories, accessed on 18.12.2018.
- (2017) Results of the fifth edition of the BioASQ challenge. In BioNLP 2017, Vancouver, Canada, pp. 48–57.
- (2019) Data Publisher for Earth & Environmental Science. https://www.pangaea.de/, accessed on 30.11.2019.
- (2019) Search. http://ws.pangaea.de/es/portals/pansimple/_search, accessed on 30.11.2019.
- (2016) Transparency in ecology and evolution: real problems, real solutions. 31 (9), pp. 711–719.
- (2017) Essential Annotation Schema for Ecology (EASE)—a framework supporting the efficient data annotation and faceted navigation in ecology. PLOS ONE 12 (10), pp. 1–13.
- (2013) Expert team. Project deliverable, Technical Report D3.1.
- (2019) US National Library of Medicine, National Institutes of Health. https://www.ncbi.nlm.nih.gov/pubmed/, accessed on 30.11.2019.
- (2016) How robust are multirater interrater reliability indices to changes in frequency distribution? 70 (4), pp. 373–384.
- (2018) Environmental coupling of heritability and selection is rare and of minor evolutionary significance in wild populations. 2.
- (2019) Data Discovery Interest Group. https://www.rd-alliance.org/groups/data-discovery-paradigms-ig, accessed on 20.02.2019.
- (2018) https://www.re3data.org, accessed on 21.11.2018.
- (2017) Information retrieval for biomedical datasets: the 2016 bioCADDIE dataset retrieval challenge. Database 2017.
- (2018) A survey of current practices in data search services. Technical report, Research Data Alliance (RDA) Data Discovery Paradigms Interest Group. doi:10.17632/7j43z6n22z.1.
- (2008) Promoting coherent minimum reporting guidelines for biological and biomedical investigations: the MIBBI project. Nature Biotechnology, 889. https://doi.org/10.1038/nbt.1411-08.
- (2012) Applications of natural language processing in biodiversity science. 2012 (Article ID 391574), 17 pages.
- (2014) An introduction to question answering over linked data. In Reasoning Web. Reasoning on the Web in the Big Data Era: 10th International Summer School 2014, Athens, Greece, September 8–13, 2014, Proceedings, M. Koubarakis, G. Stamou, G. Stoilos, I. Horrocks, P. Kolaitis, G. Lausen, and G. Weikum (Eds.), pp. 100–140.
- (2012) OWL Working Group, OWL 2 Web Ontology Language. https://www.w3.org/TR/owl2-overview/, accessed on 12.11.2019.
- (2016) The FAIR guiding principles for scientific data management and stewardship. Scientific Data 3 (160018).
- (2019) Zenodo. https://zenodo.org/, accessed on 16.05.2019.
- (2019) Search. https://zenodo.org/api/records, accessed on 30.11.2019.