Enabling Semantic Data Access for Toxicological Risk Assessment

08/27/2019 ∙ by Erik Bryhn Myklebust, et al. ∙ 0

Experimental effort and animal welfare are concerns when exploring the effects a compound has on an organism. Appropriate methods for extrapolating chemical effects can further mitigate these challenges. In this paper we present the efforts to (i) (pre)process and gather data from public and private sources, varying from tabular files to SPARQL endpoints, and (ii) integrate the data and represent them as a knowledge graph with richer semantics. This knowledge graph is further applied to facilitate the retrieval of the relevant data for an ecological risk assessment task, the extrapolation of effect data, for which two prediction techniques are developed.




1 Data preparation and integration

We have created four APIs for wrangling and incorporating effect, taxonomy, and chemical data into the TERA knowledge graph. Excluding the SPARQL endpoints (Wikidata: https://query.wikidata.org/sparql; ChEBI: https://www.ebi.ac.uk/rdf/services/sparql), the data can be downloaded from the sources' websites (ECOTOX: https://cfpub.epa.gov/ecotox/; PubChem: https://pubchemdocs.ncbi.nlm.nih.gov/downloads; NCBI Taxonomy: https://www.ncbi.nlm.nih.gov/guide/taxonomy/).

Species API. This API uses data from various tabular sources to describe the species taxonomy and features.

  1. Including the NCBI Taxonomy in the knowledge graph is split into several sub-tasks.

    1. Loading the hierarchical structure found in nodes.dmp. The columns of interest are the taxon identifiers of the child and parent taxon, along with the rank of the child taxon and the division where the taxon belongs. We use this to create triples like (v) and (vi) in Table 3.

    2. To aid alignment between NCBI and ECOTOX identifiers, we add the synonyms found in names.dmp. Here, the taxon identifier, its name, and its name type are used to create triples similar to (vii) in Table 3. Note that a taxon in NCBI can have a plethora of names, while a taxon in ECOTOX usually has two (common and Latin).

    3. Finally, we add the labels of the divisions found in divisions.dmp. In addition, we add disjointness axioms among all divisions, e.g., Triple (ii) in Table 3.

  2. The EOL data can be downloaded in a uniform format regardless of the relation to be added to the knowledge graph. Therefore, our approach is universal for endemic, habitat, and other data from EOL.

    1. Each dataset for an EOL relation contains a glossary.tsv and a data.tsv file.

    2. The glossary is used to convert the strings (columns Measurement Type and Value) given in the data to URIs. We map the identifiers given in the data to NCBI URIs (see the data alignment API section) and create triples using the NCBI URI as subject, with Measurement Type and Value (from EOL) as predicate and object, as shown in Triples (viii) and (ix) in Table 3.

    3. In addition, EOL gives hierarchies for the Measurement Values in a two-column format with parent and child nodes. Therefore, we can simply add subsumption axioms using these child-parent pairs, as shown in Triple (xvii) in Table 3.
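The first NCBI sub-task above (loading the hierarchy from nodes.dmp) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the usual taxdump layout, in which fields are separated by a tab, a pipe, and a tab, and the first, second, third, and fifth columns hold the taxon identifier, parent identifier, rank, and division identifier. The prefixes follow Table 3; the predicate ncbi:division and the sample row's non-identifier values are assumptions made for the example.

```python
# Sketch: turn nodes.dmp rows into (subject, predicate, object) triples.
# Prefixes follow Table 3; "ncbi:division" as a predicate is an assumption.

FIELD_SEP = "\t|\t"
ROW_END = "\t|"


def parse_dmp_line(line):
    """Split one taxdump row into its fields."""
    line = line.rstrip("\n")
    if line.endswith(ROW_END):
        line = line[: -len(ROW_END)]
    return [field.strip() for field in line.split(FIELD_SEP)]


def node_triples(lines):
    """Yield hierarchy, rank, and division triples for each taxon row."""
    for line in lines:
        fields = parse_dmp_line(line)
        tax_id, parent_id, rank, division_id = fields[0], fields[1], fields[2], fields[4]
        subject = f"ncbi:taxon/{tax_id}"
        if tax_id != parent_id:  # the root taxon lists itself as its own parent
            yield (subject, "rdfs:subClassOf", f"ncbi:taxon/{parent_id}")
        yield (subject, "ncbi:rank", f"ncbi:{rank.capitalize()}")
        yield (subject, "ncbi:division", f"ncbi:division/{division_id}")


# Illustrative row: the identifiers come from Table 3, the other values are made up.
sample = ["687295\t|\t513583\t|\tspecies\t|\tCC\t|\t1\t|\t1\t|\n"]
triples = list(node_triples(sample))
for triple in triples:
    print(triple)
```

Ranks that contain spaces (for example "no rank") would need extra normalization before being used in an identifier; the sketch leaves that out.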

#        subject                predicate            object
(i)      ecotox:group/Worms     owl:disjointWith     ecotox:group/Fish
(ii)     ncbi:division/2        owl:disjointWith     ncbi:division/4
(iii)    ncbi:division/2        rdfs:label           "Mammals"
(iv)     ecotox:taxon/34010     rdfs:subClassOf      ecotox:taxon/hirta
(v)      ncbi:taxon/687295      rdfs:subClassOf      ncbi:taxon/513583
(vi)     ncbi:taxon/687295      ncbi:rank            ncbi:Species
(vii)    ncbi:taxon/687295      ncbi:scientificname  "Coleophora cornella"
(viii)   ncbi:taxon/35525       eol:habitat          ENVO:00000873
(ix)     ncbi:taxon/35525       eol:presentIn        worms:Oostende
(x)      compound:CID10198308   rdf:type             obo:CHEBI_134899
(xi)     obo:CHEBI_134899       rdfs:subClassOf      obo:CHEBI_37919
(xii)    compound:CID10198308   pubchem:formula      ""
(xiii)   ecotox:effect/001      ecotox:compound      ecotox:chemical/115866
(xiv)    ecotox:effect/001      ecotox:species       ecotox:taxon/26812
(xv)     ecotox:effect/001      ecotox:endpoint      ecotox:LC50
(xvi)    ecotox:taxon/33155     owl:sameAs           ncbi:taxon/311871
(xvii)   eol:freshwaterPond     rdfs:subClassOf      ENVO:00000033
Table 3: Example triples from the TERA knowledge graph

Chemical API. The combination of RDF and SPARQL endpoints forms the basis for the chemical API:

  1. The Turtle files (a standard format for storing RDF graphs) downloaded from PubChem can be used directly, as they already contain RDF triples. Triple (x) in Table 3 is an example from these files.

  2. To complete the class hierarchy where PubChem provides the bottom level, we query the ChEBI SPARQL endpoint using the query shown in Listing 1. Here, we bind the values in <current> and retrieve the classes they point to through edges of type rdfs:subClassOf or rdf:type. The query is iterated, replacing <current> with the superclasses returned by the previous round (<current> can also be replaced with a list). Triple (xi) in Table 3 shows an example result of this query.

  3. Since the chemical data is much larger than any of the other data sources used, we do not load chemical features on initialization, but upon request. We use the PubChemPy (pubchempy) library to query the PubChem REST API. Triples such as (xii) in Table 3 are the result of an API request.

SELECT ?class {
    VALUES ?s { <current> }
    ?s rdfs:subClassOf | rdf:type ?class .
    FILTER (!isBlank(?class))
}
Listing 1: Query superclasses from ChEBI.
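The iteration described for Listing 1 can be sketched as a simple loop that feeds each round's superclasses back in as the next <current>. This is a sketch under stated assumptions: query_superclasses is a hypothetical callable standing in for one endpoint request, and it is assumed to return (child, parent) pairs, which would require the query to select ?s as well as ?class. The toy hierarchy replaces the ChEBI endpoint, and its top-level identifier is made up.

```python
def collect_superclasses(start, query_superclasses):
    """Repeat the superclass lookup until no new classes appear.

    query_superclasses(batch) is a hypothetical stand-in for one endpoint call:
    it takes a list of class identifiers and returns (child, parent) pairs.
    """
    edges = set()
    seen = set(start)
    current = list(start)
    while current:
        next_round = []
        for child, parent in query_superclasses(current):
            edges.add((child, parent))
            if parent not in seen:  # avoid re-querying classes and looping on cycles
                seen.add(parent)
                next_round.append(parent)
        current = next_round
    return edges


# Toy hierarchy used in place of the ChEBI endpoint; "obo:CHEBI_TOP" is made up.
TOY = {
    "obo:CHEBI_134899": ["obo:CHEBI_37919"],
    "obo:CHEBI_37919": ["obo:CHEBI_TOP"],
}


def toy_query(batch):
    """Return (child, parent) pairs for every class in the batch."""
    return [(child, parent) for child in batch for parent in TOY.get(child, [])]


edges = collect_superclasses(["obo:CHEBI_134899"], toy_query)
print(sorted(edges))
```

Passing the whole batch per round matches the remark that <current> can be replaced with a list, which keeps the number of endpoint requests equal to the depth of the hierarchy rather than the number of classes.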

Effect API. The tabular data in ECOTOX requires significantly more cleaning than the other data.

  1. ECOTOX contains metadata about the species and compounds used in the experiments. We use this information to aid alignment between the effect and the background data.

    1. Species metadata in species.txt includes the common and Latin names, along with a (species) ECOTOX group. This group is a categorization of the species based on ECOTOX use cases. We filter the species names: markers such as sp. and var. (i.e., unidentified species and variant) are removed, along with the various missing-value shorthands used in the metadata.

    2. The full hierarchical lineage is also available in the species.txt file. Each column represents a taxonomic level, e.g., genus or family. If a column is empty, we construct an intermediate classification. For example, if Daphnia magna has no genus classification in the data, its classification will be Daphniidae genus (family name + "genus"; the genus is actually called Daphnia). We construct these classifications to ensure that the number of levels in the taxonomy is consistent. This consistency helps when aligning to the NCBI data. Note that when adding triples such as (iv) in Table 3, we also add a classification based on the column to make querying for a specific taxonomic level easier.

    3. Chemical metadata in chemicals.txt is handled similarly. The data includes the chemical name and a (compound) ECOTOX group.

  2. The effect data consists of two parts: a test definition and the results associated with that test. Note that a test can have multiple results.

    1. The important aspects of a test are the compound and the species used. Other columns include metadata, but these are optional and often empty. Each result gives an endpoint (see Table 1), an effect (e.g., chronic or mortal), and the concentration and unit at which the endpoint and effect were recorded.

    2. We construct a node of type result and link each of the result's properties to it. Examples can be seen in (xiii)-(xv) in Table 3.
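The lineage completion described for species.txt can be sketched as follows. This is a minimal illustration under stated assumptions: the set of levels, the column names, and the choice to name a missing level after the nearest filled higher level are assumptions made for the example, and the lineage values are illustrative.

```python
# Taxonomic levels from most specific to most general; the exact column set is an assumption.
LEVELS = ["species", "genus", "family", "order", "class"]


def fill_lineage(row):
    """Return a complete lineage, naming each missing level after the nearest filled higher level."""
    filled = {}
    higher_name = None
    for level in reversed(LEVELS):  # walk from the most general level downwards
        original = (row.get(level) or "").strip()
        if original:
            higher_name = original
            filled[level] = original
        elif higher_name is not None:
            filled[level] = f"{higher_name} {level}"  # e.g. "Daphniidae genus"
        else:
            filled[level] = ""  # nothing above to derive a placeholder from
    return {level: filled[level] for level in LEVELS}


# Illustrative row with an empty genus column.
row = {
    "species": "Daphnia magna",
    "genus": "",
    "family": "Daphniidae",
    "order": "Diplostraca",
    "class": "Branchiopoda",
}
lineage = fill_lineage(row)
print(lineage["genus"])
```

Every row ends up with the same number of levels, which is the consistency property the text relies on when aligning to NCBI.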

Data alignment API. We use various techniques to align the datasets described above.

ECOTOX-NCBI (Species). No complete, public alignment exists between ECOTOX species and the NCBI taxonomy. Therefore, we have used the LogMap (logmap2011; logma_ecai2012) ontology alignment system to align the two vocabularies. A partial mapping curated by experts is available through the ECOTOX search interface (https://cfpub.epa.gov/ecotox/search.cfm), from which we have gathered a total of 929 mappings for validation purposes. LogMap's lexical indexation gave us 5,472 possible NCBI entities to map to ECOTOX. Around of the ECOTOX (instance) vocabulary was mapped to NCBI covering all 929 expert-curated mappings. Hence, an estimated recall of . The TERA knowledge graph includes the LogMap mappings as additional equivalence triples, e.g., Triple (xvi) in Table 3.
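Recall against a set of curated reference mappings, as used for validation here, is the share of reference pairs that the aligner also produced. A minimal sketch, with made-up identifier pairs and a hypothetical function name:

```python
def mapping_recall(predicted, reference):
    """Fraction of reference mappings that also appear among the predicted mappings."""
    reference = set(reference)
    if not reference:
        raise ValueError("reference mapping set is empty")
    return len(reference & set(predicted)) / len(reference)


# Made-up identifier pairs, for illustration only.
reference = {
    ("ecotox:taxon/1", "ncbi:taxon/10"),
    ("ecotox:taxon/2", "ncbi:taxon/20"),
}
predicted = reference | {("ecotox:taxon/3", "ncbi:taxon/30")}
recall = mapping_recall(predicted, reference)
print(recall)
```

Because the reference set is only a partial mapping, this measures recall relative to the curated pairs and says nothing about precision on the remaining mappings.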

EOL-NCBI (Species). To be able to use the EOL data, we need to align the EOL identifiers with NCBI. This can be done through Wikidata, as shown in the query in Listing 2. This query uses the Wikidata properties instance of (wdt:P31), Encyclopedia of Life ID (wdt:P830), and NCBI Taxonomy ID (wdt:P685), along with the class taxon (wd:Q16521).

SELECT ?species ?ncbi ?eol WHERE {
        ?species wdt:P31 wd:Q16521 ;
                 wdt:P830 ?eol ;
                 wdt:P685 ?ncbi .
}
Listing 2: EOL and NCBI identifiers.

ECOTOX-PubChem (Compounds). To enable interaction between the Chemical API and the effect data, we create a mapping between CAS and InChIKey using the SPARQL query shown in Listing 3 on the Wikidata endpoint. This query uses the Wikidata properties and classes wdt:P31, wdt:P235, wdt:P231, and wd:Q11173, which have the labels instance of, InChIKey, CAS Registry Number, and chemical compound, respectively.

SELECT DISTINCT ?compound ?inchikey ?cas WHERE {
    ?compound wdt:P31 wd:Q11173 ;
              wdt:P235 ?inchikey ;
              wdt:P231 ?cas .
}
Listing 3: Compound CAS and InChIKey identifiers.
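Once a query like Listing 3 has been run, its result can be turned into a lookup table. The sketch below assumes the endpoint returns the standard SPARQL JSON results layout (a results object holding a list of bindings, each with a value per variable). The function name is hypothetical, and the response is hand-written with placeholder identifiers rather than real Wikidata output.

```python
import json


def cas_to_inchikey(sparql_json):
    """Build a CAS -> InChIKey dictionary from a SPARQL JSON result document."""
    mapping = {}
    for binding in sparql_json["results"]["bindings"]:
        cas = binding["cas"]["value"]
        inchikey = binding["inchikey"]["value"]
        mapping.setdefault(cas, inchikey)  # keep the first InChIKey seen for each CAS number
    return mapping


# Hand-written response in the SPARQL JSON results layout; all identifiers are placeholders.
response = json.loads("""
{"head": {"vars": ["compound", "inchikey", "cas"]},
 "results": {"bindings": [
   {"compound": {"type": "uri", "value": "http://www.wikidata.org/entity/Q0"},
    "inchikey": {"type": "literal", "value": "AAAAAAAAAAAAAA-BBBBBBBBBB-C"},
    "cas": {"type": "literal", "value": "000-00-0"}}
 ]}}
""")
mapping = cas_to_inchikey(response)
print(mapping)
```

Keeping only the first InChIKey per CAS number is a simplification; a CAS number that maps to several InChIKeys would need an explicit policy.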

Requesting chemical features from PubChem requires us to convert InChIKeys to CIDs. Fortunately, this mapping is available through the PubChem REST API. An example request using PubChemPy is shown in Listing 4.

from pubchempy import get_compounds
# inchikey holds the InChIKey string of the compound (here, DEET)
r = get_compounds(inchikey, "inchikey")
r = [c.to_dict(properties=["cid"]) for c in r]
cid = [c["cid"] for c in r]  # [4284]
Listing 4: Converting from InChIKey to CID for the compound DEET using PubChemPy.

2 Data access

For data access we can use either the APIs or SPARQL directly. The output will depend on the required task, and can be given either as a graph or in a tabular format.

APIs. In addition to SPARQL queries for extracting data from the knowledge graph, the APIs provide methods that give access to the data for users who are not proficient in SPARQL and prefer a scripting language (i.e., Python). The methods are, for the most part, abstractions of SPARQL queries.

  1. In addition to classification, sibling, and name queries, the Species API has methods for fuzzy querying of identifiers based on closely matching names. This is a necessary feature, since the name definition may vary from user to user.

  2. Since the Chemical API uses the most varied sources, we need to convert between them. Therefore, the API can convert between CAS, InChIKey, PubChem ID (called CID), and internal identifiers to interact with the NIVA internal databases. If these identifiers are not sufficient, the user can query Wikidata directly.

  3. As mentioned, the chemical features are not included in the knowledge graph, purely for practical reasons. Therefore, fetching features from PubChem is a method in the API. We also include methods for other properties available in PubChem, such as chemical fingerprints, which are bit strings representing the presence or absence of selected chemical properties.

  4. The Effect API has several methods for mapping between species identifiers (complementing LogMap mappings). These methods use the species names to query the Wikidata SPARQL endpoint and fetch the mappings between identifiers.
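Fuzzy name lookup of the kind described for the Species API can be sketched with the Python standard library. This is an illustration only: the text does not state which matching technique the API uses, so difflib is a swapped-in choice, and the function name, cutoff, and name-to-identifier table are assumptions.

```python
from difflib import get_close_matches


def fuzzy_lookup(name, name_to_id, cutoff=0.8):
    """Return (matched name, identifier) pairs for known names close to the input."""
    lowered = {known.lower(): known for known in name_to_id}
    hits = get_close_matches(name.lower(), list(lowered), n=3, cutoff=cutoff)
    return [(lowered[hit], name_to_id[lowered[hit]]) for hit in hits]


# Small illustrative lookup table.
names = {
    "Daphnia magna": "ncbi:taxon/35525",
    "Coleophora cornella": "ncbi:taxon/687295",
}
matches = fuzzy_lookup("Daphnia magma", names)  # note the misspelling
print(matches)
```

Lower-casing both sides makes the match case-insensitive, and the cutoff controls how tolerant the lookup is to spelling differences.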

A case study. For researchers competent in SPARQL, the most powerful way to access data in TERA is through queries. Here we give an example of how TERA can be used to extract data for a risk assessment case study.

The first step in a risk assessment is to define a case study. In Listing 5 we define our study area as the lake Langtjern. Thereafter, we can extract the compounds and the concentrations at which the species in the lake experience lethal effects. The concentrations can then be compared with water samples (exposure) from Langtjern to see whether the endangered species are at risk of extinction. (The comparison can be done with another (case study) API. However, this uses only private data and is therefore not included here.)

SELECT ?s ?c ?conc WHERE {
    ?s      eol:habitat eol:Freshwater ;
            eol:presentIn [rdfs:label "Langtjern"] ;
            eol:conservationStatus eol:endangered .
    []      rdf:type ecotox:Result ;
            ecotox:endpoint ecotox:LC50 ;
            ecotox:effectType ecotox:ACUTE ;
            ecotox:compound ?c ;
            ecotox:concentration ?conc ;
            ecotox:species ?s .
}
Listing 5: Example query for selecting all species, compounds, and concentrations, where the species is endangered and lives in the freshwater lake Langtjern.
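The comparison between effect concentrations and measured exposure can be illustrated with a hypothetical sketch. It is not the authors' case study API: the function name, the input shapes, the made-up numbers, and the simple rule of flagging a compound when the measured level reaches the LC50 are all assumptions, and units are assumed to match.

```python
def flag_exceedances(lc50_rows, exposure):
    """List (species, compound, measured, lc50) where the measured level reaches the LC50.

    lc50_rows: iterable of (species, compound, lc50) tuples, e.g. rows from a query like Listing 5.
    exposure: dict mapping compound to a measured concentration in the same unit as the LC50.
    """
    flagged = []
    for species, compound, lc50 in lc50_rows:
        measured = exposure.get(compound)
        if measured is not None and measured >= lc50:
            flagged.append((species, compound, measured, lc50))
    return flagged


# Made-up numbers for illustration; "ecotox:chemical/50000" is a placeholder identifier.
rows = [
    ("ecotox:taxon/26812", "ecotox:chemical/115866", 2.0),
    ("ecotox:taxon/26812", "ecotox:chemical/50000", 10.0),
]
samples = {"ecotox:chemical/115866": 2.5, "ecotox:chemical/50000": 0.1}
flagged = flag_exceedances(rows, samples)
print(flagged)
```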


This work is supported by grant 272414 from the Research Council of Norway (RCN), the MixRisk project (RCN 268294), the AIDA project, The Alan Turing Institute under the EPSRC grant EP/N510129/1, the SIRIUS Centre for Scalable Data Access (RCN 237889), the Royal Society, EPSRC projects DBOnto, and , and is organized under the Computational Toxicology Program at NIVA. We would also like to thank Martin Giese and Zofia C. Rudjord for their contribution in different stages of this project.