Knowledge Graph Embedding for Ecotoxicological Effect Prediction
Exploring the effects a chemical compound has on a species takes a considerable experimental effort. Appropriate methods for estimating and suggesting new effects can dramatically reduce the work needed to be done by a laboratory. In this paper we explore the suitability of using a knowledge graph embedding approach for ecotoxicological effect prediction. A knowledge graph has been constructed from publicly available data sets, including a species taxonomy and chemical classification and similarity. The publicly available effect data is integrated into the knowledge graph using ontology alignment techniques. Our experimental results show that the knowledge graph-based approach improves upon the selected baselines.
Extending the scope of risk assessment models is a long-term goal in ecotoxicological research. However, biological effect data is only available for a few combinations of chemical-species pairs. (Chemical and compound are used interchangeably in this paper.) Thus, one of the main efforts in ecotoxicological research is the design of tools and methods to extrapolate from known to unknown combinations in order to facilitate risk assessment predictions on a population basis.
The Norwegian Institute for Water Research (NIVA) is a leading Norwegian institute for fundamental and applied research on marine and freshwaters (NIVA Institute: https://www.niva.no/en). The Ecotoxicology and Risk Assessment programme at NIVA has over the last years developed a risk assessment system called RAdb (NIVA Risk Assessment Database: https://www.niva.no/en/projectweb/radb). This system has been applied to several case studies based on agricultural/industrial runoff into lakes or fjords. However, the underlying relational database structure of RAdb has its limitations when dealing with the integration of diverse data and knowledge sources. This limitation is exacerbated when these resources do not share a common vocabulary, as is the case in our ecotoxicology risk assessment setting.
In this paper we present a preliminary study of the benefits of using Semantic Web tools to integrate different data sources, and of knowledge graph embedding approaches to improve ecotoxicological effect prediction. Hence, our contribution to the NIVA institute is twofold:
We have created a knowledge graph by gathering and integrating the relevant biological effect data and knowledge. Note that the format of the source data varies from tabular data to SPARQL endpoints and ontologies. In order to discover equivalent entities we exploit internal resources, external resources (e.g., Wikidata) and ontology alignment (e.g., LogMap).
We have evaluated three knowledge graph embedding models (TransE, DistMult and HolE) together with the (baseline) prediction model currently used at NIVA. Our evaluation shows a considerable improvement with respect to the baseline and the benefits of using the knowledge graph models in terms of recall and F_β score. Note that, in the NIVA use case, false positives are preferred over false negatives (i.e., flagging a harmless chemical is preferable to missing the hazard of a chemical to a species).
The rest of the paper is organised as follows. Section 2 provides some preliminaries to facilitate the understanding of the subsequent sections. In Section 3 we describe the use case where the knowledge graph and prediction models are applied. The creation of the knowledge graph is described in Section 4. Section 5 introduces the effect prediction models, while Section 6 presents the evaluation of these models. Finally, Section 7 elaborates on the contributions and discusses future directions of research.
Knowledge graphs. We follow the RDF-based notion of knowledge graphs, which are composed of RDF triples ⟨s, p, o⟩, where s represents a subject (a class or an instance), p represents a predicate (a property) and o represents an object (a class, an instance or a data value, e.g., text, date or number). RDF entities (i.e., classes, properties and instances) are represented by a URI (Uniform Resource Identifier). A knowledge graph can be split into a TBox (terminology), often composed of RDF Schema constructors like class subsumption (e.g., ncbi:taxon/6668 rdfs:subClassOf ncbi:taxon/6657) and property domain and range (ecotox:affects rdfs:domain ecotox:Chemical), and an ABox (assertions), which contains relationships among instances (e.g., ecotox:chemical/330541 ecotox:affects ecotox:effect/202) and semantic type definitions (e.g., ecotox:taxon/28868 rdf:type ecotox:Taxon). (The OWL 2 ontology language provides more expressive constructors; note that the graph projection of an OWL 2 ontology can be seen as a knowledge graph.) RDF-based knowledge graphs can be accessed with SPARQL queries, the standard language to query RDF graphs.
Ontology alignment. Ontology alignment is the process of finding mappings or correspondences between a source and a target ontology or knowledge graph . These mappings are typically represented as equivalences among the entities of the input resources (e.g., ncbi:taxon/13402 owl:sameAs ecotox:taxon/Carya).
Embedding models. Knowledge graph embedding plays a key role in link prediction problems, where the goal is to learn a scoring function s(⟨subj, pred, obj⟩) that is proportional to the probability that the triple is encoded as true. Several models have been proposed, e.g., the Translating embeddings model (TransE). These models are applied to resolve missing facts in largely connected knowledge graphs, such as DBpedia. Embedding models have also been successfully applied in biomedical link prediction tasks (e.g., [3, 2]).
Evaluation metrics. We use (A)ccuracy, (P)recision, (R)ecall and the F_β score to evaluate the models. They are defined as

A = (TP + TN) / (TP + TN + FP + FN),  P = TP / (TP + FP),  R = TP / (TP + FN),

F_β = (1 + β²) · P · R / (β² · P + R),

where TP, TN, FP and FN stand for true positive, true negative, false positive and false negative, respectively. Essentially, accuracy is the proportion of correct classifications. Recall is a measure of how many expected positive predictions were found by our model, and precision is the proportion of positive predictions that were correctly classified. F_β is a combined measure of precision and recall. β = 1 gives equal weight, while β < 1 favours precision and β > 1 favours recall. Here we use F_1 (F in short) and F_2.

As the above metrics all depend on a selected threshold, we also use the area under the receiver operating characteristic (ROC) curve (AUC) to measure and compare the overall pattern recognition capability of the prediction models. ROC is the curve of the true positive rate (TP / (TP + FN), i.e., recall) against the false positive rate (FP / (FP + TN)), with the threshold ranging from 0 to 1 in small steps. AUC is the area under this curve; its values range between 0 and 1, and a larger AUC indicates higher performance.
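To make the metric definitions concrete, the following minimal Python sketch (illustrative, not taken from the paper's codebase) computes accuracy, precision, recall, F_β and a threshold-sweep AUC from predicted scores:

```python
import numpy as np

def confusion(y_true, y_score, threshold):
    """Count TP, TN, FP, FN for a given decision threshold."""
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    y_true = np.asarray(y_true)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    return tp, tn, fp, fn

def metrics(y_true, y_score, threshold=0.5, beta=1.0):
    """Accuracy, precision, recall and F_beta at one threshold."""
    tp, tn, fp, fn = confusion(y_true, y_score, threshold)
    acc = (tp + tn) / (tp + tn + fp + fn)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    b2 = beta ** 2
    f = (1 + b2) * p * r / (b2 * p + r) if p + r else 0.0
    return acc, p, r, f

def auc(y_true, y_score, steps=200):
    """Area under the ROC curve, sweeping the threshold from 1 to 0
    and integrating (FPR, TPR) points with the trapezoidal rule."""
    pts = []
    for t in np.linspace(1.0, 0.0, steps):
        tp, tn, fp, fn = confusion(y_true, y_score, t)
        pts.append((fp / (fp + tn) if fp + tn else 0.0,
                    tp / (tp + fn) if tp + fn else 0.0))
    pts.sort()
    return sum((x2 - x1) * (y1 + y2) / 2.0
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))
```

For perfectly separated scores the sweep recovers an AUC of 1, matching the intuition that larger AUC indicates better ranking of positives over negatives.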
Ecotoxicology is a multidisciplinary field that studies the ecological and toxicological effects of chemical pollutants on populations, communities and ecosystems. Risk assessment is the result of the intrinsic hazards of a substance combined with an estimate of the environmental exposure (i.e., Hazard + Exposure = Risk).
The Computational Toxicology Program within NIVA’s Ecotoxicology and Risk Assessment section aims at designing and developing prediction models to assess the effect of chemical mixtures over a population where traditional laboratory data cannot be easily acquired.
Figure 1 shows the risk assessment pipeline followed at NIVA. Exposure is data gathered from the environment, while effects are hypotheses that are tested in a laboratory. These two data sources are used to calculate risk, which is used to find (further) susceptible species and the mode of action (MoA), or type of impact, a compound would have over those species. Results from the MoA analysis are used as new effect hypotheses.
LC50 — Lethal concentration for 50% of test population
EC50 — Effective concentration for 50% of test population
LOEC — Lowest observable effect concentration
NR-LETH — Lethal to 100% of test population
LD50 — Lethal dose for 50% of test population
The effect data is gathered during experiments in a laboratory, where the population of a single species is exposed to a concentration of a toxic compound. Most commonly, the mortality rate of the population is measured at each time interval until it becomes constant. Although the mortality at each time interval is referred to as an endpoint in the ecotoxicology literature, we use outcome of the experiment to avoid confusion. Table 1 shows the typical outcomes and their proportion within the effects data. To give a good indication of the toxicity to a species, these experiments need to be repeated with increasing concentrations until the mortality reaches 100%. However, this is time consuming and is generally not done (sola dosis facit venenum). Hence, some compounds may appear more toxic than others due to limited experiments. Thus, when evaluating prediction models, (higher values of) recall is preferred over precision.
Risk assessment methods require large amounts of effect data to efficiently predict long-term risk for ecosystems. The data must cover a minimum of the chemicals found when analysing water samples from the ecosystem, along with covering the species present in the ecosystem. This leads to an immense search space that is close to impossible to cover in its entirety. Thus, it is essential to extrapolate from known to unknown chemical-species combinations and suggest (ranked) effect hypotheses to the laboratory. The state of the art within effect prediction are quantitative structure-activity relationship models (QSARs). These models have shown promising results for use in risk assessment. However, QSARs have limitations with regard to the coverage of compounds and species: they use some chemical properties, but usually only consider one or a few species at a time. In this work we contribute an alternative approach based on knowledge graph embeddings, where the knowledge graph provides a global and integrated view of the domain.
Currently, the NIVA RAdb is under redevelopment, which gives the opportunity to include sophisticated effect prediction approaches, like the one presented in this paper, as a novel module for improving domain-wide regulatory risk assessment.
Risk assessment involves different data sources and laboratory experiments as shown in Figure 1. In this section we describe the relevant datasets and their integration to create the Toxicological Effects and Risk Assessment (TERA) knowledge graph (see Figure 2).
We rely on the ECOTOXicology database (ECOTOX). ECOTOX consists of tests (or experiments) derived from the literature. An ECOTOX test considers the effect of one chemical on one species, which implies that only a small fraction of the possible compound-species pairs have been tested. The effect is categorised in one of a plethora of predefined outcomes. For example, the outcome LC50 implies lethal concentration for 50% of the test population. Table 1 shows the most frequent outcomes in ECOTOX.
Table 2 contains an excerpt of the ECOTOX database. ECOTOX includes information about the compounds and species used in the tests. This information, however, is limited and additional (external) resources are required to complement ECOTOX.
The number of outcomes per compound and species varies substantially. For example, there are 1,881 experiments where the compound used is sulfuric acid, and 9,436 experiments where Pimephales promelas (fathead minnow) is the test species, while the median numbers of experiments per chemical and per species are far lower. Figure 3 visualises a subset of the outcomes; here the zero values are either no effect or missing. This figure shows certain features of the data, e.g., that compounds are more diversely used than species and that compound similarity is closely correlated to effects with regard to a species.
Currently, the ECOTOX database is used in risk assessment as reference data when calculating risk for an ecosystem, essentially by comparing the reference and the observed chemical concentrations (per species). Since most compounds have multiple experiments per species, the mean and standard deviation of the risk to a species can be calculated. However, if there is only one experiment for a compound-species pair we cannot calculate a standard deviation, and the risk assessment is featureless. Therefore, estimating new effects is important to represent the natural variability of the effect data.
Figure 2 shows the different datasets and the transformations that contribute to the creation of the TERA knowledge graph. For example, Triples (vii)-(ix) in Table 3 have been created from the ECOTOX effect data.
Each compound in the ECOTOX effect data has an identifier called CAS Registry Number, assigned by the Chemical Abstracts Service. The CAS numbers are proprietary; however, Wikidata (indirectly) encodes mappings between CAS numbers and open identifiers like InChIKey, a 27-character hash of the International Chemical Identifier (InChI) that encodes the chemical information in a unique manner. Hence, other datasets, such as PubChem, can be used to gather chemical features and classifications of compounds. PubChem is already available as a knowledge graph and can be imported directly. However, the PubChem hierarchy only contains permutations of compounds. To create a full taxonomy for the chemical data, we use the ChEMBL SPARQL endpoint to extract the classification (provided by the ChEBI ontology) for the relevant PubChem compounds. For example, Triples (v) and (vi) in Table 3 come from the integration with PubChem and ChEMBL.
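As an illustration, the CAS-to-InChIKey lookup can be expressed as a SPARQL query over Wikidata. The helper below is hypothetical (not part of the TERA build scripts); it uses the Wikidata properties P231 (CAS Registry Number) and P235 (InChIKey):

```python
def cas_to_inchikey_query(cas_numbers):
    """Build a SPARQL query for the Wikidata endpoint that maps CAS
    Registry Numbers (property P231) to InChIKeys (property P235)."""
    values = " ".join('"%s"' % c for c in cas_numbers)
    return (
        "SELECT ?cas ?inchikey WHERE {\n"
        "  VALUES ?cas { %s }\n"
        "  ?compound wdt:P231 ?cas ;\n"
        "            wdt:P235 ?inchikey .\n"
        "}" % values
    )
```

The resulting query string can be posted to the public endpoint at https://query.wikidata.org/sparql with any HTTP or SPARQL client.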
Aligning ECOTOX and NCBI. The species lineage in ECOTOX is not complete, and this (missing) information has therefore been complemented with the NCBI taxonomy, a curated classification of all of the organisms in the public sequence databases (covering a fraction of the species on Earth). The tabular data provided for the ECOTOX species and the NCBI taxonomies has been transformed into subsumption and disjointness triples (see the first four triples in Table 3). Leaf nodes are treated as instance entities.
Since there does not exist a complete and public alignment between the ECOTOX species and the NCBI Taxonomy, we have used the LogMap [11, 12] ontology alignment system to index and align the ECOTOX and NCBI vocabularies. ECOTOX currently only provides a subset of the mappings via its web search interface; we have gathered these ground truth mappings for validation purposes. The lexical indexation provided by LogMap left us with 5,472 possible NCBI entities to map to ECOTOX (we focus only on instances, i.e., leaf nodes). LogMap identified 4,681 (instance) mappings to ECOTOX, covering all mappings from the (incomplete) ground truth. The mappings computed by LogMap have been included in the TERA knowledge graph as additional equivalence triples (see Triple (x) in Table 3 as an example).
In this section we introduce the selected machine learning models to solve the effect prediction problem shown in Figure 4. We use the known effects, denoted as Affects and Not affects in the figure, to predict whether new proposed chemical-species pairs are true (Affects) or false (Not affects). The models are implemented with Keras; data and code are available from https://github.com/Erik-BM/NIVAUC.
Effect data sampling. A balance between positive and negative effect data samples is desired; therefore, we choose outcomes in the following categories (refer to Table 1): NOEL, LCp, LDp, NR-LETH and NR-ZERO (where p denotes the affected proportion of the test population). We are only concerned with the mortality rate in the experiments; consequently, we treat LC* and LD* identically. In addition, NR-LETH is treated as LC100. For simplicity, we treat the effects as binary: the outcome for a compound-species pair is 1 (affects) for the lethal outcomes and 0 (does not affect) for the no-effect outcomes.
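A minimal sketch of this labelling, assuming lethal outcomes map to 1 and no-effect outcomes to 0 (the function is hypothetical, not taken from the paper's code):

```python
def binary_outcome(outcome):
    """Map a sampled ECOTOX outcome label to a binary effect: lethal
    outcomes (LC*, LD*, NR-LETH) count as 'affects' (1), no-effect
    outcomes (NOEL, NR-ZERO) as 'does not affect' (0)."""
    outcome = outcome.upper()
    if outcome in ("NOEL", "NR-ZERO"):
        return 0
    if outcome == "NR-LETH" or outcome.startswith(("LC", "LD")):
        return 1  # LC* and LD* treated identically; NR-LETH as LC100
    raise ValueError("outcome outside the sampled categories: %s" % outcome)
```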
For example, according to Figure 4, one chemical-species pair is labelled 1 (i.e., the chemical affects the species) and another is labelled 0 (i.e., it does not), while a third pair is unknown and thus a prediction is required for this chemical-species pair.
Knowledge graphs. We rely on the TERA knowledge graph (see excerpts in Table 3 and Figure 4) to feed the knowledge graph embedding algorithms. For simplicity, we discard the ECOTOX species entities that do not have a correspondence to NCBI. Note that we currently do not consider literals.
This (baseline) prediction model is based on the current prediction method used at NIVA. The basic idea of this method is to find the nearest neighbours of the observed samples. In this context, the nearest neighbours are defined by hierarchy distance for species and by similarity for compounds. Therefore, we first define an adjacency matrix for the taxonomy and a similarity matrix for the compounds.
The hierarchy-based similarity of two species s_i and s_j can be measured via the overlap of their paths to the root, e.g., as |path(s_i, r) ∩ path(s_j, r)| / max(|path(s_i, r)|, |path(s_j, r)|), where r is the taxonomy root, path(s, r) denotes the classes in the path from s to r, and |·| denotes the cardinality. One basic approach to calculating the chemical similarity is the Jaccard index of the binary fingerprints of the compounds. Hence, the similarity matrix is defined as S_{ij} = |f_i ∧ f_j| / |f_i ∨ f_j|, where f_i and f_j denote the binary fingerprints of compounds i and j.
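These two similarity notions can be sketched in a few lines of Python (illustrative helpers; the exact normalisation of the taxonomy-based measure is an assumption):

```python
import numpy as np

def jaccard(fp1, fp2):
    """Jaccard index of two binary compound fingerprints."""
    fp1, fp2 = np.asarray(fp1, bool), np.asarray(fp2, bool)
    union = int(np.sum(fp1 | fp2))
    return int(np.sum(fp1 & fp2)) / union if union else 1.0

def path_similarity(path1, path2):
    """Overlap of two species' ancestor sets, where each argument is
    the set of classes on the path from the species to the root."""
    path1, path2 = set(path1), set(path2)
    return len(path1 & path2) / max(len(path1), len(path2))
```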
We define a matrix M over C × S, where C and S denote the sets of compounds and species, respectively. M contains all the observed effects (training set): an entry of M is 1 if the compound affects the species, 0 if it does not, and unknown otherwise. We can then make the prediction with the taxonomy adjacency, the compound similarity matrix and M, as shown in Algorithm 1. The algorithm terminates when k neighbours have been visited or there are no further neighbours to visit.
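A simplified sketch of such a nearest-neighbour lookup is given below. It only walks the compound similarity matrix, whereas Algorithm 1 in the paper also traverses the species taxonomy, so this is an illustration rather than a faithful reimplementation:

```python
import numpy as np

def knn_predict(c, s, M, S, k=5):
    """Predict the effect of compound c on species s from the k most
    similar compounds (by similarity matrix S) that have an observed
    effect on s, returning their similarity-weighted average effect.
    M holds observed effects, with NaN marking unknown pairs."""
    order = np.argsort(-S[c])          # most similar compounds first
    scores, weights = [], []
    for other in order:
        if other == c or np.isnan(M[other, s]):
            continue
        scores.append(M[other, s])
        weights.append(S[c, other])
        if len(scores) == k:           # k neighbours visited
            break
    if not scores:
        return 0.5                     # no evidence: undecided
    return float(np.average(scores, weights=weights))
```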
Our second prediction model is a multilayer perceptron (MLP) network with hidden layers. The model can be expressed as

ŷ = σ( W_n · g( ⋯ g( W_1 · [e_c ; e_s] + b_1 ) ⋯ ) + b_n ),

where [· ; ·] denotes vector concatenation, g is the rectifier function and σ is the logistic sigmoid function. The W_i are the weight matrices and the b_i are the biases for each layer. e_c and e_s are the embedded vectors of the chemical c and the species s. For example, e_c is defined as e_c = E_c x_c, where x_c is the one-hot encoded vector for entity c and E_c is an embedding transformation matrix to learn.
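The forward pass above can be sketched in plain NumPy (dimensions and initialisation are illustrative; the paper's implementation uses Keras):

```python
import numpy as np

rng = np.random.default_rng(0)

def one_hot(i, n):
    v = np.zeros(n)
    v[i] = 1.0
    return v

class EffectMLP:
    """Sketch of the MLP effect predictor: one-hot chemical and species
    indices are embedded, concatenated, passed through a ReLU hidden
    layer and a sigmoid output unit."""
    def __init__(self, n_chem, n_spec, k=8, hidden=16):
        self.Ec = rng.normal(0, 0.1, (k, n_chem))   # chemical embeddings
        self.Es = rng.normal(0, 0.1, (k, n_spec))   # species embeddings
        self.W1 = rng.normal(0, 0.1, (hidden, 2 * k))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(0, 0.1, hidden)
        self.b2 = 0.0

    def predict(self, chem, spec):
        ec = self.Ec @ one_hot(chem, self.Ec.shape[1])
        es = self.Es @ one_hot(spec, self.Es.shape[1])
        h = np.maximum(0.0, self.W1 @ np.concatenate([ec, es]) + self.b1)
        return 1.0 / (1.0 + np.exp(-(self.W2 @ h + self.b2)))
```

In practice the one-hot multiplication is a simple embedding lookup, and the weights are learned with gradient descent rather than left at their random initial values.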
We have extended the MLP model by feeding it with the TERA KG-based embeddings of the chemical and the species, which encode the information of the taxonomy and compound hierarchies, among other semantic relationships. Note that the TERA knowledge graph also includes similarity triples about compounds. These triples represent pairs of compounds whose similarity (as in Equation 7) is above a given threshold.
The embeddings are learned by applying the scoring function from one of DistMult, HolE and TransE. TransE was selected as it provides a very intuitive model, DistMult was included as it has shown state-of-the-art performance, while HolE was considered as it also encodes directional relations. The scoring function for DistMult is defined as

s(⟨subj, pred, obj⟩) = Σ_k (e_s)_k (w_p)_k (e_o)_k.
HolE uses a circular correlation scoring function, defined by

s(⟨subj, pred, obj⟩) = σ( w_p^T (e_s ⋆ e_o) ),  with  e_s ⋆ e_o = F^{-1}( conj(F(e_s)) ⊙ F(e_o) ),

where F and F^{-1} are the Fourier transform and its inverse, conj(·) is the elementwise complex conjugate and ⊙ denotes the Hadamard product. The final method is TransE, which has the scoring function

s(⟨subj, pred, obj⟩) = ‖ e_s + w_p − e_o ‖,

where ‖·‖ is the norm of the vector. Here e_s, w_p and e_o are the vector representations of the subject, predicate and object of a triple, respectively.
DistMult and HolE optimise for a score of 1 for positive samples and 0 for negative samples. TransE, on the other hand, scores positive samples as 0 and has no upper bound for negative samples. We therefore modify the TransE score function to exp(−‖e_s + w_p − e_o‖), such that positive samples score close to 1 and negative samples close to 0, avoiding the need to modify the labels.
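The three scoring functions can be sketched as follows, with exp(−·) as the assumed mapping of the TransE distance into (0, 1]:

```python
import numpy as np

def distmult(e_s, w_p, e_o):
    """DistMult: trilinear dot product of subject, predicate, object."""
    return float(np.sum(e_s * w_p * e_o))

def circular_correlation(a, b):
    """a * b = F^{-1}( conj(F(a)) . F(b) ), as used by HolE."""
    return np.real(np.fft.ifft(np.conj(np.fft.fft(a)) * np.fft.fft(b)))

def hole(e_s, w_p, e_o):
    """HolE: sigmoid of the predicate vector dotted with the
    circular correlation of subject and object embeddings."""
    x = np.dot(w_p, circular_correlation(e_s, e_o))
    return float(1.0 / (1.0 + np.exp(-x)))

def transe_modified(e_s, w_p, e_o):
    """TransE distance mapped into (0, 1]: exp(-||e_s + w_p - e_o||)."""
    return float(np.exp(-np.linalg.norm(e_s + w_p - e_o)))
```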
The embeddings are used in the same network as the MLP model. We train the embeddings and the classifier simultaneously using log loss and ADAGRAD. Training simultaneously optimises the embeddings with regard to both the knowledge graph triples and the classifier loss.
Sampling. We split the effect data into training and test sets. To prevent test set leakage, those training inputs that also appear in the test set are removed, resulting in a slightly smaller training proportion. The embedding models can be trained with the entirety of the knowledge graph, which is ignored during effect prediction. The negative knowledge graph samples are generated by randomly re-sampling the subject and object of a true sample, while maintaining the distribution of predicates. We generate four negative samples per positive sample.
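The negative sampling step can be sketched as follows (a hypothetical helper; the paper's code may differ in detail):

```python
import random

def corrupt(triples, entities, n_negatives=4, seed=42):
    """Generate negative samples by re-sampling the subject or object
    of each true triple, keeping the predicate (and hence the predicate
    distribution) fixed.  Assumes the entity set is large enough that
    non-positive corruptions exist."""
    rng = random.Random(seed)
    positives = set(triples)
    negatives = []
    for s, p, o in triples:
        made = 0
        while made < n_negatives:
            if rng.random() < 0.5:
                cand = (rng.choice(entities), p, o)   # corrupt subject
            else:
                cand = (s, p, rng.choice(entities))   # corrupt object
            if cand not in positives:
                negatives.append(cand)
                made += 1
    return negatives
```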
Baseline model settings. We tested the performance of the baseline with several choices of the number of nearest neighbours k. In addition to Algorithm 1, we tested an alternative technique for iterating over the data; however, Algorithm 1 yielded better results. The most balanced results were found with a moderate number of neighbours: with more neighbours recall increases, but accuracy and precision suffer a considerable decrease, since the use of more neighbours increases the false positive rate.
Embedding model settings. The embedding dimension used in the MLP and KG-embedding models was based on a search among several sizes. We found no difference between these sizes for the MLP model; therefore, the smallest was chosen to aid faster training. The KG-embedding models use a larger number of entities and need a larger embedding space to capture the features of the data; the dimension at which performance plateaued was chosen. The models were trained until the loss stopped improving for 5 iterations. For the KG-embedding models we used different loss weights for the embeddings and the effect predictor; these weights were chosen such that the embeddings and effects are learned at similar rates, with TransE using equal weights. We used dropout, and compound similarity triples were included only above a chosen similarity threshold. Note that we simultaneously train the embedding models and the effect predictor. We perform (i) 10-fold cross validation on the training set, and (ii) a clean test on the unseen test set. This test consists of an ensemble of 10 models trained on the training set, each with a new set of random negative knowledge graph samples. We used an ensemble to limit the impact the random negative samples have on the results.
Evaluation. Figures 4(a) and 4(b) and Table 4 show the results of the conducted evaluation for the five effect prediction models. Figures 4(a) and 4(b) visualise the impact on accuracy and recall of different thresholds on the prediction scores, while Table 4 presents the relevant evaluation metrics at a fixed threshold for the embedding-based models and a fixed number of neighbours for the baseline. The results can be summarised as follows:
The baseline is only slightly better than random choice given the prior binary output distribution, and would thus not be appropriate for predicting effects. Its false positive rate is also high; hence, it would not be practical to use as a recommendation system.
The MLP model is considerably better than the baseline and has a balance between precision and recall. We suspect that this balance is due to random choice when the model has not previously seen a chemical or species, i.e., a prediction close to the decision boundary when an input is unseen will maintain the false negative/positive proportion. This is good for accuracy, but not necessarily for giving (interesting) recommendations to the laboratory.
Introducing the background knowledge in the form of KG embeddings gives higher recall without losing accuracy. In contrast to the plain MLP, the embedding-based models are more uncertain when unseen combinations are presented (in dubio pro reo). Therefore, they are better suited to giving recommendations for cases where there is limited information about the chemical and the species in the effect data.
The best results in terms of recall at the fixed threshold (see Table 4) are obtained by the model with the embeddings provided by HolE.
As shown in Figures 4(a) and 4(b), lowering the decision threshold would yield a higher recall for the DistMult-based model while maintaining the accuracy. The TransE- and HolE-based models reach higher recall at lower decision thresholds; however, this comes at the cost of a reduction in accuracy.
The highest overall F_β score is shared by all three embedding-based models, albeit at different decision boundaries for the models with TransE, DistMult and HolE embeddings.
We have created a knowledge graph called TERA that aims at covering the knowledge and data relevant to the ecotoxicological domain. We have also implemented a proof-of-concept prototype for ecotoxicological effect prediction based on knowledge graph embeddings. The obtained results are encouraging, showing the positive impact of using knowledge graph embedding models and the benefits of having an integrated view of the different knowledge and data sources.
Knowledge graph. The TERA knowledge graph is by itself an important contribution to NIVA. TERA integrates different knowledge and data sources and aims at providing a unified view of the information relevant to the ecotoxicology and risk assessment domain. At the same time, the adoption of an RDF-based knowledge graph enables the use of (i) an extensive range of Semantic Web infrastructure that is currently available (e.g., reasoning engines, ontology alignment systems, SPARQL query engines), and (ii) state-of-the-art knowledge graph embedding strategies.
Prediction models. The obtained predictions are promising and show the validity of the selected models in our setting and the benefits of using the TERA knowledge graph. As mentioned before, we favour recall with respect to precision. On the one hand, false positives are not necessarily harmful, while overlooking the hazard of a chemical may have important consequences. On the other hand, due to the limited experiments in terms of concentration (i.e., effect data may not be complete), some chemicals may look less toxic than others while they may still be hazardous.
Value for NIVA. The conducted work falls into one of the main research lines of NIVA's Computational Toxicology Program (NCTP): to enhance the generation of hypotheses to be tested in the laboratory. Furthermore, the data integration efforts and the construction of the TERA knowledge graph also go in line with the vision of NIVA's section for Environmental Data Science. The availability and accessibility of the best knowledge and data will enable optimal decision making.
Novelty. Knowledge graph embedding models have been applied in general purpose link discovery and knowledge graph completion tasks. They have also attracted attention in the biomedical domain to find, for example, candidate genes for a disease, protein-protein interactions or drug-target interactions (e.g., [3, 2]). However, we are not aware of the application of knowledge graph embedding models in the context of toxicological effect prediction.
Future work. The main goal in the mid-term future is to integrate the TERA knowledge graph and the machine learning based prediction models within NIVA’s risk assessment pipeline. In the near future, we intend to improve the current ecotoxicological effect prediction prototype and evaluate the suitability of more sophisticated models like Graph Convolutional Networks. The TERA knowledge graph will also be extended with additional information about species (e.g., interactions) and compounds (e.g., target proteins) which is expected to enhance the computed embeddings and the effect predictions.
Resources. The datasets, evaluation results, documentation and source code are available from the following GitHub repository: https://github.com/Erik-BM/NIVAUC
This work is supported by grant 272414 from the Research Council of Norway (RCN), the MixRisk project (RCN 268294), the AIDA project, The Alan Turing Institute under the EPSRC grant EP/N510129/1, the SIRIUS Centre for Scalable Data Access (RCN 237889), the Royal Society, and EPSRC projects including DBOnto. We would also like to thank Martin Giese and Zofia C. Rudjord for their contribution in early stages of this project.
Agibetov, A., Samwald, M.: Global and local evaluation of link prediction tasks with neural embeddings. In: 4th Workshop on Semantic Deep Learning (ISWC workshop). pp. 89–102 (2018)
Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12, 2121–2159 (2011)
Jimenez-Ruiz, E., Cuenca Grau, B., Zhou, Y., Horrocks, I.: Large-scale interactive ontology matching: Algorithms and implementation. In: the 20th European Conference on Artificial Intelligence (ECAI). pp. 444–449. IOS Press (2012)