Knowledge Graphs (KGs) are knowledge representation structures commonly used to store complex structured or unstructured data. Graphs can be directed or undirected, with vertices representing entities (words, objects or concepts) and edges representing relationships between these entities. A KG contains two forms of knowledge: relational knowledge, which encodes the relationships between entities, and categorical knowledge, which encodes the attributes of entities.
Multi-relational graphs encode data via knowledge databases, or semantic networks, and are widely used in the Semantic Web and for knowledge representation in bioinformatics (the Gene Ontology, for instance) or natural language processing (WordNet). In these graphs, facts are modelled as triples of the form (subject, predicate, object), where a predicate models either the relationship between two entities or between an entity and an attribute value. Note that any information in the KG can be represented by a triple or a concatenation of several triples. Such data sources are equivalently termed multi-relational graphs, and they can be represented by a tensor in which each slice is the adjacency matrix of a single predicate.
Graph databases, such as Freebase [bollacker2008freebase], constitute rich repositories of annotated information that can be used in question answering applications. Simple or compositional questions can be formulated, such as "What chemical can increase expression of protein X in cell Y given that it was exposed to a substance Z?". However, KGs are very incomplete and sparsely connected. An elegant solution to the data incompleteness problem is to use vector space representations and to control the dimensionality of the vectors so as to obtain good generalization on new facts [yang2015embeddings].
For instance, the Word2vec model [mikolov] has been proposed to capture the semantics of words through their context: the principle is that words that are semantically similar should be close to similar context words. The drawback of this approach is that the learned representations are based mainly on word (or entity) co-occurrences and cannot capture the relationship between two syntactically or semantically similar words if either of them yields very little context information.
In spite of their strong ability to represent complex data, multi-relational graphs are still complicated to understand: relations can be missing or invalid, there can be redundancy among entities because several nodes actually refer to the same concept, and so on. Furthermore, most multi-relational graphs are very large in terms of number of entities and of relation types, which makes visualization of the represented knowledge hard; for example, Freebase contains more than 20 million entities and billions of facts. Finally, most relations are localized on very few nodes (the so-called fat-tail effect), making inference of new facts regarding most entities very hard.
Here we use a recent compositional method for KG completion where the learning process is framed as the inference of new connections between nodes [compos]. This model addresses the problem of exploring relationships that go beyond simple triples, capturing chains of causality to generate and test hypotheses in the biomedical context.
We also address the problem of visualising the information contained in these graphs, a complex task since some nodes may have thousands of edges (relations) of dozens of different types. The combinatorial explosion of the number of possible causality relations severely constrains the use of graphical display tools.
This work addresses these issues by applying a deep convolutional network to visualize information contained in graphs with distributed representations. We demonstrate how to extract semantic fingerprints and show how they are useful for knowledge discovery and classification problems.
This paper is organised as follows: Section 2 presents the model for embeddings and new fact discovery. Section 3 describes the data and Section 4 describes the SOM model, the semantic fingerprints obtained and some results. Section 5 describes the CNN model applied to the self-organized maps. Section 6 presents an application to protein-compound prediction and Section 7 contains the conclusions and future work.
2 The compositional model for graph embeddings
Much work on knowledge graph completion has been based on symbolic methods. These methods represent knowledge through simple algebraic operations but they are, in general, not tractable. Recently, a powerful approach for this task is to encode every element (entities and relations) of a knowledge graph into a low-dimensional embedding vector space. Among these methods, TransE [bordes2014qa] is a simple and effective one with state-of-the-art link prediction precision. It learns low-dimensional embeddings for every entity and relation in the KG, where the basic idea is that every relation is regarded as a vectorial translation in the embedding space. For a triple (h, r, t), the embedding of the tail t should be close to the embedding of the head h translated by the embedding vector r, so that h + r ≈ t. TransE is suitable for 1-to-1 relations, but is less robust when dealing with 1-to-N, N-to-1 or N-to-N relations. The model is trained by minimising a loss function measuring how well new links are predicted in the test set.
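The translation principle above can be sketched in a few lines; this is a minimal NumPy illustration with toy 3-dimensional vectors, not the paper's actual implementation.

```python
import numpy as np

def transe_score(h, r, t):
    """Negative L2 distance between the translated head h + r and the tail t.

    Scores closer to zero mean the triple (h, r, t) is more plausible."""
    return -np.linalg.norm(h + r - t)

# Toy embeddings: t is constructed as exactly h + r, so the true triple
# scores 0 (a perfect fit), while an unrelated tail scores strictly lower.
h = np.array([0.1, 0.2, 0.3])
r = np.array([0.4, -0.1, 0.0])
t = h + r
assert transe_score(h, r, t) == 0.0
assert transe_score(h, r, np.array([1.0, 1.0, 1.0])) < 0.0
```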
Recently a compositional model was proposed [compos]. In this model the objective to be minimised is a max-margin objective of the form

J(Θ) = Σ_{(q, t)} Σ_{t′ ∈ Neg(q)} [ 1 − score(q, t) + score(q, t′) ]_+ ,

where t′ ranges over negative (incorrect) answers to the query q, [x]_+ = max(0, x), and the parameters Θ are the membership operator, the traversal operators, and the entity vectors.
The idea is to query not just simple triples but arbitrarily complex ones from a set of path query training examples with path lengths ranging from 1 to L. While most models are trained only with objective functions over queries of path length 1, this objective function takes extended composed paths into account.
2.1 TransE model
There are many possible implementations of the membership operator M and the traversal operators T_r, but we will use TransE [bordes2013translating] due to its simplicity and performance on knowledge base completion.
Given a training set of triples (h, r, t) composed of two entities h, t ∈ E (the set of entities) and a relationship r ∈ R (the set of relationships), the model learns vector embeddings of the entities and of the relationships that minimize a given loss function. TransE builds on the idea that the relation induced by the edges can be captured by a translation of the embeddings. It uses the scoring function

score(h, r, t) = −‖h + r − t‖,

where h, r and t are all d-dimensional vectors, and the membership operator is defined as

M(x, t) = −‖x − t‖,

with the traversal operator T_r(x) = x + r. TransE can therefore handle a path query s / r_1 / … / r_k using

score = −‖x_s + r_1 + … + r_k − x_t‖.
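Since the TransE traversal operator is just vector addition, scoring a whole path query reduces to summing the relation vectors along the path before comparing with the candidate target. A minimal sketch with toy vectors:

```python
import numpy as np

def path_score(source, relations, target):
    """Score a path query s / r1 / ... / rk against a candidate target.

    With TransE the traversal operator is T_r(x) = x + r, so traversing a
    path amounts to adding up the relation vectors in order."""
    x = source + np.sum(relations, axis=0)
    return -np.linalg.norm(x - target)

# A two-step path whose translations lead exactly to `good` should score
# higher (closer to zero) than an unrelated candidate.
s = np.array([0.0, 0.0])
rels = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
good, bad = np.array([1.0, 1.0]), np.array([5.0, 5.0])
assert path_score(s, rels, good) > path_score(s, rels, bad)
```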
We use Stochastic Gradient Descent (SGD) with AdaGrad [duchi10adagrad] to optimize the objective, which is non-convex. We initialise all parameters with i.i.d. Gaussians of variance 0.25, use a mini-batch size of 100 examples, and a step size chosen via cross-validation. For each example, we sample 10 negative entities. As suggested by [compos], during training all of the entity vectors are constrained to lie in the unit ball, and we clipped the gradients to the median of the observed gradients whenever the update exceeded 3 times the median. The dimensionality d of the encodings was kept fixed, the margin was set to 1, and we use the L2 metric.
The models were implemented using the Theano library [bastien2012theano], which is very fast and scalable.
3 Data Description and Training
We use two knowledge base completion datasets consisting of single-edge queries, a subset of WordNet [socher2013reasoning] and CTD [ctd], as described in Table 1. The original CTD contains around 575 150 nodes (genes, chemicals, diseases, proteins, etc.) and 2 965 279 edges. For this work we only consider relations between compounds (chemicals) and genes, mostly, but not all, coding for proteins. This information was extracted from manual annotations of around a hundred thousand scientific publications.
Table 1:
|          | WordNet | CTD     |
| Entities | 38 696  | 24 382  |
| Train    | 112 581 | 316 321 |
| Test     | 10 000  | 30 000  |
CTD is very different from WordNet since it is a bipartite graph between compounds and genes, while in WordNet both head and target entities are arbitrary words. In WordNet relations can be reversed, but not in CTD. There are about 1700 different types of relationships between compounds and genes. We only consider those with more than 1000 facts, ending up with 42 relation types and 316 321 facts used for training.
Following [compos] we generate path queries by performing random walks on the graph thus generating an extended set of auxiliary training data.
We start from the training graph formed by edges in the training set. New training examples are generated as follows:
Uniformly sample a path length L ∈ {2, …, L_max}, and uniformly sample a starting entity s.
Perform a random walk beginning at entity s and continuing for L steps.
At step i of the walk, choose a relation r_i uniformly from the set of relations incident on the current entity e_{i−1}.
Choose the next entity e_i uniformly from the set of entities reachable via r_i.
Output a query-answer pair (q, t), where q = s / r_1 / … / r_L and t is the final entity of the random walk.
Paths of length 1 were not sampled; instead, we added all of the edges of the training graph directly to the dataset.
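The random-walk procedure above can be sketched as follows. The adjacency container (a dict mapping each entity to {relation: [reachable entities]}) and the toy relation names are assumptions of this sketch, not the paper's data format.

```python
import random

def sample_path_query(graph, entities, max_len, rng=None):
    """Sample one path query (s / r1 / ... / rL, t) via a random walk.

    Returns ((start, relations), final_entity), or None if the walk
    reaches a dead end (a node with no outgoing relations)."""
    rng = rng or random.Random(0)
    length = rng.randint(2, max_len)   # paths of length 1 are not sampled
    current = rng.choice(entities)     # uniform starting entity
    start, relations = current, []
    for _ in range(length):
        incident = graph.get(current)
        if not incident:
            return None
        rel = rng.choice(sorted(incident))          # uniform incident relation
        relations.append(rel)
        current = rng.choice(incident[rel])         # uniform reachable entity
    return (start, tuple(relations)), current

# Tiny bipartite toy graph in the spirit of CTD (names are illustrative).
graph = {"aspirin": {"increases_expression": ["PTGS2"]},
         "PTGS2": {"expression_increased_by": ["aspirin"]}}
q, t = sample_path_query(graph, ["aspirin", "PTGS2"], max_len=2)
assert t in graph and len(q[1]) == 2
```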
4 The SOM visualisation
The Self-Organized Map (SOM) is an algorithm for supervised or unsupervised clustering, useful to represent high-dimensional data as a topographic two-dimensional map. SOMs were introduced by Kohonen [som] as an unsupervised learning process that learns the distribution of a set of patterns without any class information. A pattern is projected from the input space to a position in the map, and information is coded as the location of an activated node. The advantage of the SOM is that it provides a topological ordering of the classes in which similarity is preserved, and it is useful for classifying data with a large number of categories, as in our case, where it is difficult to define class boundaries. A SOM can be seen as a dimensionality reduction technique that uses a neighbourhood function to preserve the topological properties of the input space.
A SOM operates in two modes: training and mapping. "Training" builds a map with input examples, while "mapping" classifies unseen input vectors. In our case we use the SOM to aggregate the embeddings obtained from the compositional TransE model.
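Both modes can be sketched compactly; this is a minimal SOM with a Gaussian neighbourhood and the linearly decaying learning rate used later in this paper, on a small grid with far fewer updates than a real run.

```python
import numpy as np

def som_train(data, grid=(10, 10), dim=3, n_updates=1000, seed=0):
    """Training mode: fit a codebook of grid[0]*grid[1] code-vectors."""
    rng = np.random.default_rng(seed)
    codebook = rng.normal(size=(grid[0] * grid[1], dim))
    coords = np.array([(i, j) for i in range(grid[0]) for j in range(grid[1])])
    for t in range(n_updates):
        x = data[rng.integers(len(data))]
        bmu = np.argmin(np.linalg.norm(codebook - x, axis=1))  # best-matching unit
        lr = 0.5 * (1 - t / n_updates)                 # decaying learning rate
        sigma = max(1.0, grid[0] / 2 * (1 - t / n_updates))  # shrinking neighbourhood
        d2 = np.sum((coords - coords[bmu]) ** 2, axis=1)
        h = np.exp(-d2 / (2 * sigma ** 2))             # Gaussian neighbourhood
        codebook += lr * h[:, None] * (x - codebook)
    return codebook

def som_map(codebook, x):
    """Mapping mode: return the index of the winning cell for input x."""
    return int(np.argmin(np.linalg.norm(codebook - x, axis=1)))
```

On toy data with two well-separated clusters, points from different clusters land on different cells, which is the property the fingerprints below rely on.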
The advantage of the SOM over other methods, like t-SNE, is that it is simpler to interpret the results and it lends itself to a more flexible representation. The next section explains how these maps can be interpreted and their usefulness in the biomedical context.
We used a rectangular 50x50 map with circular boundary conditions. Figure 2 shows the code-vectors of the compound embeddings for the CTD database, for the embedding dimension chosen above, together with the quality of each node.
In Figure 2 we group the SOM cells into five categories, identified by colours. Each region corresponds to a specific pattern of interaction between a set of genes and compounds. We can see that there are some well-defined compact segments while others have a more scattered pattern (green). These clusters are explained in detail below.
In order to verify the results, we aggregate the SOM code-vectors into five clusters and project the interactions of the compounds with a set of genes aggregated by these clusters - see Table 2.
In order to assess the quality of the projection, for a specific cell in the SOM map we select the chemicals associated with it, together with the genes that are at a short distance in the TransE model. From the CTD database we then select the genes associated with these compounds and extract all the genes that have interactions with these chemicals. Based on these entities, we build a sub-graph and evaluate the Jaccard similarities between the gene sets of all the chemicals. We run this procedure for all the cells in the SOM and compute a global ratio to evaluate the semantic capability of the clustering just obtained. The value was 1.37, which contrasts with the initial value of 0.0031, a factor of about 400 (all types of interactions were considered). This shows that the aggregation made by the SOM has semantic meaning and is informative about the interactions in the original graph.
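The Jaccard measure used in this evaluation is standard; a minimal sketch on toy gene sets (the gene names are just illustrative):

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of interacting genes:
    |intersection| / |union|, in [0, 1]."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Two chemicals mapped to the same SOM cell should share many gene
# partners (high Jaccard); unrelated ones should not.
assert jaccard({"IL10", "EDN1", "TNF"}, {"IL10", "EDN1"}) == 2 / 3
assert jaccard({"IL10"}, {"TNF"}) == 0.0
```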
For similar gene interaction patterns we expect similar mappings, as is in fact the case for IL10 (interleukin 10) and EDN1 (endothelin 1). Note that this does not imply that the two genes are similar, only that they have a similar interaction pattern.
4.1 From fingerprints to drug discovery
Now that we know that the SOM represents meaningful information, we can go a step further in interpreting the results and help the researcher visualize the data and test new hypotheses.
The traditional concept that drugs exert their activities by modulating one target of particular relevance to a disease has guided the pharmaceutical industry in recent decades. However, there is evidence that this picture is incomplete, since many drugs interact with multiple targets [ref7]. Furthermore, a significant number of chemical compounds have failed to get approved due to serious clinical side effects observed during later stages. The lesson is that multi-target interactions of drugs are largely unknown and poorly understood.
For a drug to have the desired effect we need to modulate a set of targets to achieve efficacy, while avoiding others to reduce the risk of side effects. By considering all types of interactions between compounds and proteins, our method can be useful for the development of multi-target drugs.
Since we know the components involved in each disease, we can create its unique fingerprint based on the activation level of each code-vector with respect to the genes involved in the disease. In Figure 4 we plot the fingerprints for lung and ovary cancer, i.e., for the genes that are involved in the onset of each disease. We quantise the distances to the code-vectors into colours (red for a Euclidean distance below 0.1, green for a distance between 0.1 and 0.2). All larger distances were not considered.
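The colour quantisation above is a simple thresholding of distances; a sketch (the integer colour codes are an assumption of this illustration):

```python
import numpy as np

def fingerprint(distances, thresholds=(0.1, 0.2)):
    """Quantise cell-to-disease-gene distances into colour codes:
    2 = red (d < 0.1), 1 = green (0.1 <= d < 0.2), 0 = not shown."""
    d = np.asarray(distances)
    return np.where(d < thresholds[0], 2, np.where(d < thresholds[1], 1, 0))

# Three cells at increasing distance from the disease gene set.
assert list(fingerprint([0.05, 0.15, 0.5])) == [2, 1, 0]
```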
The SOM fingerprint is helpful for visualizing the differences between the sets using distances between the molecular fingerprints. This technique clusters compounds (or genes) with similar interaction patterns into the same best-matching cell, while also maintaining a 2-dimensional grid of cells such that similar compounds or genes (depending on the mapping being used) appear in adjacent cells. Note that this pattern is not necessarily related to the structure or function of the genes/compounds being considered.
5 Analysis of SOM maps with CNN
SOM fingerprints are very useful for visualization but they are limited in terms of abstract feature extraction. In this section we apply what we have learnt to build a supervised model for the protein docking problem using a Convolutional Neural Network (CNN). CNNs are powerful neural networks specially designed to capture invariant features in images, setting the state of the art in image classification [cnn]. They also have a very interesting property in terms of capturing abstract representations of high-dimensional data. We will apply them to a very well known and important problem in structural biology: protein docking. We call this combination of SOM fingerprints and a supervised CNN the SOME model.
This problem is in general ill-posed and not sufficiently constrained: many models can fit the data, thus achieving poor generalization. Convolutional networks incorporate hard constraints on learning and are good at detecting invariants in data, whether to translation or deformation, which makes them particularly useful for image analysis. They use three basic concepts: i) local receptive fields, ii) weight sharing, and iii) spatial subsampling.
The network we will use consists of a set of layers, each of which contains one or more planes. Normalized images enter at the input layer and each unit receives input from a small neighborhood in the planes of the previous layer. The weights forming the receptive field for a plane are forced to be equal at all points in the plane. Each plane can be considered as a feature map with a fixed feature detector that is convolved with a local window scanned over the planes in the previous layer. Multiple planes are usually used in each layer so that multiple features can be detected. These layers are called convolutional layers. Once a feature has been detected, its exact location is less important. Hence, the convolutional layers are typically followed by another layer which performs a local averaging and subsampling operation (e.g., for a subsampling factor of two, y_{i,j} = (x_{2i,2j} + x_{2i+1,2j} + x_{2i,2j+1} + x_{2i+1,2j+1}) / 4, where y_{i,j} is the output of a subsampling plane at position (i, j) and x is the output of the same plane in the previous layer). The network is trained with the usual Stochastic Gradient Descent.
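The factor-two averaging subsampling just described can be written as a one-liner over non-overlapping 2x2 blocks; a small NumPy sketch:

```python
import numpy as np

def subsample2(x):
    """2x2 average subsampling: each output unit is the mean of a
    non-overlapping 2x2 block of the previous plane (dims assumed even)."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

# A 4x4 plane shrinks to 2x2; the top-left output is the mean of the
# top-left 2x2 block: (0 + 1 + 4 + 5) / 4 = 2.5.
x = np.arange(16, dtype=float).reshape(4, 4)
assert subsample2(x).shape == (2, 2)
assert subsample2(x)[0, 0] == 2.5
```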
5.1 SOME model
A high-level block diagram of the system proposed for protein docking prediction is shown in Fig. 4.
The system works as follows:
First we learn the gene and compound embeddings from the CTD database using the compositional TransE model.
Then we learn a SOM map that quantizes the d-dimensional input vectors into a sparse representation of 2500 topologically organized values: the fingerprints, or feature vectors.
Then a fixed-size window is used on the SOM map and local "image" samples are extracted. The window is moved over the map at each step.
The training samples (compound + gene) are passed through the SOM at each step, thereby creating new training and test sets in the output space.
Finally, a convolutional neural network is trained on the transformed training/test sets using supervised learning for protein docking prediction.
For the SOM, training is split into two stages: an ordering phase and a fine adjustment phase. In the first phase 10 000 updates are executed, and 5 000 in the second. The learning rate during training is 0.5(1 − t/T), where t is the current update number and T the total number of updates.
The idea of using CNN to analyse graphs embeddings projected into SOM maps is particularly interesting as it allows to extract abstract features from the high dimensional data, representing invariants that are hard to spot in other formats.
6 Results

We tested the compositional TransE model for link prediction and the SOME model for: i) consistency of clustering and ii) compound-target affinity prediction.
6.1 Compositional TransE model
In the first case we predict the fraction of new relations that are correctly predicted against a random set of relations. Table 3 presents the results. The compositional model shows a considerable gain with respect to single-node training, both for WordNet and for CTD. We used as metric the hits at top 10 (percentage of correct answers ranked in the top 10 predicted answers). On CTD the improvement is even more remarkable on the test set.
Table 3:
| Path query task | WordNet | CTD |
6.2 SOME results
Now we apply the SOME model to the prediction of protein ligands. This is a supervised learning problem where the objective is to evaluate the binding affinity of a specific compound to a protein. Normally this problem is addressed by taking into consideration the structure of the molecule and of the protein.
For the SOME model we used a 50x50 grid where each cell is represented by the distance of the entity (head or tail) embedding to the respective cell code-vector. In this case we have two arrays: one for the proteins and one for the chemicals. For the chemicals we get an average of 2.2 compounds per cell, and for the proteins/genes an average of 8.3 genes per SOM cell.
The CNN was trained using the Keras framework (keras.io), a high-level framework based on the Theano library. The following configuration was used:
8x24 input layer
convolutional layer with 71 3x3 filters with tanh activation
2x2 max-pooling
convolutional layer with 88 2x2 filters with tanh activation
2x2 max-pooling
a fully connected layer of size 26 with tanh activation
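The layer sizes above can be checked by propagating shapes through the stack; this sketch assumes "valid" (no-padding) convolutions and non-overlapping 2x2 pooling, which is the usual Keras default of that era.

```python
def conv_valid(shape, k):
    """Spatial output size of a 'valid' convolution with a k x k filter."""
    return (shape[0] - k + 1, shape[1] - k + 1)

def pool2(shape):
    """Spatial output size after non-overlapping 2x2 pooling."""
    return (shape[0] // 2, shape[1] // 2)

s = (8, 24)           # input layer
s = conv_valid(s, 3)  # 71 filters of 3x3 -> (6, 22)
s = pool2(s)          # -> (3, 11)
s = conv_valid(s, 2)  # 88 filters of 2x2 -> (2, 10)
s = pool2(s)          # -> (1, 5)
flat = s[0] * s[1] * 88   # flattened features feeding the size-26 dense layer
assert s == (1, 5) and flat == 440
```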
The number of filters and the size of the fully connected layer were chosen using the method suggested by Snoek [snoek]. Cross-entropy was used as the loss function, together with the ReLU transfer function and a dropout of 0.5. The convolutional network has six layers and for classification we use the softmax transformation. The network was trained with SGD for a total of 150 epochs. As inputs to the CNN we used two SOM fingerprint maps: one for the protein and one for the compound.
We compare our model with a recent model called CSNAP (Chemical Similarity Network Analysis Pulldown) [review]. This method addresses the problem that most target identification methods are limited to single-ligand analyses: it clusters diverse chemical structures into distinct sub-networks corresponding to chemotypes to improve target prediction accuracy, achieving considerable improvement over traditional methods.
We tested our SOME model on the prediction of targets in a subset of annotated compounds. We used 100 ligands from 6 target-specific drug classes with known target annotations as a way to validate the method. As in [review], we used a chemical search criterion with a Z-score cutoff of 2.5 and a target class point of 0.85. The overall prediction accuracy was 85%, which is slightly below the 89% accuracy obtained by CSNAP. Note, however, that no information about chemical properties was used, just the embeddings. The performance of SOME could be improved if we enriched the inputs with contextual chemical information about the compounds, along the same lines as [review].
7 Conclusions and future work

We proposed SOME, a graph completion and visualisation algorithm, and applied it to biomedical data. SOME allows the exploration of KGs in a semantically meaningful representation and can process queries that are impossible in traditional frameworks, for instance "Chemical + Gene - Disease + Disease = ?", or "Chemical is to Gene as Chemical is to ?".
Fingerprint matching is very useful for exploring the high-dimensional data, since every entity can be projected into the global semantic space of the SOM map, producing a unique activation pattern. We can simply add or remove features (pixels) from one entity and see what the implication is in terms of its relations to other entities (diseases, for instance).
We showed that the proposed visual exploration model achieves comparable performance on the protein docking problem while using much less information and completely abstracting away the chemical nature and composition of both elements.
As future work we would like to explore hierarchical clustering of SOMs to allow the user to navigate through several levels of granularity when exploring the data. For the particular case of biomedical data, the inputs could be enriched with chemical context. We would also like to include the full semantic context for diseases involving several genes, so that the system may support a more integrated approach.