Knowledge graphs are knowledge bases that represent domain knowledge as interlinked entities, which form the nodes of a graph. Driven by the recent ‘explosion’ of data, many corporations and academic institutions rely on knowledge graphs to model and analyse large amounts of data. In this work, we use graph embeddings [7,12] to predict possible relationships between drugs in a drug knowledge base represented as a knowledge graph. We rely on graph embeddings for two tasks on this database: relation prediction, performed as link prediction, and drug-drug similarity, built on the node similarity concepts described in this work. Drug similarity analysis is time consuming and expensive because it depends on costly medical equipment and on medical professionals trained to handle that equipment. We aim to use knowledge graphs to produce drug similarity estimates and predictions in less time and at a lower cost than previous techniques. The resulting drug similarity model supports the discovery of similar drugs, which can help reduce the side effects caused by the use of alternative similar drugs.
2 Related Work
Typically, biological knowledge graphs are built from manually curated datasets such as MIMIC-III, ICD-9, and others. Another option is to use natural language processing to reduce the effort of manual information collection. Today, knowledge graphs are used in a variety of biomedical applications such as genomics, proteomics, drug side effects, drug repurposing, safe drug recommendation, and many more, which indicates their popularity in this field. One line of research explores representation learning in knowledge graphs for the study of drug target prediction and drug-drug interactions, and another uses knowledge graph embeddings to perform link prediction and drug target discovery, also on the KEGG database. A further study uses a model based on LSTM and knowledge graph embeddings to predict drug-drug interactions. Part of our work is inspired by earlier work on making inferences with learned embedding models. The graph embedding models we explore have been described and compared in earlier surveys.
3 KEGG Database
KEGG (Kyoto Encyclopedia of Genes and Genomes) is a knowledge base that contains genetic, chemical, and functional information. It contains entity types such as disease, gene, network, pathway, and drug, and each entity type is linked to the others through specific relationships, as shown in the image below. KEGG was developed by Kanehisa Laboratories (https://www.kanehisa.jp/) and is structured as a network of interconnected entities that resembles the biological ecosystem at the molecular level. The database comprises five types of entities. The first is drugs, a comprehensive drug information resource that exclusively covers pharmaceuticals approved in Europe, Japan, and the United States, with information such as molecular interactions, drug metabolism, and chemical structure. The second is genes, an amalgamation of genes and proteins with information about gene sequences and their interactions with other biological entities. The third is diseases, a collection of single-gene disorders, multifactorial diseases, and infectious diseases with a focus on perturbation, which captures disease interactions with other entities. The fourth is pathways, which comprises manually curated pathway maps containing data on metabolism, biological processes, human diseases, drug development, and other topics; each pathway is linked to entities such as diseases, drugs, and genes. The fifth is networks: by linking distinct entities in the other KEGG databases, the KEGG Network database represents information about drugs and diseases in the form of molecular networks.
4.1 Knowledge Graph and Graph Embeddings
A knowledge graph is defined as “a graph of data intended to accumulate and convey knowledge of the real world, whose nodes represent entities and whose edges represent relations between these entities.” Formally, a knowledge graph consists of triplets, each made of two entities (head and tail) connected by an edge (relationship). A triplet is written as (head, relationship, tail) or $(h, r, t)$ with $h, t \in E$ and $r \in R$, where $E$ is the set of entities and $R$ is the set of all possible relationships that can exist between any two entities. For example, KEGG contains instances like (D11034, DRUG_EFFICACY_DISEASE, H00409), which states that the drug and the disease are connected by that specific relationship in the database. Embedding learning is an efficient way to tackle data sparseness: it represents the entities and relationships of a knowledge graph as low-dimensional real-valued vectors that encode the structural characteristics of the graph. Prior research divides graph embedding models into three families: translational distance models, which use distance-based scoring functions; semantic matching models, which rely on similarity-based scoring functions; and neural network based models.
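As a small illustration of the triplet representation, the sketch below stores a few $(h, r, t)$ statements and derives the entity set $E$, the relation set $R$, and the integer indices that embedding models typically look up. The first triple is the KEGG instance quoted above; the other two are statement strings borrowed from the prediction table in Section 5.4 purely as sample data, and all variable names are our own.

```python
# Sample (head, relationship, tail) triples; only the first one is
# quoted in the text as an actual KEGG instance, the rest are sample data.
triples = [
    ("D11034", "DRUG_EFFICACY_DISEASE", "H00409"),
    ("D04905", "DRUG_TARGET_PATHWAY", "hsa05010"),
    ("N00060", "NETWORK_GENE", "HSA:23401"),
]

# E: every identifier that appears as a head or a tail.
entities = sorted({e for h, _, t in triples for e in (h, t)})
# R: every relationship type.
relations = sorted({r for _, r, _ in triples})

# Integer indices, used to select rows of an embedding matrix.
entity_to_id = {e: i for i, e in enumerate(entities)}
relation_to_id = {r: i for i, r in enumerate(relations)}

# The same triples expressed as index tuples.
encoded = [(entity_to_id[h], relation_to_id[r], entity_to_id[t])
           for h, r, t in triples]

print(len(entities), "entities,", len(relations), "relations")
print(encoded)
```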
4.2 Link Prediction
Link prediction is the task of predicting a connection between two nodes based on their properties. Given a graph G in which each node represents some information, the goal is to develop a model that predicts whether two nodes are related by an edge or not [14, 15]. This can be used to find additional information for an existing dataset and can also be considered an information extraction task. A link prediction task on a drug-drug interaction knowledge graph, for example, can help reveal new information by suggesting probable links between any new pair of drugs. For link prediction, where every entity is considered as a candidate target entity for a triple in the test data, metrics such as Mean Reciprocal Rank (MRR) and Hits@n are used for evaluation. The model's scoring function is used to score and sort the correct triplet against its corrupted counterparts, and the metrics are computed from the resulting ranks.
MRR: Mean Reciprocal Rank is the mean of the reciprocals of the ranks in a vector of rankings. MRR is less sensitive to outliers than Mean Rank. MRR scores lie between 0 and 1, with 1 representing perfect ranking.
$$\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}$$
Here, $\mathrm{rank}_i$ refers to the rank position of the first relevant element for the i-th query and $|Q|$ is the number of queries.
Hits@n: It represents the likelihood that the model ranks the relevant (true) fact within the top n scores, reported as the fraction of elements in the vector of rankings that made it into the top n predictions.
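Both metrics are simple functions of the rank assigned to each true test triple. The following is a minimal sketch, assuming ranks start at 1; the rank values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def mean_reciprocal_rank(ranks):
    """Mean of 1/rank over all test queries (ranks start at 1)."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_n(ranks, n):
    """Fraction of test queries whose true triple is ranked in the top n."""
    return sum(1 for r in ranks if r <= n) / len(ranks)


# Hypothetical ranks of the true triple among its corrupted alternatives.
example_ranks = [1, 2, 4, 10, 50]

mrr = mean_reciprocal_rank(example_ranks)
print(f"MRR     = {mrr:.3f}")
for n in (1, 3, 10):
    print(f"Hits@{n:<2} = {hits_at_n(example_ranks, n):.2f}")
```

A single badly ranked triple (rank 50 here) contributes only 1/50 to the sum, which is why MRR is less sensitive to outliers than Mean Rank.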
4.3 Graph Embedding Algorithms
The key characteristic of graph embeddings is that they encode the complex graph structure and interactions, so that distances in the latent space serve as a measure of similarity between distinct graph elements. The following graph embedding models have been used:
TransE: An additive model that uses a distance-based scoring function for the link prediction task, treating the edges of the graph as translations in the embedding space. The scoring function measures how close the embedding of the head, translated by the embedding of the relationship, is to the embedding of the tail. Mathematically, the scoring function is defined below using the $L_1$ or $L_2$ norm:
$$f_{\mathrm{TransE}}(h, r, t) = -\lVert h + r - t \rVert_{n}, \quad n \in \{1, 2\}$$
Because TransE uses a simple translation operation that forces one-to-one mappings, it does not perform well on relations that are one-to-many, many-to-one, or many-to-many.
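The translation idea can be sketched in a few lines. The example below assumes the score is the negative $L_1$ or $L_2$ distance between the translated head and the tail, so a higher score means a more plausible triple; the embedding values are hypothetical.

```python
def transe_score(h, r, t, norm=1):
    """Negative distance between (h + r) and t; closer to 0 is better."""
    diffs = [hi + ri - ti for hi, ri, ti in zip(h, r, t)]
    if norm == 1:
        return -sum(abs(d) for d in diffs)
    return -sum(d * d for d in diffs) ** 0.5


# Hypothetical 2-dimensional embeddings.
h = [1.0, 2.0]
r = [0.5, -1.0]
t_good = [1.5, 1.0]   # exactly h + r, so the distance is zero
t_bad = [3.0, 3.0]    # far from h + r

print(transe_score(h, r, t_good, norm=1))
print(transe_score(h, r, t_bad, norm=1))
print(transe_score(h, r, t_bad, norm=2))
```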
DistMult: The algorithm uses bilinear diagonal modelling with the trilinear dot product as its scoring function, defined as follows:
$$f_{\mathrm{DistMult}}(h, r, t) = \langle h, r, t \rangle = \sum_{i} h_i \, r_i \, t_i$$
Here h, r and t are the embeddings of the head, relationship and tail respectively. The score function has a limitation: it can only model symmetric relationships, because it captures only the pairwise interactions between components of h and t along the same dimension, so that $f(h, r, t) = f(t, r, h)$.
ComplEx: This model represents entities and relationships as complex-valued vector embeddings, each consisting of a real and an imaginary vector. A fact (h, r, t) is scored with the asymmetric score function defined below:
$$f_{\mathrm{ComplEx}}(h, r, t) = \mathrm{Re}(\langle h, r, \bar{t} \rangle) = \mathrm{Re}\Big(\sum_{i} h_i \, r_i \, \bar{t}_i\Big)$$
Here $\bar{t}$ is the complex conjugate of t and $\mathrm{Re}(\cdot)$ returns the real part of a complex value. The algorithm is an extension of DistMult to complex embeddings, which leads to better modelling of asymmetric relations because the scoring function is asymmetric. The embeddings of h, r, and t no longer lie in real space but in complex space. DistMult performs well on symmetric relations, while ComplEx also handles antisymmetric relations well.
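The difference between the two scoring functions shows up when head and tail are swapped. The sketch below assumes the trilinear product for DistMult and the real part of the product with the conjugated tail for ComplEx; the embedding values are hypothetical.

```python
def distmult_score(h, r, t):
    """Trilinear dot product of real-valued embeddings."""
    return sum(hi * ri * ti for hi, ri, ti in zip(h, r, t))


def complex_score(h, r, t):
    """Real part of the trilinear product with the conjugate of the tail."""
    return sum(hi * ri * ti.conjugate() for hi, ri, ti in zip(h, r, t)).real


# Hypothetical real embeddings: swapping head and tail changes nothing.
h_real, r_real, t_real = [1.0, 2.0], [0.5, -1.0], [3.0, 0.5]
print(distmult_score(h_real, r_real, t_real),
      distmult_score(t_real, r_real, h_real))

# Hypothetical complex embeddings: the swapped triple scores differently.
h_c = [1 + 2j, 0.5 - 1j]
r_c = [0 + 1j, 1 + 1j]
t_c = [2 - 1j, -1 + 0.5j]
print(complex_score(h_c, r_c, t_c), complex_score(t_c, r_c, h_c))
```

The imaginary part of the relationship embedding is what breaks the symmetry; if every entry of `r_c` were purely real, the ComplEx score would be symmetric as well.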
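HolE: Holographic embeddings score a triple by combining the head and tail embeddings with circular correlation and matching the result against the relationship embedding:
$$f_{\mathrm{HolE}}(h, r, t) = r^{\top} (h \star t)$$
where the circular correlation operator $\star$ for d-dimensional embeddings is defined as:
$$[h \star t]_k = \sum_{i=0}^{d-1} h_i \, t_{(k+i) \bmod d}$$
HolE combines the simplicity of TransE with the expressive power of the tensor product to generate better embeddings. It is able to deal with asymmetric relationships because circular correlation is not commutative, i.e. $h \star t \neq t \star h$.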
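The non-commutativity of circular correlation is easy to check numerically. The following sketch uses the standard definition $[a \star b]_k = \sum_i a_i b_{(k+i) \bmod d}$ and, as an assumption, scores a triple as the dot product of the relationship embedding with $h \star t$; all vectors are hypothetical.

```python
def circular_correlation(a, b):
    """[a * b]_k = sum_i a[i] * b[(k + i) mod d] for vectors of length d."""
    d = len(a)
    return [sum(a[i] * b[(k + i) % d] for i in range(d)) for k in range(d)]


def hole_score(h, r, t):
    """Dot product of the relationship embedding with corr(h, t)."""
    return sum(ri * ci for ri, ci in zip(r, circular_correlation(h, t)))


# Hypothetical 3-dimensional embeddings.
h = [1, 2, 0]
t = [0, 1, 3]
r = [1, 0, 1]

print(circular_correlation(h, t))  # correlation of head with tail
print(circular_correlation(t, h))  # arguments swapped
print(hole_score(h, r, t), hole_score(t, r, h))
```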
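ConvE: Due to its neural network structure, it is better at modelling non-linear transformations. It uses 2D convolution, which is better at extracting interactions between embeddings than a 1D convolution. The scoring function is defined as:
$$f_{\mathrm{ConvE}}(h, r, t) = g\big(\mathrm{vec}\big(g([\bar{h}; \bar{r}] * \omega)\big) W\big) \, t$$
Here, g is the non-linear activation function, vec flattens the convolution feature maps into a vector, * is the linear convolution operator, $\omega$ denotes the convolution filters and W is the weight matrix. Also, $\bar{h}$ and $\bar{r}$ are the 2D reshapings of the head entity embedding and the relationship embedding respectively. Binary cross-entropy (BCE) is used as the loss function:
$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \big( y_i \log p_i + (1 - y_i) \log (1 - p_i) \big)$$
where $p_i$ is the predicted probability and $y_i$ is the binary label of the i-th candidate.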
ConvKB: Unlike ConvE, it uses 1D convolution and models the relationships among same-dimensional entries of the embeddings, which lets it generalize the translational characteristics of translation-based models. Every triple with k-dimensional embeddings is represented as a matrix $A = [h, r, t] \in \mathbb{R}^{k \times 3}$, whose i-th row is denoted $A_{i,:}$. A filter $\omega \in \mathbb{R}^{1 \times 3}$ is applied to every row of the matrix to examine the global relationships among same-dimensional entries, which enhances the translational characteristics of the algorithm.
The feature map $v = [v_1, \dots, v_k]$ is generated as follows, where b is a bias term and g is the activation function:
$$v_i = g(\omega \cdot A_{i,:} + b)$$
Finally, these feature maps are used in the scoring function of ConvKB:
$$f_{\mathrm{ConvKB}}(h, r, t) = \mathrm{concat}\big(g([h, r, t] * \Omega)\big) \cdot w$$
where * is the convolution operation and $\Omega$ (the set of filters) and $w$ (the weight vector) are shared parameters.
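To see how row-wise filters relate to the translational models, the sketch below applies 1×3 filters to each row of the matrix formed by h, r and t, passes the results through ReLU, concatenates the feature maps and takes a dot product with a weight vector. The filters, bias and weights are hand-picked, hypothetical values rather than trained parameters: with filters (1, 1, -1) and (-1, -1, 1) and weights of -1, the score reduces to the negative $L_1$ distance between h + r and t.

```python
def relu(x):
    return max(0.0, x)


def convkb_score(h, r, t, filters, bias, w):
    """Apply each 1x3 filter to every row of [h, r, t], then dot with w."""
    rows = list(zip(h, r, t))  # k rows, each (h_i, r_i, t_i)
    features = []
    for omega in filters:      # one feature map of length k per filter
        for row in rows:
            conv = sum(o * a for o, a in zip(omega, row)) + bias
            features.append(relu(conv))
    return sum(f * wi for f, wi in zip(features, w))


# Hypothetical parameters that mimic a translation-based model.
filters = [(1.0, 1.0, -1.0), (-1.0, -1.0, 1.0)]
bias = 0.0
w = [-1.0] * 4                 # 2 filters x 2 embedding dimensions

h, r = [1.0, 2.0], [0.5, -1.0]
t_good, t_bad = [1.5, 1.0], [3.0, 3.0]

print(convkb_score(h, r, t_good, filters, bias, w))
print(convkb_score(h, r, t_bad, filters, bias, w))
```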
4.4 Visualizing Knowledge Graph
The visual below represents a section of the original KEGG knowledge graph, where different entity types are denoted using different colours. It can be observed that a few entities act as bridges connecting multiple clusters in the graph. The visualization makes clear that the number of connections per entity is highly imbalanced, with a few nodes connected to a large share of the other nodes and forming clusters of their own.
4.5 Novel Node Similarity Measure Using Graph Embedding and Link Prediction Techniques
We use the cosine similarity between embeddings together with information gathered from the link prediction task, so that the similarity measure captures several aspects of similarity. The link prediction outputs are transformed into probabilities using calibration, where a higher link prediction probability represents a higher chance of a link existing between two nodes. We exploit the fact that if two drugs are similar, their interactions with other entities in the dataset should be similar, i.e. their chances of linking with other entities should follow similar trends. Our aim is to find a measure that provides a real-valued output showing the extent of similarity between two entities; in our use case the two entities are drugs, so the system acts as a drug similarity system. The cosine similarity between two graph embeddings $E_A$ and $E_B$, each of length m, is defined as follows:
$$\cos(E_A, E_B) = \frac{\sum_{i=1}^{m} E_{A,i} \, E_{B,i}}{\sqrt{\sum_{i=1}^{m} E_{A,i}^{2}} \, \sqrt{\sum_{i=1}^{m} E_{B,i}^{2}}}$$
A loss function is defined to measure the difference between the link prediction results of any two drugs with respect to all other possible entities in the dataset.
$$MSE(A, B) = \frac{1}{n} \sum_{i=1}^{n} (p_{A,i} - p_{B,i})^{2}$$
Here $p_{A,i}$ represents the link prediction probability of drug A with entity i, $p_{B,i}$ is the same quantity for drug B, and n is the number of entities taken into account. The MSE gives a sense of the difference in how the two drugs interact with other entities. The final similarity measure combines the embedding-based cosine similarity with the link prediction loss. The novel similarity measure is defined as their ratio:
$$Sim(A, B) = \frac{\cos(E_A, E_B)}{MSE(A, B)}$$
Here, MSE is the link prediction loss, $E_A$ and $E_B$ are the embeddings of drugs A and B, and Sim represents the overall similarity score between the pair of drugs.
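A minimal sketch of the combined measure follows. It assumes the final score is the ratio of cosine similarity to link prediction MSE, which is consistent with the ratio scores reported in Section 5.5; the function names, embeddings and probabilities are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def link_prediction_mse(p_a, p_b):
    """Mean squared difference of link probabilities over shared entities."""
    return sum((x - y) ** 2 for x, y in zip(p_a, p_b)) / len(p_a)


def drug_similarity(emb_a, emb_b, p_a, p_b):
    """Ratio score: high cosine and low MSE both increase the similarity.

    Note: undefined when the two probability profiles are identical (MSE = 0).
    """
    return cosine_similarity(emb_a, emb_b) / link_prediction_mse(p_a, p_b)


# Hypothetical embeddings and link probabilities against three entities.
emb_a, emb_b = [1.0, 0.0], [1.0, 1.0]
p_a, p_b = [0.9, 0.2, 0.5], [0.8, 0.3, 0.5]

print(cosine_similarity(emb_a, emb_b))
print(link_prediction_mse(p_a, p_b))
print(drug_similarity(emb_a, emb_b, p_a, p_b))
```

Because the MSE sits in the denominator, small differences in link prediction behaviour dominate the score, which explains why the ratio values in the results are several orders of magnitude larger than the cosine values.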
5.1 Model Settings
Like any other machine learning model, the performance of the embeddings depends on the quantity and quality of the data used and on the choice of hyperparameters. Our experiments use embedding sizes of 100, 200 and 300 with the loss functions Multiclass NLL Loss and Binary Cross Entropy Loss. The optimizers considered are Adam (Adaptive Moment Estimation) and Stochastic Gradient Descent, with learning rates 1e-1 and 1e-3. The regularization techniques tested are L1 (Lasso Regression), L2 (Ridge Regression) and L3 (Nuclear 3-norm, proposed in the ComplEx-N3 paper), with regularization constants 1e-5, 1e-2 and 1e-1. Additional hyperparameters apply to the convolution-based neural network models ConvE and ConvKB. These are kept fixed: 32 feature maps per convolution kernel, a convolution kernel size of 3, and dropout of 0.2, 0.3 and 0.2 at the embedding layer, convolution layer and dense layer respectively.
5.2 Performance of Algorithms on Link Prediction Task
The six graph embedding models are trained on KEGG data, split 80:20 into train and test sets, and evaluated on the link prediction task using MRR and Hits@k.
ComplEx performed better than all other algorithms on the link prediction task, providing the best MRR and Hits@k scores. Surprisingly, the more complex convolutional models, ConvE and ConvKB, performed poorly on the KEGG dataset. Overall, the best scores were obtained by ComplEx with an embedding size of 300, the Adam optimizer with a learning rate of 1e-3, Multiclass NLL Loss, and L3 regularization with a constant of 1e-2: an MRR of 0.46 and Hits@10, Hits@3 and Hits@1 of 0.64, 0.49 and 0.37 respectively.
5.3 Visualizing Different Entity Type in 2D Embedding Space
The dimensionality reduction method PCA (Principal Component Analysis) is used to reduce the high-dimensional embeddings to a lower-dimensional space that is easier to visualise. The embeddings are transformed to 2D space and plotted, with the entity types present in the dataset (drug, disease, gene, network and pathway) marked in different colours for a better understanding of the relationships between entities. It can be observed that the graph embedding algorithm generated embeddings such that different entity types occupy different sections of the plot, which indicates that the embeddings encode the type and characteristics of each entity.
5.4 Insights from Link Predictions
The learned model is used to find possible new, unseen relationships between entities and so provide meaningful insights. For each statement, the model outputs its rank and its score, and the score is normalized to the range from 0 to 1 to give a sense of comparative probability across statements. The transformation of the scores (real numbers) to probabilities (between 0 and 1) is performed using the expit transformation, $\mathrm{expit}(x) = 1 / (1 + e^{-x})$, which maps any real number x to a value in (0, 1). The model used for the predictions below is the ComplEx model that provided the best scores: an MRR of 0.46 and Hits@10, Hits@3 and Hits@1 of 0.64, 0.49 and 0.37 respectively.
| Statement | Rank | Score | Probability |
| --- | --- | --- | --- |
| hsa04024 PATHWAY_GENE HSA:51196 | 236 | 3.704783 | 0.975985 |
| D11034 DRUG_EFFICACY_DISEASE H00409 | 2 | 4.851221 | 0.992242 |
| D04905 DRUG_TARGET_PATHWAY hsa05010 | 1 | 4.979891 | 0.993172 |
| N00060 NETWORK_GENE HSA:23401 | 1 | 5.399962 | 0.995504 |
| hsa04024 PATHWAY_GENE HSA:6336 | 19814 | -0.21121 | 0.447393 |
| D11056 DRUG_TARGET_GENE HSA:7388 | 7636 | 0.089212 | 0.522288 |
| D11056 DRUG_TARGET_GENE HSA:3352 | 133 | 2.823472 | 0.943931 |
| N00399 NETWORK_GENE HSA:9217 | 27812 | -0.44892 | 0.389616 |
| D04905 DRUG_TARGET_PATHWAY hsa04728 | 25 | 4.369776 | 0.987504 |
| N00399 NETWORK_GENE N00399 | 16476 | -0.05818 | 0.485458 |
| H00242 DISEASE_GENE D11034 | 28037 | -0.20788 | 0.448214 |
| N00060 DRUG_TARGET_PATHWAY hsa04380 | 32017 | -0.92776 | 0.283378 |
The table shows that the model assigns fairly high probabilities to a few statements. The probability of a relationship between D11056 (Drug name: Mirtazapine hydrate) and HSA:3352 (Gene name: HTR1D, 5-HT1D, HT1DA, HTR1DA, HTRL, RDC4) is quite high under this model at 0.94, which suggests a possible connection between these two entities. Similarly, the relationship between D04905 (Drug name: Memantine hydrochloride) and pathway hsa04728 has a high connection probability of 0.987 under the model.
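The expit transformation used to turn scores into probabilities is the logistic function, and it can be checked against the reported values. The sketch below implements it with the standard library and applies it to three score values copied from the table; the dictionary name is our own.

```python
import math


def expit(x):
    """Logistic function: maps any real score to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


# Scores copied from the prediction table above.
table_scores = {
    "hsa04024 PATHWAY_GENE HSA:51196": 3.704783,
    "D11056 DRUG_TARGET_GENE HSA:3352": 2.823472,
    "hsa04024 PATHWAY_GENE HSA:6336": -0.21121,
}

for statement, score in table_scores.items():
    print(f"{statement}: score={score:+.6f} prob={expit(score):.6f}")
```

A score of 0 maps to a probability of 0.5, so negative scores in the table correspond to statements the model considers less likely than not.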
5.5 Drug Similarity System Using Graph Embeddings and Link Prediction
Graph embeddings generated by the trained model are used to determine the possible similarity of two drugs. A loss function, the mean squared error, measures the difference between the link prediction scores of drug A with other entities and those of drug B with other entities. The new similarity measure combines the cosine similarity between the embeddings of the pair of drugs with the link prediction mean squared error, as described in Section 4.5. The top 10 drugs most similar to D00043 (Drug name: Isoflurophate Fluostigmine), sorted by the novel similarity measure, are listed below:
| Rank | Drug ID | Cosine similarity | Link prediction MSE | Similarity score (ratio) | Drug name |
| --- | --- | --- | --- | --- | --- |
| 1 | D01228 | 0.933001 | 0.000289 | 3223.509127 | Distigmine bromide (JP18/INN) |
| 3 | D03751 | 0.943170 | 0.000324 | 2915.059617 | Icopezil maleate (USAN) |
| 4 | D02418 | 0.944881 | 0.000328 | 2884.443301 | Physostigmine salicylate (JAN/USP) |
| 5 | D06288 | 0.918739 | 0.000339 | 2706.522993 | Velnacrine maleate (USAN) |
| 6 | D00469 | 0.928133 | 0.000347 | 2677.510252 | Pralidoxime chloride (USP) |
| 7 | D05981 | 0.934112 | 0.000366 | 2555.506934 | Suronacrine maleate (USAN) |
| 9 | D03826 | 0.934139 | 0.000388 | 2404.652320 | Physostigmine sulfate (USP) |
| 10 | D02068 | 0.915335 | 0.000388 | 2357.032571 | Tacrine hydrochloride (USP) |
It can be observed that D01228 (Drug name: Distigmine bromide (JP18/INN)) has the highest ratio score with respect to drug D00043 (Drug name: Isoflurophate Fluostigmine) due to its high cosine similarity and low link prediction mean squared error, indicating a higher probability of behaving similarly to D00043 when interacting with other entities. Both drugs belong to the Neuropsychiatric agent class, are part of the drug groups DG01595 (Drug group name: Cholinesterase inhibitor) and DG01593 (Drug group name: Acetylcholinesterase inhibitor), and target similar genes such as HSA:43 (Gene name: ACHE, ACEE, ARACHE, N-ACHE, YT), indicating a high similarity between the pair of drugs. For evaluation we use the Tanimoto coefficient, which measures the chemical similarity between molecules. It is defined below, where $S_{A,B}$ represents the molecular similarity between A and B, a is the number of on bits in A, b is the number of on bits in B, and c is the number of on bits in both A and B.
$$S_{A,B} = \frac{c}{a + b - c}$$
The Tanimoto coefficient values for the top drugs are 0.049, 0.056, 0.01, 0.046, 0.014, 0.021, 0.012, 0.77, 0.53. Chemical structure similarity cannot capture the full trend, because there are drugs that treat the same clinical problems but differ in structure, such as Miglitol and Glipizide, which are both used for diabetes but have completely different structures. Clinical drug similarity experiments would be a good choice for further evaluation, but they are outside the scope of this research.
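The Tanimoto coefficient on bit fingerprints can be sketched as follows, using the standard form c / (a + b - c) with a, b and c as defined above. Fingerprints are represented here as sets of on-bit positions; the positions are hypothetical, and returning 0 for two empty fingerprints is our own convention.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient c / (a + b - c) for sets of on-bit positions."""
    a = len(fp_a)            # on bits in A
    b = len(fp_b)            # on bits in B
    c = len(fp_a & fp_b)     # on bits shared by A and B
    if a + b - c == 0:       # both fingerprints empty (assumed convention)
        return 0.0
    return c / (a + b - c)


# Hypothetical fingerprints given as sets of on-bit positions.
fp_a = {1, 4, 7, 9}
fp_b = {1, 4, 8}

print(tanimoto(fp_a, fp_b))  # 2 shared bits out of 5 distinct on bits
print(tanimoto(fp_a, fp_a))  # identical fingerprints
```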
Overall, graph embedding models are a strong choice for tackling the problem of finding similar drugs, and the approach can be scaled by incorporating other medical datasets to provide the graph embedding model with larger and more connected databases. More data should help the models generalize better and overfit less, which in turn should improve performance on the link prediction task. Graph embedding models based on neural networks, convolutional neural networks and attention models form an exciting field that can help us learn more about complex graph structures and their nodes and edges. Representing the entities and relationships of a knowledge graph as low-dimensional embeddings can be applied to various graph structures to find new insights using operations like link prediction. This work contributes a machine learning pipeline that uses graph embeddings and link prediction algorithms to uncover drug similarity and capture unique relationships and possibilities in a biomedical database, with a systematic comparison between different graph embedding algorithms.
-  Joanne Bowes and Andrew J. Brown and Jacques Hamon and Wolfgang Jarolimek and Arun Sridhar and Gareth Waldron and Steven Whitebread: Reducing safety-related drug attrition: the use of in vitro pharmacological profiling. Nature Reviews Drug Discovery 11(12):909-22, 2012.
-  Hogan, Aidan, Eva Blomqvist, Michael Cochez, Claudia d’Amato, Gerard De Melo, Claudio Gutierrez, Sabrina Kirrane et al.: Knowledge graphs. ACM Computing Surveys (CSUR) 54(4):1-37, 2021.
-  David N. Nicholson and Casey S. Greene: Constructing knowledge graphs and their biomedical applications. Computational and Structural Biotechnology Journal 18:1414-1428, 2020.
-  Md. Rezaul Karim and Michael Cochez and Joao Bosco Jares and Mamtaz Uddin and Oya Beyan and Stefan Decker: Drug-Drug Interaction Prediction Based on Knowledge Graph Embeddings and Convolutional-LSTM Network. Proceedings of the 10th ACM international conference on bioinformatics, computational biology and health informatics. 2019.
-  Bishan Yang and Wen-tau Yih and Xiaodong He and Jianfeng Gao and Li Deng: Embedding Entities and Relations for Learning and Inference in Knowledge Bases. arXiv preprint arXiv:1412.6575, 2014.
-  Quan Wang and Zhendong Mao and Bin Wang and Li Guo: Knowledge Graph Embedding: A Survey of Approaches and Applications. IEEE Transactions on Knowledge and Data Engineering, 29 (12):2724–2743, 2017.
-  Meihong Wang and Linling Qiu and Xiaoli Wang: A Survey on Knowledge Graph Embeddings for Link Prediction. Symmetry 13(3):485, 2021.
-  Ilya Makarov and Dmitrii Kiselev and Nikita Nikitinsky and Lovro Subelj: Survey on graph embeddings and their applications to machine learning problems on graphs. PeerJ Computer Science 7:e357 https://doi.org/10.7717/peerj-cs.357, 2021.
-  Minoru Kanehisa and Miho Furumichi and Mao Tanabe and Yoko Sato and Kanae Morishima: KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Res. 45(D1):D353-D361, 2017.
-  Rajeev Verma and Preetam Kumar: Knowledge Graph Representation Learning Based Drug Informatics. Proceedings 2019 IEEE International Conference on Electronics, Computing and Communication Technologies (CONECCT), pages 1–4, 2019.
-  Sameh K. Mohamed and Aayah Nounu and Vít Nováček: Drug target discovery using knowledge graph embeddings. Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing (SAC 2019), 2019.
-  Ian T. Jolliffe and Jorge Cadima: Principal component analysis: a review and recent developments. Phil. Trans. R. Soc. A. 374:20150202, 2016.
-  Alberto García-Durán and Antoine Bordes and Nicolas Usunier and Yves Grandvalet: Combining Two And Three-Way Embeddings Models for Link Prediction in Knowledge Bases. J. Artif. Intell. Res. 55:715-742, 2016.
-  Muhan Zhang and Yixin Chen: Link Prediction Based on Graph Neural Networks. Proceedings 32nd Conference on Neural Information Processing Systems (NeurIPS 2018), 2018.
-  Zecheng Zhang and Danni Ma and Xiaohan Li: Link Prediction with Graph Neural Networks and Knowledge Extraction. CS230: Deep Learning, Spring 2020, Stanford University, CA.
-  Zhen Tan and Xiang Zhao and Yang Fang and Bin Ge and Weidong Xiao: Knowledge Graph Representation via Similarity-Based Embedding. Scientific Programming 2018:1-12, 2018.
-  Antoine Bordes and Nicolas Usunier and Alberto Garcia-Duran and Jason Weston and Oksana Yakhnenko: Translating Embeddings for Modeling Multi-relational Data. Proceedings Advances in Neural Information Processing Systems 26 (NIPS 2013), 2013.
-  Bryan Perozzi and Rami Al-Rfou and Steven Skiena: DeepWalk: Online Learning of Social Representations. Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2014.
-  Théo Trouillon and Johannes Welbl and Sebastian Riedel and Eric Gaussier and Guillaume Bouchard: Complex Embeddings for Simple Link Prediction. Proceedings of The 33rd International Conference on Machine Learning, PMLR 48:2071-2080, 2016.
-  Xiaofei Shi and Yanghua Xiao: Modeling Multi-mapping Relations for Precise Cross-lingual Entity Alignment. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019.
-  Maximilian Nickel and Lorenzo Rosasco and Tomaso Poggio: Holographic Embeddings of Knowledge Graphs. Proceedings of the AAAI Conference on Artificial Intelligence 30(1), 2016.
-  Tim Dettmers and Pasquale Minervini and Pontus Stenetorp and Sebastian Riedel: Convolutional 2D Knowledge Graph Embeddings. Proceedings Thirty-second AAAI Conference on Artificial Intelligence, 2018.
-  Dai Quoc Nguyen and Tu Dinh Nguyen and Dat Quoc Nguyen and Dinh Phung: A Novel Embedding Model for Knowledge Base Completion Based on Convolutional Neural Network. Proceedings The 2018 Conference of the North American Chapter of the Association for Computational Linguistics, 2018.
-  Timothée Lacroix and Nicolas Usunier and Guillaume Obozinski: Canonical Tensor Decomposition for Knowledge Base Completion. Proceedings of the International Conference on Machine Learning. PMLR, 2018.
-  Xian Zeng and Zheng Jia and Zhiqiang He and Weihong Chen and Xudong Lu and Huilong Duan and Haomin Li: Measure Clinical Drug–Drug Similarity Using Electronic Medical Records. International Journal of Medical Informatics 124:97-103, 2019.