Highly Scalable Tensor Factorization for Prediction of Drug-Protein Interaction Type

by   Adam Arany, et al.

The understanding of the type of inhibitory interaction plays an important role in drug design. Therefore, researchers are interested to know whether a drug has competitive or non-competitive interaction to particular protein targets. Method: to analyze the interaction types we propose factorization method Macau which allows us to combine different measurement types into a single tensor together with proteins and compounds. The compounds are characterized by high dimensional 2D ECFP fingerprints. The novelty of the proposed method is that using a specially designed noise injection MCMC sampler it can incorporate high dimensional side information, i.e., millions of unique 2D ECFP compound features, even for large scale datasets of millions of compounds. Without the side information, in this case, the tensor factorization would be practically futile. Results: using public IC50 and Ki data from ChEMBL we trained a model from where we can identify the latent subspace separating the two measurement types (IC50 and Ki). The results suggest the proposed method can detect the competitive inhibitory activity between compounds and proteins.



There are no comments yet.


page 1

page 2

page 3

page 4


Macau: Scalable Bayesian Multi-relational Factorization with Side Information using MCMC

We propose Macau, a powerful and flexible Bayesian factorization method ...

Predicting drug-target interactions via sparse learning

Drug-target interaction (DTI) prediction plays a very important role in ...

Bi-Level Graph Neural Networks for Drug-Drug Interaction Prediction

We introduce Bi-GNN for modeling biological link prediction tasks such a...

PADME: A Deep Learning-based Framework for Drug-Target Interaction Prediction

In silico Drug-target Interaction (DTI) prediction is an important and c...

STNN-DDI: A Substructure-aware Tensor Neural Network to Predict Drug-Drug Interactions

Motivation: Computational prediction of multiple-type drug-drug interact...

Large Scale Tensor Regression using Kernels and Variational Inference

We outline an inherent weakness of tensor factorization models when late...

Bayesian crack detection in ultra high resolution multimodal images of paintings

The preservation of our cultural heritage is of paramount importance. Th...

Code Repositories

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction

In pharmaceutical research two properties, affinity and potency, are commonly measured in-vitro. Affinity measures the concentration of the tested substance needed to occupy a given portion of the available binding sites, usually measured by the inhibitory constant which have an intuitive meaning: the free substance concentration in chemical equilibrium where the half of the available binding sites are occupied. This value is independent of the concentration of the protein.

Potency measures the concentration of the substance to reach a given level of effect. In case of inhibitory compounds most often measured by relative . Relative is the concentration of the substance, where the halfway activity between the baseline and the maximal inhibition is reached.

Assuming Michaelis-Menten kinetics, it is easy to see that if there is no interaction between the binding processes, even in the case of partial inhibitors. This can be true when there is only one substance binding to the binding site of interest, so the binding site of the inhibitor is allosteric to the binding site where the effect is measured, illustrated on the left panel of Fig. 1. If, however, there is competition between the inhibitor and an endogenous ligand (illustrated on the right panel of Fig. 1), will be dependent of the concentrations of the endogenous ligand and its interaction with the protein. The relation in the logarithmic scale has the form [1]:


where , , is the concentration of the endogenous substrate , is the Michaelis constant of the enzyme with respect to , and is an additive constant with respect to , depends on the experimental conditions. This form holds in the case of uncompetitive inhibition as well.

Fig. 1: Two main types of inhibition.

We want to build a Bayesian model for the Eq. 1 by jointly modeling both and , using data from different proteins and compounds. The joint model can capture whether the deviation term is zero (corresponding to non-competitive interaction) or non-zero (corresponding to competitive interaction). Therefore, we propose a model that uses low-rank approximation of the 3-way tensor where is the measured inhibition of protein with compound for the given inhibition measure (either or ):


where are the respective

-dimensional latent vectors and

is a vector of ones. The tensor is thin as there are only two measure types.

It is a natural property of tensor factorization that additive and uncorrelated effects will occupy different latent dimensions. This is the case for the Eq. 1 and, thus, we expect few specific latent dimensions to capture the difference term .

I-a Side Information

However, one important issue with the factorization (2) is that the tensor is very sparsely observed with less than 1% measured values. To solve this issue we propose to use 2D ECFP chemical features as a side information for compounds while performing the factorization (2). The ECFP features represent the substructures of the compound, which are strongly related to its interaction properties. The proposed model learns a link matrix to predict latent vectors from features , where is the number of side information features. As will become evident from the experiments the incorporation of ECFP features gives a very strong improvement to the accuracy of the model. The general setup for the model is depicted in Fig. 2.

Fig. 2: Tensor factorization for and modeling with side information on compounds.

To sample from this model we develop a Bayesian factorization approach Macau based on Gibbs sampling. The novelty of Macau is that in can scale well with respect to the number of features . This is crucial for the current application as the dimension of the ECFP features is more than 100,000. In contrast, the recent sampling-based Bayesian factorization methods Bayesian Matrix Factorization with Side Information (BMFSI) [2] and Bayesian Canonical PARAFAC with Features and Networks (BCPFN) [3] require explicitly computing the covariance matrix of size for linking the features to the factorization. Additionally, to sample the link vector Cholesky decomposition is necessary, taking at least , which makes these methods intractable for high-dimensional feature spaces.

Instead, Macau samples the link matrix without explicitly computing its covariance matrix by using a specially designed noise injection sampler that we describe in the Section III-A.

Ii Statistical Model

In this section we first describe the statistical model, and explain related work in tensor factorization literature.

Ii-a General Setup

The observation noise is assumed to be Gaussian, as was proposed in Bayesian Probabilistic Matrix Factorization [4],



is the normal distribution,

is the set of tensor cells whose value has been observed and is the noise precision.

Next we place a common multivariate Gaussian prior on the latent vectors of compounds


where and are the shared intercept vector and precision matrix for all compounds. The term allows the prior mean of to depend on the compound features and, therefore, have informed predictions also for compounds with very few or no measurements. We use analogous prior also for the latent vectors of proteins and measurements . For proteins we use basic sequence based features, described in more detail Section IV-A. Without the in the prior (4) our model is equivalent to Bayesian Probabilistic Tensor Factorization [5]. Originally, the adjustment of the prior mean was first proposed by Agarwal et al. [6] for non-Bayesian models. Recently, this side information adjustment was used for sparse tensor factorization in BCPFN [3].

Ii-B Scale Invariant Prior on the Link Matrix

The standard approach in the previous works, for example used by Rai et al. [3], has been to add a zero-mean Gaussian prior on , i.e., , where

is the identity matrix and

is precision value.

Instead, we propose a prior whose scale depends on , making the strength of the prior invariant to the scale of the latent variables:


where is Kroenecker product. The prior has advantage of being well suited for sampling, as we show in the next section.

Iii Gibbs Sampler

We develop a block Gibbs sampler for the model. For most variables the samplers are straight-forward, and due to space restrictions are left to an extended version of the paper [7]. The crucial question is how to sample whose dimensions in our experiments are but can be even larger for bigger datasets.

The main idea of our noise injection sampler is that we will construct a specific linear system whose solution corresponds to a sample from the conditional probability of

. For large feature dimension the linear system of size will be solved without explicitly computing it by using iterative solvers, which only require (sparse) matrix multiplication operations.

Iii-a Noise Injection Sampler for the Link Matrix

The conditional posterior of , used for Gibbs sampler, combines (4) and (5). Due to scale invariance of (5) we get


where for conciseness we have dropped the conditional terms after and use shorthands and . From this we can derive the Gaussian mean and precision of as


where . Naively sampling from Gaussian (7) would cost time. However, due to the special structure of the distribution (7) we can generate its sample by solving following linear systems


with respect to , where and are noise matrices where each row is drawn from . We call (8) the noise injection sampler as it adds noise to the right-hand sides of the linear systems. The correctness of the noise injection sampler is proved in extended version, see [7]. The RHS of the system (8) has dimension , i.e., there are total of systems, which can be solved independently, using embarrassingly parallel approach.

We propose to use conjugate gradient to solve (8). In its iterations conjugate gradient only requires multiplication of with a dense vector. This allows us to handle side information with millions of sparse features because the multiplication can be expressed as two sparse-matrix-vector-products and an addition:


In an analogous way, we introduced side information for the proteins.

The software implementation of Macau is freely available111https://github.com/jaak-s/BayesianDataFusion.jl.

Iv Experiments

Iv-a Data

The inhibitory activities were obtained from ChEMBL [8] version 19. The dimensions of the tensor are , and , with 59,280 observed values for , and 5,121 for .

We generate ECFP features of radius 3 for the compounds by using RDKit software[9]. These features are sparse and high dimensional with .

To have a basic description for proteins we use protein features based on information extracted directly from position-specific frequency matrices (PSFM). Accordingly, we link each protein to a 20 dimensional feature vector created by summing up each column in the PSFM profile as done in [10].

Iv-B Results

In this subsection we present empirical results on ChEMBL dataset. In all experiments we used Macau with latent dimensions.

Iv-B1 Effect of side information

We compare Macau with and without side information in a hold-out validation setting, which we repeat 10 times. On each run, we randomly select 20% of the

measurements as the test set. The mean and standard deviation of the RMSEs show that the side information plays a crucial role to have accurate predictions for

, see Table I.

Method RMSE (Stdev)
Macau with SI 0.5970 (0.0054)
Macau without SI 0.8703 (0.0085)
TABLE I: Effect of side information on RMSE (on test set)

Iv-B2 Latent dimensions capturing the difference term

We compare the latent vectors of and , i.e. and . Figure 3 depicts the normalized222For normalization both and are multiplied by Euclidean norms of compounds and proteins . values of latent dimensions in the two vectors and , from a single posterior sample. As we can see from the figure, in three latent dimensions, colored green, red and yellow, the values corresponding to and differ significantly. The difference term in (1) is captured by these three latent dimensions, as expected.

As can be seen from the animation333Animation is available at https://youtu.be/RSrVfUwYYw4, the behavior of these 3 latent dimensions through the posterior samples is stable and is qualitatively identical to Figure 3. Next we investigate the predictive properties of these 3 latent dimensions.

Fig. 3: Normalized latent dimensions of measurement types and .

Iv-B3 Prediction of dominant interaction types for proteins

We used the identified three latent dimensions from the previous section to rank the proteins according to their predicted binding behavior. Let be an indicator vector where -th element is 1 exactly if the -th latent vector is one of the 3 selected. We computed the absolute difference between the predicted and values

where is the latent vector for , and is for . Since our model is Bayesian, we average over all posterior samples, as in the previous sections, and denote it by .

Next, we computed the 0.95 quantile of

for each protein

, which is a robust estimate for the maximal difference term

in (1) for protein . We order all the proteins according to this quantile, and extracted the top and the bottom 10 of this list. We expect that the top 10 proteins, shown in Table II, to have strong competitive interactions. As can be seen from the table, all 10 are enzymes, which makes it probable that our expectation is correct, because enzymes usually have some strongly competitive inhibitors. Secondly, we expect the bottom 10 proteins, shown in Table III, to be generally non-competitive, which also seems to be the case because there are 7 receptors and 2 ion channels. To understand what measures for receptors, one has to know how the assays are designed. In the assays of receptors and ion channels, the inhibitory response is measured on the intracellular end of the signal, which is non-competitive, even though on the extracellular site there could be competition for the site. An alternative explanation is that the proposed model detects deviations from Michaelis-Menten kinetics.

As the tables were constructed only using the selected 3 latent dimensions, it shows that the tensor factorization has the ability to extract the differential term into few latent dimensions.

Iv-B4 Prediction of compound-protein interaction types

In this experiment we try a challenging task to predict if a compound-protein pair have competitive interaction (). We select 30 compound-protein pairs to the test set, for which both and are available. 10 pairs are randomly selected from the top 100 pairs having the highest positive difference , next 10 from the 100 pairs having the highest negative difference, and the final 10 from the 100 pairs having the least absolute difference. From the Macau predictions (using all latent dimensions) we compute the difference for the pairs in the test set.

Macau is able to capture signal from the data as the mean predicted absolute difference,

, for the competitive pairs is 0.671 and for the non-competitive pairs is 0.234. The difference is statistically significant at 1% level (with t-test p-value of 0.0047).

ChEMBL ID Protein name
CHEMBL284 Dipeptidyl peptidase IV
CHEMBL325 Histone deacetylase 1
CHEMBL260 MAP kinase p38 alpha
CHEMBL1865 Histone deacetylase 6
CHEMBL1937 Histone deacetylase 2
CHEMBL289 Cytochrome P450 2D6
CHEMBL4005 PI3-kinase p110-alpha subunit
CHEMBL1978 Cytochrome P450 19A1
CHEMBL2581 Cathepsin D
CHEMBL4793 Dipeptidyl peptidase IX
TABLE II: TOP10 Proteins with predicted competitive interaction
ChEMBL ID Protein name
CHEMBL3772 Metabotropic glutamate receptor 1
CHEMBL5145 Serine/threonine-protein kinase B-raf
CHEMBL3663 Growth factor receptor-bound protein 2
CHEMBL4641 Voltage-gated T-type calcium channel alpha-1G subunit
CHEMBL3230 Sphingosine 1-phosphate receptor Edg-6
CHEMBL2001 Purinergic receptor P2Y12
CHEMBL1785 Endothelin receptor ET-B
CHEMBL287 Sigma opioid receptor
CHEMBL3227 Metabotropic glutamate receptor 5
TABLE III: TOP10 Proteins with predicted non-competitive interaction

V Conclusion

In this paper we presented a novel scalable Bayesian tensor factorization method with side information in a novel application for the prediction of drug-protein interaction types.

We showed that the method can capture the difference between binding affinity and potency from highly sparse data. With this model we can predict the dominant interactions types of the proteins and the competitiveness of the inhibition between compound-protein pairs.

The future work is to execute this factorization model on large scale industrial dataset and make it directly usable for drug design process. We also plan to extend the model for slow tight-binding inhibitors.


Authors thank Thierry Louat for insightful discussions. Jaak Simm, Adam Arany, Pooya Zakeri and Yves Moreau are funded by Research Council KU Leuven (CoE PFV/10/016 SymBioSys) and by Flemish Government (IOF, Hercules Stitching, iMinds Medical Information Technologies SBO 2015, IWT: O&O ExaScience Life Pharma; ChemBioBridge, PhD grants).


  • [1] C. Yung-Chi and W. H. Prusoff, “Relationship between the inhibition constant (k i) and the concentration of inhibitor which causes 50 per cent inhibition (i 50) of an enzymatic reaction,” Biochemical pharmacology, vol. 22, no. 23, pp. 3099–3108, 1973.
  • [2] I. Porteous, A. U. Asuncion, and M. Welling, “Bayesian matrix factorization with side information and dirichlet process mixtures.” in AAAI, 2010.
  • [3] P. Rai, Y. Wang, and L. Carin, “Leveraging features and networks for probabilistic tensor decomposition,” in

    AAAI Conference on Artificial Intelligence

    , 2015.
  • [4]

    R. Salakhutdinov and A. Mnih, “Bayesian probabilistic matrix factorization using markov chain monte carlo,” in

    Proceedings of the 25th international conference on Machine learning.   ACM, 2008, pp. 880–887.
  • [5] L. Xiong, X. Chen, T.-K. Huang, J. G. Schneider, and J. G. Carbonell, “Temporal collaborative filtering with bayesian probabilistic tensor factorization.” in SDM, vol. 10.   SIAM, 2010, pp. 211–222.
  • [6] D. Agarwal and B.-C. Chen, “Regression-based latent factor models,” in Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining.   ACM, 2009, pp. 19–28.
  • [7] J. Simm, A. Arany, P. Zakeri, T. Haber, J. K. Wegner, V. Chupakhin, H. Ceulemans, and Y. Moreau, “Macau: Scalable Bayesian Multi-relational Factorization with Side Information using MCMC,” ArXiv e-prints, Sep. 2015.
  • [8] A. P. Bento, A. Gaulton, A. Hersey, L. J. Bellis, J. Chambers, M. Davies, F. A. Krüger, Y. Light, L. Mak, S. McGlinchey et al., “The chembl bioactivity database: an update,” Nucleic acids research, vol. 42, no. D1, pp. D1083–D1090, 2014.
  • [9] Rdkit: Open-source cheminformatics. [Online]. Available: http://www.rdkit.org
  • [10] P. Zakeri, B. Jeuris, R. Vandebril, and Y. Moreau, “Protein fold recognition using geometric kernel data fusion,” Bioinformatics, p. btu118, 2014.