Understanding the type of inhibitory interaction plays an important role in drug design. Therefore, researchers are interested in knowing whether a drug has a competitive or non-competitive interaction with particular protein targets. Method: to analyze the interaction types we propose the factorization method Macau, which allows us to combine different measurement types into a single tensor together with proteins and compounds. The compounds are characterized by high-dimensional 2D ECFP fingerprints. The novelty of the proposed method is that, using a specially designed noise injection MCMC sampler, it can incorporate high-dimensional side information, i.e., millions of unique 2D ECFP compound features, even for large-scale datasets of millions of compounds. Without the side information, in this case, the tensor factorization would be practically futile. Results: using public IC50 and Ki data from ChEMBL we trained a model from which we can identify the latent subspace separating the two measurement types (IC50 and Ki). The results suggest the proposed method can detect the competitive inhibitory activity between compounds and proteins.
In pharmaceutical research two properties, affinity and potency, are commonly measured in vitro. Affinity measures the concentration of the tested substance needed to occupy a given portion of the available binding sites. It is usually quantified by the inhibitory constant Ki, which has an intuitive meaning: the free substance concentration in chemical equilibrium at which half of the available binding sites are occupied. This value is independent of the concentration of the protein.
Potency measures the concentration of the substance needed to reach a given level of effect. In the case of inhibitory compounds it is most often measured by the relative IC50: the concentration of the substance at which the halfway activity between the baseline and the maximal inhibition is reached.
Assuming Michaelis-Menten kinetics, it is easy to see that pIC50 = pKi if there is no interaction between the binding processes, even in the case of partial inhibitors. This can be true when there is only one substance binding to the binding site of interest, so that the binding site of the inhibitor is allosteric to the binding site where the effect is measured, illustrated on the left panel of Fig. 1. If, however, there is competition between the inhibitor and an endogenous ligand (illustrated on the right panel of Fig. 1), the IC50 will depend on the concentration of the endogenous ligand and on its interaction with the protein. The relation in the logarithmic scale (the Cheng-Prusoff equation) has the form

pIC50 = pKi - c, (1)

where c = log10(1 + [S]/KM), [S] is the concentration of the endogenous substrate S, KM is the Michaelis constant of the enzyme with respect to S, and c is an additive constant with respect to the compound that depends on the experimental conditions. An additive relation of this form holds in the case of uncompetitive inhibition as well (with a different expression for c).
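The competitive shift in Eq. 1 can be sketched numerically. The following Python snippet (the function name is our own, illustrative) computes pIC50 from pKi via the Cheng-Prusoff relation for a competitive inhibitor:

```python
import math

def pic50_from_pki(pki, s_conc, km):
    """Competitive inhibition (Cheng-Prusoff): IC50 = Ki * (1 + [S]/Km),
    so in log10 scale pIC50 = pKi - log10(1 + [S]/Km)."""
    c = math.log10(1.0 + s_conc / km)   # additive shift, depends on assay conditions
    return pki - c

# No competition ([S] = 0): the shift vanishes and pIC50 == pKi.
assert pic50_from_pki(8.0, s_conc=0.0, km=10.0) == 8.0

# Competitive case with [S] = Km: the shift is log10(2), about 0.301.
print(8.0 - pic50_from_pki(8.0, s_conc=10.0, km=10.0))
```

The shift c is what the joint model described below tries to capture from data.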
We want to build a Bayesian model for Eq. 1 by jointly modeling both pIC50 and pKi, using data from different proteins and compounds. The joint model can capture whether the deviation term c is zero (corresponding to non-competitive interaction) or non-zero (corresponding to competitive interaction). Therefore, we propose a model that uses a low-rank approximation of the 3-way tensor Y, where Y_ijk is the measured inhibition of protein i with compound j for the given inhibition measure k (either pIC50 or pKi):

Y_ijk ≈ 1^T (u_i ∘ v_j ∘ w_k), (2)

where u_i, v_j and w_k are the respective D-dimensional latent vectors, ∘ denotes the element-wise product and 1 is a vector of ones. The tensor is thin as there are only two measure types.
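As a concrete sketch of this factorization, the following NumPy snippet builds the rank-D reconstruction of the thin tensor (toy sizes and random latent vectors, purely illustrative; the real latent vectors are learned from data):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8                       # number of latent dimensions (illustrative)
n_prot, n_comp = 5, 7       # toy sizes

U = rng.normal(size=(n_prot, D))   # protein latent vectors u_i
V = rng.normal(size=(n_comp, D))   # compound latent vectors v_j
W = rng.normal(size=(2, D))        # one latent vector per measure type (IC50, Ki)

# Model prediction: Y_ijk ~= sum_d U[i,d] * V[j,d] * W[k,d]
Y_hat = np.einsum('id,jd,kd->ijk', U, V, W)

# Single-entry check against the explicit sum over latent dimensions
i, j, k = 2, 3, 1
assert np.isclose(Y_hat[i, j, k], np.sum(U[i] * V[j] * W[k]))
```

Because the third mode has only two slices, the difference between the two slices of W is exactly what can absorb the additive term c of Eq. 1.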
It is a natural property of tensor factorization that additive and uncorrelated effects will occupy different latent dimensions. This is the case for Eq. 1 and, thus, we expect a few specific latent dimensions to capture the difference term c.
However, one important issue with the factorization (2) is that the tensor is very sparsely observed, with less than 1% of the values measured. To address this issue we propose to use 2D ECFP chemical features as side information for the compounds while performing the factorization (2). The ECFP features represent the substructures of a compound, which are strongly related to its interaction properties. The proposed model learns a link matrix β that predicts the latent vectors v_j from the feature vectors x_j, where F is the number of side information features. As will become evident from the experiments, the incorporation of ECFP features gives a very strong improvement in the accuracy of the model. The general setup of the model is depicted in Fig. 2.
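The role of the link matrix can be illustrated as follows. This is a hedged sketch with toy dimensions and a random β; in the real model β is learned and the features are sparse binary ECFP fingerprints:

```python
import numpy as np

rng = np.random.default_rng(1)
F, D = 1000, 8              # feature and latent dimensions (toy values)

# Sparse binary ECFP-style feature vector: a handful of active substructures
x = np.zeros(F)
x[rng.choice(F, size=12, replace=False)] = 1.0

beta = rng.normal(scale=0.1, size=(F, D))   # link matrix (learned in the real model)
mu = rng.normal(size=D)                      # shared intercept

# Prior mean of the compound's latent vector: mu + beta^T x.
# Because x is sparse, only the rows of beta at active features contribute.
v_mean = mu + beta.T @ x
active = np.flatnonzero(x)
assert np.allclose(v_mean, mu + beta[active].sum(axis=0))
```

This is how a compound with few or no measurements still receives an informed latent vector: its prior mean is predicted from its substructure features.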
To sample from this model we develop a Bayesian factorization approach, Macau, based on Gibbs sampling. The novelty of Macau is that it can scale well with respect to the number of features F. This is crucial for the current application as the dimension of the ECFP features is more than 100,000. In contrast, the recent sampling-based Bayesian factorization methods Bayesian Matrix Factorization with Side Information (BMFSI)  and Bayesian Canonical PARAFAC with Features and Networks (BCPFN)  require explicitly computing the F × F covariance matrix for linking the features to the factorization. Additionally, to sample the link vector a Cholesky decomposition is necessary, taking at least O(F^3) time, which makes these methods intractable for high-dimensional feature spaces.
Instead, Macau samples the link matrix β without explicitly computing its covariance matrix, by using a specially designed noise injection sampler that we describe in Section III-A.
In this section we first describe the statistical model and then explain related work in the tensor factorization literature.
The observation noise is assumed to be Gaussian, as was proposed in Bayesian Probabilistic Matrix Factorization :

Y_ijk ~ N(1^T (u_i ∘ v_j ∘ w_k), α^{-1}) for (i, j, k) ∈ Ω, (3)

where N is the normal distribution, Ω is the set of tensor cells whose value has been observed and α is the noise precision.
Next we place a common multivariate Gaussian prior on the latent vectors of compounds:

v_j ~ N(μ_v + β^T x_j, Λ_v^{-1}), (4)

where μ_v and Λ_v are the shared intercept vector and precision matrix for all compounds. The term β^T x_j allows the prior mean of v_j to depend on the compound features and, therefore, gives informed predictions also for compounds with very few or no measurements. We use an analogous prior for the latent vectors of proteins u_i and measurement types w_k. For proteins we use basic sequence-based features, described in more detail in Section IV-A. Without the term β^T x_j in the prior (4) our model is equivalent to Bayesian Probabilistic Tensor Factorization . Originally, the adjustment of the prior mean was proposed by Agarwal et al.  for non-Bayesian models. Recently, this side information adjustment was used for sparse tensor factorization in BCPFN .
The standard approach in previous works, for example used by Rai et al. , has been to place a zero-mean Gaussian prior on β, i.e., vec(β) ~ N(0, λ_β^{-1} I), where I is the identity matrix and λ_β is a precision value.
Instead, we propose a prior whose scale depends on Λ_v, making the strength of the prior invariant to the scale of the latent variables:

vec(β) ~ N(0, λ_β^{-1} (Λ_v^{-1} ⊗ I_F)), (5)

where ⊗ is the Kronecker product. This prior has the advantage of being well suited for sampling, as we show in the next section.
We develop a block Gibbs sampler for the model. For most variables the samplers are straightforward and, due to space restrictions, are left to an extended version of the paper . The crucial question is how to sample β, whose dimensions F × D are large in our experiments (F > 100,000) and can be even larger for bigger datasets.
The main idea of our noise injection sampler is that we construct a specific linear system whose solution corresponds to a sample from the conditional probability of β. For large feature dimension F the linear system of size F × F will be solved without explicitly computing the system matrix, by using iterative solvers, which only require (sparse) matrix multiplication operations.
The conditional probability of β has the form

p(β | ·) ∝ p(Ṽ | β, ·) p(β | ·), (6)

where for conciseness we have dropped the conditional terms after β and use the shorthands Ṽ = V - 1μ_v^T and K = X^T X + λ_β I_F. From this we can derive the Gaussian mean and precision of vec(β) as

E[vec(β)] = vec(K^{-1} X^T Ṽ),  Prec[vec(β)] = Λ_v ⊗ K. (7)

A sample from this conditional can then be obtained by solving the linear system

K β = X^T (Ṽ + E_1) + √λ_β E_2 (8)

with respect to β, where E_1 and E_2 are noise matrices in which each row is drawn from N(0, Λ_v^{-1}). We call (8) the noise injection sampler as it adds noise to the right-hand sides of the linear systems. The correctness of the noise injection sampler is proved in the extended version, see . The RHS of the system (8) has dimension F × D, i.e., there is a total of D systems, which can be solved independently, using an embarrassingly parallel approach.
We propose to use the conjugate gradient method to solve (8). In its iterations conjugate gradient only requires the multiplication of X^T X + λ_β I with a dense vector b. This allows us to handle side information with millions of sparse features, because the multiplication can be expressed as two sparse-matrix-vector products and an addition:

(X^T X + λ_β I) b = X^T (X b) + λ_β b.
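The matrix-free matvec and a basic conjugate gradient loop can be sketched as follows (plain NumPy; a thinly populated dense X stands in for the sparse ECFP matrix, and the F × F matrix is never formed):

```python
import numpy as np

rng = np.random.default_rng(3)
N, F = 200, 5000
lam = 2.0

# Sparse binary feature matrix (about 1% nonzeros); dense here for brevity.
X = (rng.random((N, F)) < 0.01).astype(float)
b = rng.normal(size=F)

def matvec(v):
    # (X^T X + lam I) v as two matrix-vector products plus an addition
    return X.T @ (X @ v) + lam * v

def conjugate_gradient(matvec, rhs, tol=1e-8, max_iter=1000):
    x = np.zeros_like(rhs)
    r = rhs - matvec(x)
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

x = conjugate_gradient(matvec, b)
assert np.allclose(matvec(x), b, atol=1e-5)
```

Since the system matrix is λ_β I plus a matrix of rank at most N, conjugate gradient converges quickly even for very large F.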
In an analogous way, we introduced side information for the proteins.
The software implementation of Macau is freely available at https://github.com/jaak-s/BayesianDataFusion.jl.
The inhibitory activities were obtained from ChEMBL  version 19. The tensor has two measurement slices (IC50 and Ki) over the protein and compound modes, with 59,280 observed values for IC50 and 5,121 for Ki.
We generate ECFP features of radius 3 for the compounds using the RDKit software. These features are sparse and high-dimensional, with F exceeding 100,000 unique features.
To have a basic description of the proteins we use features based on information extracted directly from position-specific frequency matrices (PSFM). Accordingly, we link each protein to a 20-dimensional feature vector created by summing up each column of the PSFM profile, as done in prior work .
In this subsection we present empirical results on the ChEMBL dataset. In all experiments we used Macau with D latent dimensions.
We compare Macau with and without side information in a hold-out validation setting, which we repeat 10 times. On each run, we randomly select 20% of the measurements as the test set. The mean and standard deviation of the RMSEs show that the side information plays a crucial role in obtaining accurate predictions; see Table I.
|Method||RMSE (mean (std))|
|Macau with SI||0.5970 (0.0054)|
|Macau without SI||0.8703 (0.0085)|
We compare the latent vectors of IC50 and Ki, i.e., w_IC50 and w_Ki. Figure 3 depicts the normalized values of the latent dimensions in the two vectors w_IC50 and w_Ki, from a single posterior sample (for normalization, both w_IC50 and w_Ki are multiplied element-wise by the Euclidean norms of the corresponding latent dimensions of the compounds and proteins). As we can see from the figure, in three latent dimensions, colored green, red and yellow, the values corresponding to IC50 and Ki differ significantly. The difference term c in (1) is captured by these three latent dimensions, as expected.
We used the identified three latent dimensions from the previous section to rank the proteins according to their predicted binding behavior. Let s be an indicator vector whose d-th element is 1 exactly if the d-th latent dimension is one of the 3 selected. We computed the absolute difference between the predicted pIC50 and pKi values,

δ_ij = |1^T (u_i ∘ v_j ∘ s ∘ (w_IC50 - w_Ki))|,

where w_IC50 is the latent vector for IC50 and w_Ki is the latent vector for Ki. Since our model is Bayesian, we average δ_ij over all posterior samples, as in the previous sections, and denote this average by δ̄_ij.
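The ranking computation can be sketched as follows, with random stand-ins for the posterior samples of the latent vectors (in the real analysis these come from the Gibbs sampler) and the per-protein 0.95 quantile used as the competitiveness score:

```python
import numpy as np

rng = np.random.default_rng(4)
n_prot, n_comp, D = 6, 40, 8
n_samples = 20                         # posterior samples (toy)

sel = np.zeros(D)
sel[[1, 4, 6]] = 1.0                   # indicator s of the 3 selected latent dims

deltas = np.zeros((n_samples, n_prot, n_comp))
for smp in range(n_samples):
    U = rng.normal(size=(n_prot, D))   # stand-in posterior sample of protein vectors
    V = rng.normal(size=(n_comp, D))   # stand-in posterior sample of compound vectors
    w_ic50 = rng.normal(size=D)
    w_ki = rng.normal(size=D)
    # |sum_d u_id v_jd s_d (w_IC50,d - w_Ki,d)| for every protein-compound pair
    deltas[smp] = np.abs(np.einsum('id,jd,d->ij', U, V, sel * (w_ic50 - w_ki)))

delta_bar = deltas.mean(axis=0)        # posterior average per pair

# 0.95 quantile over compounds: a robust per-protein competitiveness score
score = np.quantile(delta_bar, 0.95, axis=1)
ranking = np.argsort(score)[::-1]      # proteins, most competitive first
assert ranking.shape == (n_prot,)
```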
Next, we computed the 0.95 quantile of δ̄_ij over the compounds for each protein i, which is a robust estimate of the maximal difference term c in (1) for protein i. We ordered all the proteins according to this quantile and extracted the top and the bottom 10 of the list. We expect the top 10 proteins, shown in Table II, to have strong competitive interactions. As can be seen from the table, all 10 are enzymes, which makes it probable that our expectation is correct, because enzymes usually have some strongly competitive inhibitors. Secondly, we expect the bottom 10 proteins, shown in Table III, to be generally non-competitive, which also seems to be the case, because there are 7 receptors and 2 ion channels. To understand what IC50 measures for receptors, one has to know how the assays are designed. In the assays of receptors and ion channels, the inhibitory response is measured on the intracellular end of the signal, which is non-competitive, even though on the extracellular side there could be competition for the binding site. An alternative explanation is that the proposed model detects deviations from Michaelis-Menten kinetics.
As the tables were constructed using only the selected 3 latent dimensions, this shows that the tensor factorization has the ability to extract the differential term into a few latent dimensions.
In this experiment we attempt a challenging task: to predict whether a compound-protein pair has a competitive interaction (a non-zero c in Eq. 1). We select 30 compound-protein pairs for the test set, for which both IC50 and Ki measurements are available. 10 pairs are randomly selected from the 100 pairs having the highest positive measured difference, the next 10 from the 100 pairs having the highest negative difference, and the final 10 from the 100 pairs having the least absolute difference. From the Macau predictions (using all latent dimensions) we compute the predicted difference for the pairs in the test set.
Macau is able to capture signal from the data: the mean predicted absolute difference for the competitive pairs is 0.671 and for the non-competitive pairs is 0.234. The difference is statistically significant at the 1% level (with a t-test p-value of 0.0047).
|ChEMBL ID||Protein name|
|CHEMBL284||Dipeptidyl peptidase IV|
|CHEMBL325||Histone deacetylase 1|
|CHEMBL260||MAP kinase p38 alpha|
|CHEMBL1865||Histone deacetylase 6|
|CHEMBL1937||Histone deacetylase 2|
|CHEMBL289||Cytochrome P450 2D6|
|CHEMBL4005||PI3-kinase p110-alpha subunit|
|CHEMBL1978||Cytochrome P450 19A1|
|CHEMBL4793||Dipeptidyl peptidase IX|
|ChEMBL ID||Protein name|
|CHEMBL3772||Metabotropic glutamate receptor 1|
|CHEMBL5145||Serine/threonine-protein kinase B-raf|
|CHEMBL3663||Growth factor receptor-bound protein 2|
|CHEMBL4641||Voltage-gated T-type calcium channel alpha-1G subunit|
|CHEMBL3230||Sphingosine 1-phosphate receptor Edg-6|
|CHEMBL2001||Purinergic receptor P2Y12|
|CHEMBL1785||Endothelin receptor ET-B|
|CHEMBL287||Sigma opioid receptor|
|CHEMBL3227||Metabotropic glutamate receptor 5|
In this paper we presented a novel scalable Bayesian tensor factorization method with side information, and applied it to a new problem: the prediction of drug-protein interaction types.
We showed that the method can capture the difference between binding affinity and potency from highly sparse data. With this model we can predict the dominant interaction types of the proteins and the competitiveness of the inhibition for compound-protein pairs.
Future work includes applying this factorization model to large-scale industrial datasets and making it directly usable in the drug design process. We also plan to extend the model to slow, tight-binding inhibitors.
The authors thank Thierry Louat for insightful discussions. Jaak Simm, Adam Arany, Pooya Zakeri and Yves Moreau are funded by the Research Council KU Leuven (CoE PFV/10/016 SymBioSys) and by the Flemish Government (IOF, Hercules Stichting, iMinds Medical Information Technologies SBO 2015, IWT: O&O ExaScience Life Pharma; ChemBioBridge, PhD grants).
AAAI Conference on Artificial Intelligence, 2015.
R. Salakhutdinov and A. Mnih, “Bayesian probabilistic matrix factorization using Markov chain Monte Carlo,” in Proceedings of the 25th International Conference on Machine Learning. ACM, 2008, pp. 880–887.