I Introduction
Proteins are the functional units within an organism that form molecular machines, organized by their protein-protein interactions (PPIs), to carry out many biological and molecular processes. The study of PPIs not only plays a crucial role in understanding biological phenomena but also provides insights into the molecular etiology of diseases and the discovery of putative drug targets [22]. Although a large number of reliable PPIs have been experimentally identified, discovering all possible PPIs through experiments is intractable. Thus, efficient computational methods capable of accurately predicting PPIs and speeding up new PPI discovery are needed.
The primary structure of a protein, the protein sequence, determines the protein’s unique three-dimensional shape, giving rise to the assumption that knowledge of the protein sequence alone might be sufficient to model the interaction between two proteins [2, 27]. There is a long-standing interest in predicting PPIs from protein sequences, which are by far the most abundant data available for proteins [29]. Traditional methods for modeling PPIs extract features based on domain expertise and train a machine learning model on these features [12, 29, 30]. The performance of these methods relies heavily on the ability to select appropriate features, and the extracted features often lack sufficient information about the interactions.

Recently, state-of-the-art deep learning methods with a Siamese architecture have been proposed to automatically extract features from sequences and model interactions in an end-to-end framework [15, 4]. A Siamese architecture for PPI prediction consists of two identical neural networks, each taking one of the protein sequences, and models their mutual influence. Although these methods perform quite well on small datasets, they rely on a pairwise encoding process that is computationally inefficient for a large number of interactions [16]. This imposes significant memory and runtime limitations when training on large datasets.

In this work, we propose a novel method to address these challenges. Specifically, we aim to learn a bidirectional GRU (BiGRU) model that maps variable-length sequences to a sequence of vector representations, one per amino acid position, that encode the sequential and contextual properties of amino acids. Since proteins interact with other proteins that perform different functions within a cell and even have different sequence patterns, we further encode the representation as a Gaussian distribution instead of a single point to capture the uncertainty of the representation. We then define a cost function that incorporates contrastive criteria between the latent Gaussian distributions, such that the similarity between these distributions effectively captures complex interactions between proteins. This allows our model to minimize the statistical distance between interacting proteins while maximizing the distance between non-interacting proteins.
Alongside making accurate PPI predictions, it is crucial to have interpretable models that help domain experts understand how individual amino acids in the sequence contribute to the model’s decisions. However, none of the state-of-the-art methods provides such interpretability, which limits their practicality from a biological perspective. Since only a few amino acids in the interface region of a sequence are involved in interactions with other proteins [28], we design a sparse and structured gating mechanism that guides the model to selectively focus on a few amino acids in the sequence. The sparse gating mechanism outputs sparse weights, one per amino acid position, that explain how much each amino acid contributes and thus enhance interpretability.
Experimental results show that our method outperforms state-of-the-art methods on two challenging datasets: yeast and human PPIs from the BioGRID interaction database. Finally, we demonstrate that the learned sparse gate values correspond to biologically interpretable protein motifs. A literature-based case study illustrates that our model effectively learns to identify the important residues in the sequence.
II Related work
Traditional methods focus on extracting features from protein sequences, such as autocovariance (AC) [12], conjoint triad (CT) [27], and composition-transition-distribution (CTD) [29] descriptors, and training a binary classifier on these features to predict PPIs [12, 30]. Since these extracted features only summarize specific aspects of protein sequences, such as physicochemical properties, frequencies of local patterns, and the positional distribution of amino acids, they lack sufficient information about the interactions.

Recently, deep learning architectures have been developed to address PPI prediction by automatically extracting useful features from protein sequences [15, 4]. These methods adopt deep Siamese-like neural networks to model the mutual influence between protein sequences. They use an encoder based on a convolutional neural network (CNN) to capture local features and a recurrent neural network (RNN) to capture sequential and contextualized features from protein sequences. The encoder encodes a pair of sequences into lower-dimensional sequence vectors, and a binary classifier predicts the probability of interaction based on these sequence vectors.
Specifically, DPPI [15] uses a deep CNN-based Siamese architecture that focuses on capturing local patterns from protein evolutionary profiles [13]. However, it requires extensive data preprocessing, specifically the construction of evolutionary profiles from protein sequences using Position-Specific Iterative BLAST (PSI-BLAST) [1]. To construct an evolutionary profile for a protein sequence, PSI-BLAST searches against the NCBI non-redundant protein database with nearly 184 million sequences, which is time-consuming and does not scale to a large number of protein sequences. In contrast, PIPR [4] incorporates a deep residual recurrent convolutional neural network (RCNN) for PPI prediction using only the sequences of the protein pair. PIPR uses a residual RCNN encoder that combines a CNN to capture local features with an RNN to capture sequential and contextualized features from protein sequences. The sequence representation for each protein is obtained by feeding its sequence through the deep sequence encoder with multiple layers of RCNN units. The non-intuitive mapping from protein sequences to their sequence representations makes the representations difficult to interpret.
III Method
A protein sequence $s = (a_1, a_2, \ldots, a_L)$ is a list of amino acids, where $a_i$ is the amino acid at position $i$ and $L$ is the length of the sequence. We aim to map a protein sequence to a lower-dimensional Gaussian distribution of the form $z = \mathcal{N}(\mu, \Sigma)$, where $\mu \in \mathbb{R}^{d}$ and $\Sigma \in \mathbb{R}^{d \times d}$, with $d$ denoting the dimension of the Gaussian distribution. We employ the Wasserstein distance as a measure of how different the encoded Gaussian distributions of two protein sequences are. In this work, the goal of PPI prediction is to learn a model that predicts a smaller Wasserstein distance between the Gaussian distributions of interacting proteins and a larger distance for non-interacting proteins.
We introduce our deep learning framework for PPI prediction from sequences. The overall architecture of the proposed framework is illustrated in Figure 1. In step 1, the sequence encoder uses a bidirectional gated recurrent unit (BiGRU) to encode the amino acid sequence into a sequence of vector representations. In step 2, the importance gate models dependencies between all positions in the sequence, which could allow it to directly model residue-residue dependencies, and computes the importance of the amino acid at each position from the sequence of vector representations. In step 3, the representation of the sequence is encoded into a Gaussian representation with mean $\mu$ and covariance $\Sigma$. In step 4, a pairwise ranking loss is employed to minimize the statistical distance between interacting proteins while maximizing the distance between non-interacting proteins.
III-A Sparse gating sequence encoder
III-A1 Protein sequence encoder
In step 1, as shown in Figure 1, the encoder takes a sequence of amino acids and encodes it into a sequence of vector representations, one per amino acid position. In particular, the one-hot encoded representation of amino acid $a_i$ is embedded into a vector representation $x_i$ through an embedding matrix:
$$x_i = W_e\, a_i \tag{1}$$
where $W_e$ is the weight of the embedding layer.
To learn the sequential and contextualized representation of the amino acids in the sequence $s$, we adopt the bidirectional gated recurrent unit (BiGRU), which summarizes the sequence information from both directions. The sequence of vector representations $(x_1, \ldots, x_L)$ is fed into the BiGRU. It contains two encoding processes: a forward encoding, which processes the sequence from position 1 to $L$, and a backward encoding, which processes the sequence from position $L$ to 1:
$$\overrightarrow{h}_i = \overrightarrow{\mathrm{GRU}}(x_i, \overrightarrow{h}_{i-1}), \qquad \overleftarrow{h}_i = \overleftarrow{\mathrm{GRU}}(x_i, \overleftarrow{h}_{i+1}), \qquad h_i = [\overrightarrow{h}_i ; \overleftarrow{h}_i] \tag{2}$$
where $h_i$ represents the GRU hidden state for amino acid $a_i$. The representation of amino acid $a_i$ is the concatenation of the forward hidden state $\overrightarrow{h}_i$ and the backward hidden state $\overleftarrow{h}_i$, which summarizes the information of the whole sequence centered around $a_i$.
III-A2 Sparse gating mechanism
Proteins bind to each other at specific binding domains on each protein. These domains can be just a few peptides long or span hundreds of amino acids. To capture this, we introduce additional gates that indicate the activation of each amino acid. The amino acid $a_i$ is active if $g_i > 0$ and inactive if $g_i = 0$. The gate value for the amino acid $a_i$ satisfies $g_i \in [0, 1]$, with 1 representing high importance. Let $\mathcal{A} = \{i : g_i > 0\}$ be the set of amino acids that are active, as indicated by their respective gate values. We obtain the representation of a protein sequence by scaling the hidden states with their respective gates:
$$v = \big\Vert_{i \in \mathcal{A}} \tilde{h}_i \tag{3}$$
$$\tilde{h}_i = g_i \odot h_i \tag{4}$$
where $\Vert$ represents the concatenation of the sequence vectors for positions with $g_i > 0$, and $\odot$ denotes the element-wise product between the gate value $g_i$ and the GRU hidden state $h_i$. The sparse gate values lead to a sparse representation of the sequence: for the positions with $g_i = 0$, the hidden states are not included in the representation $v$. The sparse gates thus act as controllers that selectively activate part of the network so that it accounts only for a subset of the amino acids in the sequence.
We introduce an auxiliary network that takes the GRU hidden states and generates the gate values for each position to determine whether the amino acid at that position is important for PPI prediction. The auxiliary network models the long-range pairwise dependencies between amino acids: it explicitly considers dependencies between all positions in the sequence, which could allow it to directly model residue-residue dependencies. In step 2 of Figure 1, the GRU hidden states are transformed to scores as:
(5) 
Here the weight matrices and biases are those of the linear layers. Let $u = (u_1, \ldots, u_L)$ be the vector of scores for the amino acids in the sequence $s$. Next, we convert the vector of scores $u$ into a gate vector $g$ so that $g_i \geq 0$ and $\sum_{i=1}^{L} g_i = 1$. This allows us to quantify the relative contribution of each amino acid to the sequence representation. The softmax function is a simple choice for mapping the vector $u$ to a probability distribution, defined as:
$$\mathrm{softmax}_i(u) = \frac{\exp(u_i)}{\sum_{j=1}^{L} \exp(u_j)} \tag{6}$$
Since the resulting softmax distribution has full support, i.e., $\mathrm{softmax}_i(u) > 0$ for every position $i$ of any score vector $u$, it forces all the amino acids in the sequence to receive some probability mass. Because the softmax distribution assigns a weight to every amino acid, and even unimportant amino acids receive small positive weights, the weights on important amino acids become much smaller for long sequences, leading to degraded performance. However, not all of the amino acids in a sequence contribute to a given function or interaction.
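The dilution effect can be illustrated with a minimal sketch in plain Python. The score values (3.0 for one important position, 0.0 elsewhere) and the helper names `softmax` and `top_weight` are assumptions chosen for this example only; they are not taken from our implementation.

```python
import math


def softmax(scores):
    """Map a list of real-valued scores to a probability distribution."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_weight(length, high=3.0, low=0.0):
    """Softmax weight of one high-scoring position among `length` positions.

    `high` and `low` are hypothetical scores used only for illustration.
    """
    scores = [high] + [low] * (length - 1)
    return softmax(scores)[0]


# The same score receives far less weight in a long sequence than in a short one.
weight_short = top_weight(10)
weight_long = top_weight(1000)
```

In this toy setting the important position keeps most of the mass in a sequence of 10 residues but only a few percent in a sequence of 1000 residues, even though its score is unchanged.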
We introduce a sparsity regularization on the gate vector $g$ to select only the important set of amino acids, which may correspond to the interface and may be important for interaction prediction. We employ sparsemax [23] as the gate mechanism:
$$g = \mathrm{sparsemax}(u) = \operatorname*{argmin}_{p \in \Delta^{L-1}} \tfrac{1}{2} \lVert p - u \rVert_2^2 \tag{7}$$
where $u$ is the vector of scores, $\Delta^{L-1} = \{p \in \mathbb{R}^{L} : p_i \geq 0, \sum_{i} p_i = 1\}$, and sparsemax represents the Euclidean projection of $u$ onto the $(L-1)$-dimensional probability simplex. The projection is likely to hit the boundary of the simplex, leading to sparse outputs, which allows the encoder to select only an informative subset of amino acids in the sequence. Although sparsemax leads to a sparse representation, the sparse weights may capture only a few important amino acids and may fail to identify relatively long stretches of amino acids.
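The simplex projection can be computed exactly with the sort-based procedure described for sparsemax in [23]. The sketch below is a plain-Python illustration under that assumption; the function name `sparsemax` and the list-based interface are chosen for this example and do not reflect our released code.

```python
def sparsemax(scores):
    """Euclidean projection of `scores` onto the probability simplex.

    Sort-based procedure: find the largest support size k for which
    1 + k * z_(k) > sum of the k largest scores, then subtract the
    resulting threshold tau and clip at zero.
    """
    z_sorted = sorted(scores, reverse=True)
    cumulative = 0.0
    tau = 0.0
    for k, z in enumerate(z_sorted, start=1):
        cumulative += z
        if 1.0 + k * z > cumulative:
            tau = (cumulative - 1.0) / k  # threshold for a support of size k
        else:
            break  # the support condition fails for all larger k
    return [max(s - tau, 0.0) for s in scores]


# Example scores (hypothetical): the lowest-scoring position gets exactly zero weight.
gates = sparsemax([1.0, 0.8, -2.0])
```

Unlike softmax, the output contains exact zeros, so positions with low scores are removed from the representation instead of receiving a small positive weight.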
Furthermore, to enable our model to selectively focus on relatively longer stretches of amino acids, we use fusedmax regularization [24], which not only produces sparse representations but also encourages the encoder to assign equal weights to contiguous sets of amino acids in the sequence while predicting interactions:
$$\mathrm{fusedmax}(u) = \operatorname*{argmin}_{p \in \Delta^{L-1}} \tfrac{1}{2} \lVert p - u / \gamma \rVert_2^2 + \lambda \sum_{i=1}^{L-1} \lvert p_{i+1} - p_i \rvert \tag{8}$$
where $u$ is the vector of scores, $\Delta^{L-1}$ is the probability simplex, $\gamma$ controls the regularization strength, and $\lambda$ is the tuning parameter that balances producing sparse outputs against assigning equal weights to the amino acids within each segment. In Eq. 8, the first term projects the scores onto the probability simplex and the second term encourages paying equal attention to adjacent amino acids in the sequence. This allows the model to identify long stretches of amino acids that may bind with residues from the interacting proteins.
III-A3 Encoding sequences as Gaussian distributions
Within an organism, a given protein may be involved in a complex interplay with various proteins that perform different functions within a cell and even have different sequence patterns. Such differences should be reflected in the uncertainty of its representation $v$. To model the uncertainty of the representation [3, 31], the sequence representation $v$ is encoded into a mean $\mu$ and a variance $\sigma$ in the final layer of the architecture, as shown in step 3 of Figure 1. To ensure that the covariance matrix is positive definite, we use the Exponential Linear Unit (ELU) [5] in the final layer:
$$\mu = W_{\mu} v + b_{\mu} \tag{9}$$
$$\sigma = \mathrm{ELU}(W_{\sigma} v + b_{\sigma}) + 1 \tag{10}$$
where $W_{\mu}$, $W_{\sigma}$ and $b_{\mu}$, $b_{\sigma}$ denote the weight matrices and biases of the linear layers that project the intermediate representation $v$ to the mean $\mu$ and variance $\sigma$ of the Gaussian representation of the sequence $s$. Since ELU outputs values greater than $-1$, adding 1 keeps every variance strictly positive. We represent a protein sequence by a $d$-dimensional Gaussian distribution $z = \mathcal{N}(\mu, \Sigma)$, where $\mu$ represents the center point of the sequence representation in the latent space and $\Sigma = \mathrm{diag}(\sigma)$ represents the uncertainty associated with the representation. With the assumption that the different dimensions of the latent Gaussian distribution are independent of each other, the covariance matrix is a diagonal matrix.
III-B Loss function definition
We employ the Wasserstein distance to measure the similarity between the Gaussian distributions of the proteins to make PPI prediction. Wasserstein distance allows the model to capture the transitivity property of PPIs that measures the tendency of proteins to cluster together into functional modules and protein complexes [21].
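The 2-Wasserstein distance between two probability measures $P_1$ and $P_2$ is defined as:
$$W_2(P_1, P_2) = \left( \inf \mathbb{E}\left[ \lVert X - Y \rVert_2^2 \right] \right)^{1/2} \tag{11}$$
where $\mathbb{E}[Z]$ denotes the expected value of a random variable $Z$ and the infimum is taken over all joint distributions of random variables $X$ and $Y$ with marginals $P_1$ and $P_2$ respectively. The Wasserstein distance is a well-defined metric that satisfies both symmetry and the triangle inequality [10].

The Wasserstein distance has a closed-form solution for two multivariate Gaussian distributions. This allows us to employ the 2-Wasserstein distance (abbreviated as $W_2$) as the similarity measure between the latent Gaussian distributions $z_i = \mathcal{N}(\mu_i, \Sigma_i)$ and $z_j = \mathcal{N}(\mu_j, \Sigma_j)$:
$$\mathrm{dist} = W_2(z_i, z_j) \tag{12}$$
$$\mathrm{dist}^2 = \lVert \mu_i - \mu_j \rVert_2^2 + \operatorname{Tr}\left( \Sigma_i + \Sigma_j - 2 \left( \Sigma_i^{1/2} \Sigma_j \Sigma_i^{1/2} \right)^{1/2} \right) \tag{13}$$
Since we focus on diagonal covariance matrices, $\Sigma_i \Sigma_j = \Sigma_j \Sigma_i$, and thus:
$$W_2(z_i, z_j)^2 = \lVert \mu_i - \mu_j \rVert_2^2 + \lVert \Sigma_i^{1/2} - \Sigma_j^{1/2} \rVert_F^2 \tag{14}$$
According to the above equation, the time complexity of computing $W_2$ between two multivariate Gaussian distributions is linear in the embedding dimension $d$. Since the computation of $W_2$ no longer constitutes a computational challenge, we choose $W_2$ as the measure of distance.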
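As an illustration of the closed form for diagonal covariances, the following plain-Python sketch computes the distance in a single pass over the embedding dimensions. The function name `w2_diagonal` and the representation of each Gaussian as a list of means plus a list of per-dimension variances are assumptions made for this example.

```python
import math


def w2_diagonal(mu1, var1, mu2, var2):
    """2-Wasserstein distance between N(mu1, diag(var1)) and N(mu2, diag(var2)).

    For diagonal covariances the matrix square root is the element-wise
    standard deviation, so the cost is linear in the number of dimensions.
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(mu1, mu2))
    cov_term = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(var1, var2))
    return math.sqrt(mean_term + cov_term)


# Same covariance: the distance reduces to the Euclidean distance of the means.
d_means_only = w2_diagonal([0.0, 0.0], [1.0, 1.0], [3.0, 4.0], [1.0, 1.0])
# Same mean: only the difference in standard deviations contributes.
d_vars_only = w2_diagonal([0.0, 0.0], [4.0, 4.0], [0.0, 0.0], [1.0, 1.0])
```

Two identical Gaussians have distance 0, which is the property that motivates the separate treatment of homodimers in the prediction step.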
We use a pairwise ranking formulation with respect to the Wasserstein distance $W_2$ to model PPIs:
$$W_2(z_i, z_j) < W_2(z_i, z_k) \tag{15}$$
where $z_i$ is the latent Gaussian distribution of sequence $s_i$, and $(i, j)$ and $(i, k)$ represent a positive and a negative interaction respectively. Specifically, the idea of the ranking formulation is to penalize ranking errors based on the Wasserstein distances between the pairs. The smaller the Wasserstein distance, the larger the possibility of interaction.
Finally, we employ the square-exponential loss [19] to enable learning from the known pairwise interactions. Mathematically, the energy between a protein pair can be defined as $E_{ij} = W_2(z_i, z_j)$, where $z_i$ is the latent Gaussian distribution of protein $i$. Then, the loss function to be optimized is:
$$\mathcal{L} = \sum_{(i,j) \in \mathcal{P}} \; \sum_{(i,k) \in \mathcal{N}} \left( E_{ij}^{2} + \exp(-E_{ik}) \right) \tag{16}$$
where $\mathcal{P}$ represents the set of positive interactions and $\mathcal{N}$ represents the set of negative interactions. In our setting, the objective function penalizes the pairwise errors by the energy of the pairs, such that the energy of positive interactions is lower than the energy of negative interactions. Furthermore, this ranking formulation also maximizes the difference in energy between a positive pair $(i, j)$ and a negative pair $(i, k)$. Equivalently, this makes the possibility of interaction between interacting proteins larger than that between non-interacting proteins.
Finally, we optimize the parameters (i.e., weights and biases) of the model such that the loss is minimized and the pairwise rankings are satisfied. Specifically, for each protein, the distance to interacting proteins should be smaller than the distance to non-interacting proteins. We term this the ranking approach, since interacting proteins have smaller distances and are ranked higher than non-interacting proteins.
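The behaviour of the square-exponential loss can be sketched for a single anchor protein as follows. This is an illustrative plain-Python version, not our training code: the function name `square_exponential_loss` is hypothetical, and pairing every positive energy with every negative energy of the same anchor is an assumption of the sketch.

```python
import math


def square_exponential_loss(pos_energies, neg_energies):
    """Sum of E_pos^2 + exp(-E_neg) over all (positive, negative) pairs.

    `pos_energies` and `neg_energies` hold the distances from one anchor
    protein to its interacting and non-interacting partners respectively.
    """
    loss = 0.0
    for e_pos in pos_energies:
        for e_neg in neg_energies:
            # small positive energy and large negative energy are both rewarded
            loss += e_pos ** 2 + math.exp(-e_neg)
    return loss


# A correctly ranked anchor (close positive, distant negative) has a low loss.
good = square_exponential_loss([0.1], [5.0])
# A wrongly ranked anchor (distant positive, close negative) has a high loss.
bad = square_exponential_loss([5.0], [0.1])
```

The squared term pulls interacting proteins together, while the exponential term pushes non-interacting proteins apart with a penalty that decays as their distance grows.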
III-C PPI prediction
The Wasserstein distance between the latent Gaussian distributions of two protein sequences corresponds to the possibility of their interaction. However, predicting PPIs by computing the Wasserstein distance alone fails to account for homodimers, i.e., interactions between proteins with identical sequences [15]. The encoded Gaussian representations of identical sequences are the same, so their Wasserstein distance is 0, which would imply that such proteins must always interact.
To overcome this limitation, we define pairwise features for all protein pairs as the concatenation of the absolute element-wise differences of the means and variances and the element-wise products of the means of their Gaussian representations, $[\, \lvert \mu_i - \mu_j \rvert ;\ \lvert \sigma_i - \sigma_j \rvert ;\ \mu_i \odot \mu_j \,]$. This featurization is effective in modeling the symmetric relationship between proteins. To predict binary interactions, we train a binary classifier on these pairwise features to learn a decision boundary that separates interacting protein pairs from non-interacting pairs.
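A minimal sketch of this featurization, written in plain Python with lists standing in for the mean and variance vectors, is given below. The function name `pairwise_features` is an assumption for illustration.

```python
def pairwise_features(mu_i, var_i, mu_j, var_j):
    """Symmetric pair features: |mu_i - mu_j|, |var_i - var_j|, mu_i * mu_j."""
    abs_mu = [abs(a - b) for a, b in zip(mu_i, mu_j)]
    abs_var = [abs(a - b) for a, b in zip(var_i, var_j)]
    prod_mu = [a * b for a, b in zip(mu_i, mu_j)]
    return abs_mu + abs_var + prod_mu  # list concatenation


# Hypothetical two-dimensional Gaussian representations of two proteins.
mu_a, var_a = [1.0, -2.0], [0.5, 0.5]
mu_b, var_b = [0.5, 1.0], [0.25, 1.0]

features_ab = pairwise_features(mu_a, var_a, mu_b, var_b)
features_ba = pairwise_features(mu_b, var_b, mu_a, var_a)
features_aa = pairwise_features(mu_a, var_a, mu_a, var_a)  # homodimer case
```

Swapping the two proteins leaves the features unchanged, and for a homodimer the difference terms vanish while the product term still carries information about the protein, so the classifier is not forced to predict an interaction.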
III-D Efficient training
Siamese networks are suitable for training with the contrastive loss in Eq. 16. However, Siamese training becomes inefficient as the number of PPIs increases. In particular, the possible number of interactions among $n$ proteins is $n(n+1)/2$ (including self-interactions), which is computationally intensive for Siamese training. A mini-batch of interactions in Siamese training may contain multiple occurrences of the same protein, so that its sequence is passed through the encoder once for each interaction, even though a single forward pass per sequence is sufficient to compute the loss. To address this problem, we encode a mini-batch of protein sequences and retrieve the positive and negative interactions that involve them to compute the loss within the mini-batch. With this setting, we only need to feed forward $n$ protein sequences compared to up to $n(n+1)/2$ pairs in the Siamese setting, which makes our method computationally efficient and scalable to a large number of interactions.
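The idea of encoding each protein once per mini-batch can be sketched as follows. The function `encode_unique`, the protein identifiers, and the `encoder` callable are hypothetical stand-ins used for illustration; the actual encoder is the BiGRU model described above.

```python
def encode_unique(interactions, encoder):
    """Encode every protein appearing in a batch of interactions exactly once.

    `interactions` is a list of (protein_id, protein_id) pairs and `encoder`
    is any callable that maps a protein id to its representation.
    """
    unique_ids = sorted({pid for pair in interactions for pid in pair})
    embeddings = {pid: encoder(pid) for pid in unique_ids}  # one pass per protein
    paired = [(embeddings[a], embeddings[b]) for a, b in interactions]
    return paired, len(unique_ids)


def max_pairs(n):
    """Number of unordered pairs among n proteins, including self-pairs."""
    return n * (n + 1) // 2


calls = []


def counting_encoder(pid):
    """Toy encoder that records how often it is invoked."""
    calls.append(pid)
    return len(pid)


batch = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "A")]
paired, n_unique = encode_unique(batch, counting_encoder)
```

In this toy batch, four interactions require only three encoder calls, whereas a pairwise scheme would encode eight sequences.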
IV Experiments
We evaluate our method on two real-world datasets, yeast and human proteins, to predict their interactions. We use the area under the ROC curve (AUROC) and the average precision (AP) scores as the evaluation metrics. With these evaluation metrics, we expect a positive protein pair to have a higher interaction probability than a negative protein pair.
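For reference, both metrics can be computed from ranked scores as in the plain-Python sketch below. The function names are assumptions for this example, ties are handled only in the AUROC, and in practice a library implementation would normally be used instead.

```python
def auroc(labels, scores):
    """Probability that a random positive pair outscores a random negative pair."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # count wins, with ties contributing one half
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive pair."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Hypothetical labels (1 = interacting) and predicted interaction scores.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
auc = auroc(labels, scores)
ap = average_precision(labels, scores)
```

Both metrics depend only on the ordering of the scores, which is why they suit a model trained with a ranking objective.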
IV-A Experimental Setup
IV-A1 Datasets
The protein sequences of yeast and human proteins are obtained from the EMBL-EBI Reference Proteome [8]. The information about the subcellular localization of proteins is extracted from the UniProt database [7]. The evolutionary protein profiles for yeast and human protein sequences are collected from Rost Lab [14]. We evaluate our proposed model with two types of protein features: (a) amino acid sequences and (b) evolutionary protein profiles constructed from these sequences.
To evaluate the performance of deep learning models, the interaction datasets are downloaded from the up-to-date BioGRID interaction database (Release 3.5.169) [25]. The BioGRID database provides a large number of PPIs, allowing us to evaluate the scalability of different approaches as well. Only the interactions that correspond to physical binding between the protein pairs are considered, since these interactions are supported by experimental evidence. The negative samples are generated by randomly sampling from all protein sequence pairs that are not yet confirmed by experimental evidence. Furthermore, these negative interactions are filtered based on their subcellular localization, assuming that proteins in different locations are unlikely to interact, although some proteins do translocate [14].
Also, we perform cluster analysis with CD-HIT [20] to cluster protein sequences based on a similarity threshold that represents sequence identity. We remove interactions so that no two proteins have a pairwise sequence identity greater than 10%. Table I shows the statistics of the interaction datasets.

Table I: Statistics of the interaction datasets.

| Data | No. of proteins | No. of positive pairs | No. of negative pairs |
|---|---|---|---|
| Yeast | 3,651 | 50,344 | 50,376 |
| Human | 7,028 | 73,624 | 73,628 |
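| Method | Data | Yeast AUROC | Yeast AP | Human AUROC | Human AP |
|---|---|---|---|---|---|
| DPPI [15] | Profiles | 0.891 ± 0.004 | 0.857 ± 0.007 | 0.870 ± 0.004 | 0.835 ± 0.005 |
| PIPR [4] | Sequences | 0.909 ± 0.003 | 0.912 ± 0.004 | 0.878 ± 0.002 | 0.882 ± 0.003 |
| Our method (sparsemax), Ranking | Profiles | 0.882 ± 0.003 | 0.888 ± 0.002 | 0.884 ± 0.003 | 0.893 ± 0.004 |
| Our method (sparsemax), Ranking | Sequences | 0.901 ± 0.002 | 0.904 ± 0.002 | 0.881 ± 0.002 | 0.889 ± 0.001 |
| Our method (sparsemax), Random Forest | Profiles | 0.908 ± 0.002 | 0.913 ± 0.003 | 0.891 ± 0.005 | 0.896 ± 0.005 |
| Our method (sparsemax), Random Forest | Sequences | 0.924 ± 0.002 | 0.925 ± 0.001 | 0.887 ± 0.002 | 0.894 ± 0.001 |
| Our method (fusedmax), Ranking | Profiles | 0.882 ± 0.006 | 0.885 ± 0.006 | 0.873 ± 0.09 | 0.881 ± 0.01 |
| Our method (fusedmax), Ranking | Sequences | 0.898 ± 0.001 | 0.900 ± 0.002 | 0.874 ± 0.002 | 0.883 ± 0.001 |
| Our method (fusedmax), Random Forest | Profiles | 0.906 ± 0.004 | 0.912 ± 0.005 | 0.872 ± 0.015 | 0.877 ± 0.015 |
| Our method (fusedmax), Random Forest | Sequences | 0.919 ± 0.003 | 0.921 ± 0.002 | 0.881 ± 0.002 | 0.886 ± 0.001 |

Table II: AUROC and AP scores (mean ± standard deviation) over five independent runs for PPI prediction. * represents statistically significant differences with PIPR (P-value < 0.005).

IV-A2 Hyperparameters and training details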
For both datasets, we train a sequence encoder with the same configuration. The best hyperparameters of our model are selected based on validation performance. The maximum length of the input sequence to the encoder is 1024 for efficient training. In the datasets, 91.2% of yeast sequences and 86% of human sequences are shorter than 1024 residues.
The encoder consists of a BiGRU layer with 16 hidden units per direction, which maps a protein sequence to a sequence of 32-dimensional representations, one per amino acid. Then, this representation is encoded into a latent Gaussian distribution with a total dimension of 256 (128 for the mean and 128 for the variance of the Gaussian).
All the weight matrices of the encoder layer are initialized using Xavier initialization [11]. The model is trained on a single NVIDIA GeForce RTX 2080 Ti GPU for all experiments using the Adam optimizer [18] with learning rate 0.003 and the other default parameters provided by PyTorch [26]. Our approach of encoding each unique sequence once allows us to train the model efficiently even with large batch sizes. We empirically find that our model converges in a small number of iterations ( 50 for all shown experiments).
IV-B Results on PPI prediction
We compare our method against the state-of-the-art deep learning methods on the up-to-date BioGRID interaction datasets. We split the interactions into training, validation, and test sets (0.6:0.2:0.2). All the models are trained on the same training set, and the best set of hyperparameters is selected based on performance on the validation set. Finally, the models are evaluated on independent test sets. Table II reports the mean AUROC and AP and their standard errors over five independent runs. We perform a two-tailed Welch’s t-test and apply the Benjamini-Hochberg procedure to adjust the p-values, and find that the improvement over PIPR, the state-of-the-art method, is statistically significant.
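The Benjamini-Hochberg adjustment can be sketched in plain Python as follows. The function name `benjamini_hochberg` and the example p-values are assumptions for illustration; they are not the p-values obtained in our experiments.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # walk from the largest p-value down, enforcing monotone adjusted values
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values from four comparisons.
raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
```

Each raw p-value is scaled by the number of tests divided by its rank, and the running minimum keeps the adjusted values monotone in the raw ones.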
With the ranking approach, we expect our model to rank positive interactions higher than negative interactions, i.e., the probability of interaction between interacting protein pairs is greater than that between non-interacting protein pairs. Table II demonstrates that our model ranks positive interactions higher than negative interactions. The ranking-based model with sparsemax regularization achieves performance comparable to PIPR. Furthermore, the random forest classifier trained to account for homodimeric interactions improves the model’s performance on both datasets. The best parameters for the random forest classifier are selected via grid search.
Furthermore, we evaluate our proposed method on evolutionary protein profiles. Protein profiles constructed from protein sequences capture the correlation between different proteins as well as between different parts of the sequences [6]. Our model trained with protein profiles outperforms DPPI, the state-of-the-art deep architecture that uses profiles as the input to its deep Siamese model. Table II shows that our model with sequences achieves comparable or better performance than with profiles across both datasets. This demonstrates that our model is capable of extracting useful information about interactions from protein sequences alone, which avoids the expensive process of constructing profiles from sequences.
IV-C Ablation study of framework components
We next evaluate individual model components on the PPI prediction task with the yeast dataset.
IV-C1 Gaussian representations outperform point representations
We first explore whether the Gaussian representation of sequences improves the performance of the model over a point representation. We encode the intermediate representation of a sequence into a point representation instead of a Gaussian representation (as in Eq. 9 and 10) and use the L2 distance between the point representations of sequences $s_i$ and $s_j$ as the similarity measure instead of the Wasserstein distance (Eq. 12):
$$E_{ij} = \lVert e_i - e_j \rVert_2 \tag{17}$$
where $e_i$ represents the point representation of sequence $s_i$. Table III shows that Gaussian representations are better predictors of PPIs than point representations.
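Table III: Ablation study of model components on the yeast dataset.

| Model configuration | Gating | AUROC | AP |
|---|---|---|---|
| No gating | - | 0.880 ± 0.001 | 0.875 ± 0.003 |
| Point + RF | Softmax | 0.881 ± 0.001 | 0.877 ± 0.001 |
| Point + RF | Fusedmax | 0.909 ± 0.001 | 0.912 ± 0.002 |
| Point + RF | Sparsemax | 0.913 ± 0.001 | 0.916 ± 0.002 |
| Gaussian + RF | Softmax | 0.882 ± 0.001 | 0.879 ± 0.002 |
| Gaussian + RF | Fusedmax | 0.919 ± 0.003 | 0.921 ± 0.001 |
| Gaussian + RF | Sparsemax | 0.924 ± 0.002 | 0.925 ± 0.001 |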
IV-C2 Sparse gating mechanism improves performance
We further demonstrate the importance of the proposed sparse gating mechanism by comparing the performance of the sequence encoder trained with and without various gating mechanisms. Table III demonstrates that the sparse gating mechanisms provide a significant improvement in PPI prediction over both no gating and softmax gating. This indicates that sparse regularization helps the model selectively activate the stretches of amino acids that are important for modeling and predicting PPIs.
IV-C3 Dimension of Gaussian distribution is important
Finally, we investigate how the dimensionality of the latent Gaussian distributions can affect the model’s performance. Figure 2 shows the plot of the AUROC and the AUPR scores of our method across the two organisms. When the dimension of the Gaussian distribution increases from to , the performance also increases. When , two regularization strategies result in similar performance. Moreover, the performance also remains stable when .
IV-D Training time per epoch
Here, we compare the training time of our method and PIPR, the state-of-the-art deep learning model [4]. For a fair comparison, we train both models on the same machine and the same dataset and compare only the average training time per epoch in Figure 3. For this experiment, we randomly sample 8k, 16k, 24k, 32k, 40k, 48k, 56k, 64k, 72k, 80k, and 88k training interactions.

Figure 3 shows that our method is efficient in comparison to PIPR. PIPR uses a pairwise training process that requires a higher number of matrix multiplications for each interaction. In contrast, given a large batch of interactions, our model finds the unique set of protein sequences involved in these interactions and encodes them, which significantly reduces the number of matrix multiplications. Once we have the embeddings for these sequences, we can compute the loss based on their respective interactions. As discussed in Section III-D, if there are, for instance, 1000 interactions in a batch with 100 unique proteins, our model encodes only 100 protein sequences instead of 1000 protein pairs as in PIPR. Note that this approach allows our model to train on a large batch of interactions and thus takes less time to train. This demonstrates that our method scales with the number of interactions and is significantly faster than PIPR.
IV-E Interpretability
Since our proposed model selectively activates part of a given sequence, it is important to evaluate whether the selected parts are important. To explore this, we quantitatively evaluate how well the amino acids selected by the sparse gating mechanism align with the motifs from the Pfam motif library [9] obtained from GenomeNet [17]. The gate vector for a sequence helps us interpret how much the amino acid at each position, as represented by its GRU hidden state, contributes. Since our model computes gate values between 0 and 1 for each amino acid, we consider amino acids with $g_i > 0$ to be active, i.e., used by the model for the representation of proteins. Table IV shows the average percentage of amino acids selected by the sparse gating mechanism and their alignment with motifs that have biological significance. For instance, for the yeast dataset, only 19.24% of the amino acids (on average) are selected with fusedmax, and 59.05% of these selected amino acids align with a motif. This illustrates that the amino acids selectively activated by our model to learn the protein representation align with biologically interpretable motifs.
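Table IV: Average percentage of amino acids selected by the sparse gating mechanism and their alignment with motifs.

| Dataset | Gating | Selected amino acids (%) | Alignment with motifs (%) |
|---|---|---|---|
| Yeast | Sparsemax | 8.06 | 49.96 |
| Yeast | Fusedmax | 19.24 | 59.05 |
| Human | Sparsemax | 9.15 | 48.33 |
| Human | Fusedmax | 23.33 | 65.63 |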
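The two percentages reported for each setting can be computed per sequence as in the following sketch. The function name `gate_motif_alignment`, the 1-based inclusive motif ranges, and the toy gate vector are assumptions made for illustration.

```python
def gate_motif_alignment(gates, motif_ranges):
    """Return (% of positions selected, % of selected positions inside a motif).

    `gates` holds one gate value per residue; a residue is selected when its
    gate is strictly positive. `motif_ranges` is a list of 1-based inclusive
    (start, end) positions of annotated motifs.
    """
    selected = [i for i, g in enumerate(gates, start=1) if g > 0]
    in_motif = [i for i in selected
                if any(start <= i <= end for start, end in motif_ranges)]
    pct_selected = 100.0 * len(selected) / len(gates)
    pct_aligned = 100.0 * len(in_motif) / len(selected) if selected else 0.0
    return pct_selected, pct_aligned


# Toy sequence of 10 residues with two active gates and one motif at position 2.
toy_gates = [0.0, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
pct_selected, pct_aligned = gate_motif_alignment(toy_gates, [(2, 2)])
```

Averaging these two quantities over all sequences in a dataset gives values of the kind listed in Table IV.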
In addition, we visualize the activated amino acids and the motifs from Pfam motif library [9] from GenomeNet [17] in Figure 4. For this experiment, we select three proteins: LSM8, SMD2, and RPC11 from the yeast dataset with motifs in different parts of the sequences. Figure 2 and Table IV shows that the model trained with fusedmax achieves similar performance to sparsemax when but gains better alignment with the motifs from Pfam database. So, we train the model with fusedmax and obtain the gate values for each amino acid in these sequences. Red lines in each subfigure of Figure 4 are the important regions identified by our model.
In particular, LSM8, with a sequence of length 109, has two motifs: PF01423 (LSM domain) at positions 4 to 65, and PF14807 (adaptin AP4 complex epsilon appendage platform) at positions 9 to 65, as shown in Figure 4(a). The learned gate values select the subset of amino acids from position 1 to 65 and align with the motifs. Similarly, the selected parts of the sequences for SMD2 and RPC11 align with their motifs even though the motifs lie in different parts of the sequences. The quantitative and qualitative evaluation of gate vectors shows that our model successfully identifies important amino acids in the sequence for PPI prediction.
V Conclusion
We present a novel deep neural network to model and predict PPIs from variable-length protein sequences. Our proposed framework adopts a recurrent neural network to capture contextualized and sequential information from amino acid sequences. By incorporating a structured and sparse gating mechanism into the sequence encoder, our model successfully selects the important residues in the sequence to learn the sequence representation. Furthermore, our approach of encoding each sequence once to obtain its representation makes the model efficient and scalable to a large number of interactions. Extensive experimental evaluations on up-to-date datasets show its promising performance on the binary PPI prediction task. Case studies demonstrate the ability of our model to provide biological insights for interpreting the predictions.
Acknowledgment
This work was supported by the National Science Foundation [NSF1062422 to A.H.], [NSF1850492 to R.L.] and the National Institutes of Health [GM116102 to F.C.].
References
[1] (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic acids research 25 (17), pp. 3389–3402.
[2] (1973) Principles that govern the folding of protein chains. Science 181 (4096), pp. 223–230.
[3] (2018) Deep Gaussian embedding of graphs: unsupervised inductive learning via ranking. In Proc. of International Conference on Learning Representations.
[4] (2019-07) Multifaceted protein-protein interaction prediction based on Siamese residual RCNN. Bioinformatics 35 (14), pp. i305–i314.
[5] (2015) Fast and accurate deep network learning by exponential linear units (ELUs). In Proc. of International Conference on Learning Representations.
[6] (2019) Protein interaction networks revealed by proteome coevolution. Science 365 (6449), pp. 185–189.
[7] (2018) UniProt: a worldwide hub of protein knowledge. Nucleic acids research 47 (D1), pp. D506–D515.
[8] (2012) Toward community standards in the quest for orthologs. Bioinformatics 28 (6), pp. 900–904.
[9] (2014) Pfam: the protein families database. Nucleic acids research 42 (D1), pp. D222–D230.
[10] (1984) A class of Wasserstein metrics for probability distributions. The Michigan Mathematical Journal 31 (2), pp. 231–240.
[11] (2010) Understanding the difficulty of training deep feedforward neural networks. In Proc. of the thirteenth international conference on artificial intelligence and statistics, pp. 249–256.
[12] (2008) Using support vector machine combined with auto covariance to predict protein–protein interactions from protein sequences. Nucleic acids research 36 (9), pp. 3025–3030.
[13] (2015) Evolutionary profiles improve protein–protein interaction prediction from sequence. Bioinformatics 31 (12), pp. 1945–1950.
[14] (2015) More challenges for machine-learning protein interactions. Bioinformatics 31 (10), pp. 1521–1525.
[15] (2018-09) Predicting protein-protein interactions through sequence-based deep learning. Bioinformatics 34 (17), pp. i802–i810. ISSN 1367-4803.
[16] (2016) Neural network-based clustering using pairwise constraints. Proc. of International Conference on Learning Representations (ICLR) Workshop Track.
[17] (1997) Linking databases and organisms: GenomeNet resources in Japan. Trends in biochemical sciences 22 (11), pp. 442–444.
[18] (2015) Adam: a method for stochastic optimization. In Proc. of International Conference on Learning Representations.
[19] (2006) A tutorial on energy-based learning. Predicting Structured Data 1, pp. 0.
[20] (2006) Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22 (13), pp. 1658–1659.
[21] (2019) Fusing gene expressions and transitive protein-protein interactions for inference of gene regulatory networks. BMC systems biology 13 (2), pp. 37.
[22] (2004) Predicting protein–protein interactions using signature products. Bioinformatics 21 (2), pp. 218–226.
[23] (2016) From softmax to sparsemax: a sparse model of attention and multi-label classification. In Proc. of International Conference on Machine Learning, pp. 1614–1623.
[24] (2017) A regularized framework for sparse and structured neural attention. In Proc. of Advances in Neural Information Processing Systems, pp. 3338–3348.
[25] (2018) The BioGRID interaction database: 2019 update. Nucleic acids research 47 (D1), pp. D529–D541.
[26] (2017) Automatic differentiation in PyTorch. In NIPS-W.
[27] (2007) Predicting protein–protein interactions based only on sequences information. Proceedings of the National Academy of Sciences 104 (11), pp. 4337–4341.
[28] (2015) Are protein-protein interfaces special regions on a protein’s surface? The Journal of chemical physics 143 (24), pp. 12B631_1.
[29] (2010) Prediction of protein-protein interactions from protein sequence using local descriptors. Protein and Peptide Letters 17 (9), pp. 1085–1090.
[30] (2014) Prediction of protein-protein interactions from amino acid sequences using a novel multi-scale continuous and discontinuous feature set. In BMC bioinformatics, Vol. 15, pp. S9.
[31] (2018) Deep variational network embedding in Wasserstein space. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2827–2836.