A novel methodology on distributed representations of proteins using their interacting ligands

by Hakime Öztürk, et al.
Boğaziçi University

The effective representation of proteins is a crucial task that directly affects the performance of many bioinformatics problems. Related proteins usually bind to similar ligands. Chemical characteristics of ligands are known to capture the functional and mechanistic properties of proteins, suggesting that a ligand-based approach can be utilized in protein representation. In this study, we propose SMILESVec, a SMILES-based method to represent ligands, and a novel method to compute the similarity of proteins by describing them based on their ligands. The proteins are defined utilizing the word-embeddings of the SMILES strings of their ligands. The performance of the proposed protein description method is evaluated in a protein clustering task using the TransClust and MCL algorithms. Two other protein representation methods that utilize protein sequence, BLAST and ProtVec, and two compound fingerprint based protein representation methods are also compared. We show that ligand-based protein representation, which uses only SMILES strings of the ligands that proteins bind to, performs as well as protein-sequence based representation methods in protein clustering. The results suggest that ligand-based protein description can be an alternative to the traditional sequence or structure based representation of proteins, and that this novel approach can be applied to different bioinformatics problems such as prediction of new protein-ligand interactions and protein function annotation.





The aging population is putting drug design studies under pressure as the incidence of complex diseases increases. Multiple proteins from different protein families or protein networks are usually implicated in complex diseases such as cancer and cardiovascular, immune and neurodegenerative diseases [33, 18, 34]. Reliable representation of proteins plays a crucial role in the performance of many bioinformatics tasks such as protein family classification and clustering, prediction of protein functions, and prediction of protein-protein and protein-ligand interactions. Proteins are usually represented based on their sequences [10, 5, 19]. A recent study adapted Word2Vec [24], a widely used word-embeddings model in Natural Language Processing (NLP) tasks, to the genomic space to describe proteins as real-valued continuous vectors using their sequences, and utilized these vectors to classify proteins [2]. However, even though the structure of a protein is determined by its sequence, sequence alone is usually not adequate to completely understand its mechanism. Furthermore, the relationship between fold or architecture and function was shown to be weak, while a strong correlation was reported between architecture and bound ligand [23]. Semantic features such as functional categories and annotations, and Gene Ontology (GO) classes [25, 37, 7, 16] have been suggested to support the functional understanding of proteins. Nevertheless, these features are usually described in the form of binary vectors, which prevents the direct use of the provided information. Therefore, a novel approach that defines proteins by integrating functional characterizations can provide important information toward understanding and predicting protein structure, function and mechanism. Ligand-centric approaches are based on the chemical similarity of compounds that interact with similar proteins [32] and have been successfully adopted for tasks such as target fishing, off-target effect prediction and protein clustering [9, 35]. Grouping proteins by the chemical similarity of their interacting ligands resulted in both biologically and functionally related protein clusters [22, 28]. Motivated by these results, we propose to describe proteins using their interacting ligands.

In order to define a protein with a ligand-centric approach, the description of the ligand is critical. Ligands can be represented in many different forms including knowledge-based fingerprints, graphs, or strings. The Simplified Molecular Input Line Entry System (SMILES), which is a character-based representation of ligands, has been used for QSAR studies [36, 6] and protein-ligand interaction prediction [29, 21]. Even though it is a string-based representation, SMILES performed as well as powerful graph-based representation methods in protein-ligand interaction prediction and has proven to be computationally less expensive [29]. A recent study that employed a Recurrent Neural Network (RNN) based model to describe compound properties also used SMILES to predict chemical properties. However, such deep-learning based approaches require more computational power. An advantage of SMILES is that, being character based, it provides a promising environment for the adoption of NLP approaches. Distributed word representation models have been widely used in recent studies of NLP tasks, especially since the introduction of Word2Vec [24]. The model requires a large amount of text data to learn the representations of words and describe them in a low-dimensional space as real-valued vectors. These vectors capture the syntactic and semantic features of the words, e.g., the vectors of words with similar meanings are also similar.

In this study, we introduce SMILESVec, in which we adopt the word-embeddings approach to define ligands by utilizing their SMILES strings. Ligands are represented by learning features from a large SMILES corpus via Word2Vec [24], instead of using manually constructed ligand features as is done in fingerprint models. We then describe each protein using the average of its interacting ligand vectors that are built by SMILESVec. We followed an evaluation pipeline similar to the one presented in [4], in which the authors compared the performances of different clustering algorithms on the task of detecting remote homologous protein families. We measured how well the SMILESVec-based protein representation describes proteins within a protein clustering task by using two state-of-the-art clustering algorithms: Transitivity Clustering (TransClust) [41] and the Markov Clustering Algorithm (MCL) [14].

The performance of clustering using the SMILESVec-based protein representation was compared with that using the traditional BLAST, MACCS-based [40] and Extended Fingerprint-based protein representations, as well as the recently proposed distributed protein vector representation called ProtVec [2]. The ASTRAL data set (A-50) of the SCOPe database was used as the benchmark [8, 15].

The results showed that the representation of proteins with their ligands is a promising method with competitive F-scores in the protein clustering task, even though no sequence or structure information is used. SMILESVec can be an alternative approach to binary-vector based fingerprint models for ligand-representation. The ligand-based protein representation might be useful in different bioinformatics tasks such as identifying new protein-ligand interactions and protein function annotations.

Materials and Methods

Data set

The ASTRAL data sets are part of the Structural Classification of Proteins (SCOP) collection and are classified under folds, families and super-families [15]. A family denotes a group of proteins with typically distinct functionalities but also with high sequence similarities, whereas a super-family is a group of protein families with structural and functional similarities amongst families. The ASTRAL data sets are named based on the maximum sequence similarity of the proteins that they comprise. For instance, the ASTRAL 50 (A-50) data set includes proteins with at most 50% sequence similarity (http://scop.berkeley.edu/astral/subsets/ver=1.75&seqOption=1). In this study, we used the A-50 data set from SCOP version 1.75 to demonstrate the performance of the protein representation methods and considered clustering into families and super-families for evaluation. Families and super-families with a single protein were removed while preparing the data [4]. We used the same protein pairs that [4] used for A-50 to compute similarity scores.

Collection of Protein-Ligand Interactions

First, the corresponding UniProt identifiers were extracted for each protein in A-50 dataset using Bioservices Python package [11]. Then, the interacting ligands with their corresponding canonical SMILES were retrieved from ChEMBL using ChEMBL web services [12] (Data collected on Dec 30, 2017). The workflow of protein-ligand interaction extraction is illustrated in Figure 1. The collected interactions were used to build the proposed SMILESVec-based protein representations.

Figure 1: Extraction of protein-ligand interactions. As an example protein, Myosin Binding Protein C is provided as input with its corresponding SCOPe ID: d2yxma_.

Distributed Representation of Proteins and Ligands

The Word2Vec model, which is based on feed-forward neural networks, has been previously adopted to represent proteins using their sequences [2]. The approach, which we will refer to as ProtVec throughout the article, improved the performance for the protein classification problem. In this study, we used the Word2Vec model with the Skip-gram approach to consider the order of the surrounding words. In the biological context, we can use the string representations of proteins and ligands (e.g., FASTA sequence for proteins and SMILES for ligands) in textual format and define words as sub-sequences of these representations.

Figure 2 illustrates a sample protein sequence and its sequence lists (biological words) as well as a sample ligand SMILES and its corresponding sub-sequences (chemical words). The biological words, which are referred to as sequence lists, are created by splitting the sequence into non-overlapping three-character sub-sequences, starting from character indices 1, 2 and 3 in turn, which leads to three sequence lists [2]. The chemical words were created as 8-character-long overlapping substrings of SMILES with a sliding window approach. As shown in Figure 2, the SMILES string “C(C1CCCCC1)N2CCCC2” is divided into the following chemical words: “C(C1CCCC”, “(C1CCCCC”, “C1CCCCC1”, “1CCCCC1)”, “CCCCC1)N”, “CCCC1)N2”, … , “)N2CCCC2”. We performed several experiments in which the word size varied in the range of 4-12 characters, and 8-character chemical words obtained the best results.
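The word construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names are hypothetical, and two details are assumptions that the text does not state, namely that a SMILES shorter than the word length is kept as a single word and that trailing protein characters that do not fill a complete three-character word are dropped.

```python
def chemical_words(smiles, word_len=8):
    """Overlapping fixed-length substrings, sliding one character at a time."""
    if len(smiles) <= word_len:
        return [smiles]  # assumption: a short SMILES is kept as one word
    return [smiles[i:i + word_len] for i in range(len(smiles) - word_len + 1)]


def biological_words(sequence, word_len=3):
    """One list of non-overlapping words per starting offset (three lists for 3-mers)."""
    lists = []
    for offset in range(word_len):
        # assumption: an incomplete trailing word is dropped
        words = [sequence[i:i + word_len]
                 for i in range(offset, len(sequence) - word_len + 1, word_len)]
        lists.append(words)
    return lists


words = chemical_words("C(C1CCCCC1)N2CCCC2")
print(len(words), words[:4])

# "MKTAYIAKQR" is an arbitrary toy sequence used only for illustration.
for word_list in biological_words("MKTAYIAKQR"):
    print(word_list)
```

For the 18-character SMILES string from Figure 2, the sliding window yields 11 chemical words, beginning with “C(C1CCCC” and ending with “)N2CCCC2”.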

Figure 2: Representation of biological and chemical words.

With the use of the Word2Vec model we were able to describe complex structures using their simplified representations. For each subsequence (word) that was extracted from a protein sequence or ligand SMILES, Word2Vec produced a real-valued vector that is learned from a large training set. The vector learning is based on the context of each subsequence (i.e., its surrounding subsequences) and can detect important subsequences that usually occur in the same contexts. Therefore, with the help of the neural-network based nature of Word2Vec, every subsequence of a protein sequence or ligand SMILES was described in a semantically meaningful way. The Word2Vec model defined a vector representation for each of the 3-residue subsequences of the proteins. Protein vectors were constructed as the average of these subsequence vectors, as described in Equation 1, where $\mathbf{v}_k$ refers to the 100-dimensional real-valued vector for the $k$-th subsequence and $n$ is equal to the total number of subsequences that can be extracted from a protein sequence. For proteins, 550K protein sequences from UniProt were used to train Word2Vec with the skip-gram approach.

$$\text{protein vector} = \frac{1}{n}\sum_{k=1}^{n} \mathbf{v}_k \qquad (1)$$


Similarly, the Word2Vec model produced a real-valued vector for each SMILES word, and the corresponding ligand vector is constructed as the average of these SMILES word vectors, as described in Equation 2. Here $\mathbf{v}_k$ represents the Word2Vec output for the $k$-th 8-character-long subsequence of the SMILES string and $n$ indicates the total number of these SMILES subsequences (words). We will refer to ligand vectors as SMILESVec throughout the article. For learning, 1.7M canonical SMILES from the ChEMBL database (ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/) were retrieved [39]. The skip-gram approach with the vector size set to 100 was used.

$$\text{SMILESVec} = \frac{1}{n}\sum_{k=1}^{n} \mathbf{v}_k \qquad (2)$$
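As a rough sketch of the averaging step, the snippet below builds a ligand vector from a lookup table of word vectors. The table here is a hand-made toy with 3-dimensional vectors and 2-character words, standing in for the 100-dimensional Word2Vec output over 8-character words; the function names are hypothetical, and skipping words that are missing from the table is an assumption, since the text does not say how out-of-vocabulary words were handled.

```python
def chemical_words(smiles, word_len):
    """Overlapping fixed-length substrings of a SMILES string."""
    if len(smiles) <= word_len:
        return [smiles]
    return [smiles[i:i + word_len] for i in range(len(smiles) - word_len + 1)]


def average_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(dim)]


def smilesvec(smiles, embeddings, word_len=8):
    """Ligand vector: mean of the embeddings of its chemical words."""
    # assumption: words without a learned embedding are skipped
    vectors = [embeddings[w] for w in chemical_words(smiles, word_len) if w in embeddings]
    if not vectors:
        raise ValueError("no chemical word of this SMILES has an embedding")
    return average_vector(vectors)


# Toy embedding table (hypothetical values, not learned from any corpus).
toy_embeddings = {"CC": [1.0, 0.0, 2.0], "CO": [3.0, 2.0, 0.0]}
print(smilesvec("CCO", toy_embeddings, word_len=2))
```

In practice the lookup table would come from a trained skip-gram model with vector size 100, as described in the text.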


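We also used the Word2Vec model to learn embeddings for the characters in the SMILES alphabet. Therefore, instead of word-level embeddings, we created char-level embeddings for the unique characters that appear in the SMILES strings of the data set (58 characters). Equation 3 describes the resulting ligand vector, where $\mathbf{c}_k$ is the embedding of the $k$-th character and $n$ in this case represents the total number of characters in a SMILES string.

$$\text{SMILESVec}_{\text{char}} = \frac{1}{n}\sum_{k=1}^{n} \mathbf{c}_k \qquad (3)$$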


We further investigated an important aspect of working with the SMILES representation: a single molecule can have several valid SMILES strings. Canonicalization algorithms were devised to generate a unique SMILES for a molecule, but the existence of different canonicalization algorithms reintroduces diversity. Thus, it is not surprising that the canonical SMILES definition can differ from database to database. ChEMBL uses Accelrys’s Pipeline Pilot, which uses an algorithm derived from Daylight’s [30], whereas Pubchem uses OpenEye software [26] for canonical SMILES generation [3]. The most evident difference between the canonical SMILES of the two databases is that ChEMBL includes isomeric information, whereas Pubchem does not. Therefore, even though we collected the SMILES of the interacting ligands from the ChEMBL database, we experimented with learning chemical words and characters from the ChEMBL and Pubchem canonical SMILES corpora, both separately and together (combined).

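We can also represent a protein or ligand vector as the output of the maximum or minimum functions. Let $n$ be the total number of subsequences that are created from the protein or ligand sequence, $\mathbf{v}_1, \dots, \mathbf{v}_n$ their vectors, and $d$ the dimensionality of the vectors (i.e., the number of features). $\mathrm{min}_i$ represents the minimum value of the $i$-th feature among the $n$ subsequence vectors (Equation 4). To obtain the minimum protein vector, $\mathrm{min}_i$ is selected for each feature as defined in Equation 5. Similarly, $\mathrm{max}_i$ represents the maximum value of the $i$-th feature among the $n$ subsequence vectors (Equation 6), and the maximum protein vector is created as in Equation 7 for $d$ features. The concatenation of these minimum and maximum protein vectors results in a vector with twice the dimensionality of the original vectors [13]. The min/max representation is described in Equation 8.

$$\mathrm{min}_i = \min(v_{1,i}, \dots, v_{n,i}) \qquad (4)$$
$$\text{protein vector}_{\min} = [\mathrm{min}_1, \dots, \mathrm{min}_d] \qquad (5)$$
$$\mathrm{max}_i = \max(v_{1,i}, \dots, v_{n,i}) \qquad (6)$$
$$\text{protein vector}_{\max} = [\mathrm{max}_1, \dots, \mathrm{max}_d] \qquad (7)$$
$$\text{protein vector}_{\min/\max} = [\text{protein vector}_{\min}, \text{protein vector}_{\max}] \qquad (8)$$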
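The min/max combination can be illustrated with a short sketch. The function name and the toy 2-dimensional vectors are hypothetical; the only behavior taken from the text is that the per-feature minima and maxima are concatenated, doubling the dimensionality.

```python
def minmax_vector(vectors):
    """Concatenate the per-feature minima and maxima of a set of vectors."""
    dim = len(vectors[0])
    minima = [min(v[i] for v in vectors) for i in range(dim)]
    maxima = [max(v[i] for v in vectors) for i in range(dim)]
    return minima + maxima  # length is 2 * dim


# Three toy subsequence vectors with two features each.
subsequence_vectors = [[1.0, 5.0], [3.0, 2.0], [2.0, 4.0]]
print(minmax_vector(subsequence_vectors))
```

With 100-dimensional subsequence vectors, the resulting protein vector would have 200 features.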


Protein Similarity Computation

We used BLAST and ProtVec-based methods as baseline to compare to the ligand-centric protein representation that we proposed.


The Basic Local Alignment Search Tool (BLAST) reports the similarity between protein sequences using local alignment [1]. For the ASTRAL data sets, we used both BLAST sequence identity values and BLAST e-values that were previously obtained [4] with all-versus-all BLAST with an e-value threshold of 100.

Word Frequency-based Protein Similarity

The word frequency-based protein similarity method uses three-character protein words that are created as explained in Section Distributed Representation of Proteins and Ligands. However, instead of a learning process, we simply count the occurrences of the protein words that appear in a protein sequence. In order to compute the similarity between two proteins, we used the formula depicted in Equation 9 [38]:


where $n$ is the total number of unique words created from protein sequences $A$ and $B$, $f_{A,i}$ is the frequency of words of type $i$ in protein $A$, and $f_{B,i}$ is the frequency of words of type $i$ in protein $B$.

ProtVec-based Protein Similarity

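In ProtVec-based clustering, protein vectors were constructed as defined in Section Distributed Representation of Proteins and Ligands, either with the average or the min/max method. The cosine similarity function was used to compute the similarity between two protein vectors $\mathbf{A}$ and $\mathbf{B}$ as in Equation 10, where $d$ denotes the size (dimensionality) of the vectors.

$$\cos(\mathbf{A}, \mathbf{B}) = \frac{\sum_{i=1}^{d} A_i B_i}{\sqrt{\sum_{i=1}^{d} A_i^2}\,\sqrt{\sum_{i=1}^{d} B_i^2}} \qquad (10)$$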
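A minimal sketch of the cosine similarity between two vectors is given below. The function name is hypothetical, and returning 0.0 for an all-zero vector is an assumption made only to avoid a division by zero; the text does not discuss that case.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equally sized vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: similarity with a zero vector is defined as 0
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # parallel vectors
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
```

Because the measure depends only on direction, scaling either vector leaves the similarity unchanged.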


SMILESVec-based Protein Similarity

First, the ligand vectors were constructed with the SMILESVec approach described in Section Distributed Representation of Proteins and Ligands. Then each protein was represented as the average of the vectors of the ligands it interacts with. Equation 11 describes the construction of a protein vector from its binding ligands, where $\text{SMILESVec}_k$ represents the vector of the $k$-th ligand and $n$ represents the total number of ligands that the protein interacts with.

$$\text{protein vector} = \frac{1}{n}\sum_{k=1}^{n} \text{SMILESVec}_k \qquad (11)$$


Similarly, protein similarity is computed using the cosine similarity function.
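Putting the two steps together, the sketch below averages the ligand vectors of each protein and then lists the cosine similarity of every protein pair, which is the kind of pairwise score a clustering tool would consume. The protein names, the 2-dimensional ligand vectors and the function names are all hypothetical.

```python
import math
from itertools import combinations


def average_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def cosine_similarity(a, b):
    """Cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def protein_similarities(ligand_vectors_by_protein):
    """Return (protein_a, protein_b, similarity) for every protein pair."""
    protein_vectors = {name: average_vector(vectors)
                       for name, vectors in ligand_vectors_by_protein.items()}
    return [(a, b, cosine_similarity(protein_vectors[a], protein_vectors[b]))
            for a, b in combinations(sorted(protein_vectors), 2)]


# Toy ligand vectors per protein (hypothetical values).
toy = {
    "protein_A": [[1.0, 0.0], [0.0, 1.0]],
    "protein_B": [[1.0, 1.0]],
    "protein_C": [[1.0, 0.0]],
}
for a, b, similarity in protein_similarities(toy):
    print(a, b, round(similarity, 3))
```

Note that a protein with thousands of ligands and a protein with a single ligand are both reduced to one vector of the same size, so the number of ligands does not appear in the similarity directly.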

Fingerprint-based Protein Similarity

We used two popular fingerprint-based compound representation methods as an alternative to SMILESVec, namely MACCS and Extended Fingerprint. Chemistry Development Kit descriptors were used to build the MACCS and Extended fingerprints of the ligands [40]. Fingerprints are binary (absence/presence) vector representations of ligands in which each bit refers to a chemical feature such as a specific substructure or ring; MACCS and Extended Fingerprint encode 166 and 1024 bits, respectively. The proteins were represented as described in Equation 12, in which the fingerprint $\text{fingerprint}_k$ was used to represent the $k$-th of the $n$ interacting ligands.

$$\text{protein vector} = \frac{1}{n}\sum_{k=1}^{n} \text{fingerprint}_k \qquad (12)$$


Fingerprints were used in order to compare a knowledge-based ligand description with a data-driven approach (SMILESVec).

SMILES word frequency-based Protein Similarity

For each interacting ligand of a protein, 8-character-long SMILES words were created as explained in Section Distributed Representation of Proteins and Ligands. Then the similarity between two proteins was computed as in Equation 9 using the collection of chemical words of their respective interacting ligands.

Clustering Algorithms

We evaluated the effectiveness of the different protein representation approaches for the task of protein clustering. Transitivity Clustering (TransClust), which has been shown to produce the best F-measure score amongst several other algorithms in protein clustering [4] and the commonly used Markov Clustering Algorithm (MCL) were used as the protein clustering algorithms.

Transitivity Clustering (TransClust)

TransClust is a clustering method that is based on the weighted transitive graph projection problem [41]. The main idea behind TransClust is to construct transitive graphs by adding or removing edges from an intransitive graph using a weighted cost function. The weighted cost function is calculated as the distance between a user-defined threshold and a pairwise similarity function. TransClust connects two proteins in the network if their similarity is greater than the user-defined threshold. The graph is then modified by adding or removing edges until it becomes a disjoint union of cliques [4].

Markov Clustering Algorithm (MCL)

MCL is a network clustering algorithm that uses the weights of the edges (flows) in the network [14] to build a flow matrix of the network. The algorithm iterates over this matrix, alternating expansion and inflation steps. The inflation parameter controls the granularity of the clustering, that is, how homogeneous or heterogeneous the resulting clusters are. We used the default value (2.0) of the inflation parameter in MCL.
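To make the role of the inflation parameter concrete, here is a compact, unoptimized sketch of the standard MCL iteration: expansion (squaring the column-stochastic flow matrix) followed by inflation (raising entries to a power and renormalizing columns). It is not the MCL software used in the study; the function names, the self-loop weight of 1.0, the convergence tolerance and the 1e-6 cut-off for reading out clusters are all assumptions of this sketch.

```python
def normalize_columns(matrix):
    """Scale every column so that it sums to 1."""
    n = len(matrix)
    sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    return [[matrix[i][j] / sums[j] for j in range(n)] for i in range(n)]


def expand(matrix):
    """Expansion step: square the flow matrix."""
    n = len(matrix)
    return [[sum(matrix[i][k] * matrix[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mcl(weights, inflation=2.0, max_iter=100, tol=1e-9):
    """Cluster nodes of a weighted graph given as a symmetric matrix."""
    n = len(weights)
    # assumption: every node gets a self-loop of weight 1.0
    flow = [[1.0 if i == j else float(weights[i][j]) for j in range(n)] for i in range(n)]
    flow = normalize_columns(flow)
    for _ in range(max_iter):
        # Inflation step: element-wise power, then column renormalization.
        inflated = normalize_columns([[x ** inflation for x in row] for row in expand(flow)])
        change = max(abs(inflated[i][j] - flow[i][j]) for i in range(n) for j in range(n))
        flow = inflated
        if change < tol:
            break
    # Each row that still carries flow lists the members of one cluster.
    clusters = {frozenset(j for j in range(n) if flow[i][j] > 1e-6) for i in range(n)}
    return sorted(sorted(c) for c in clusters if c)


# Toy graph: a path 0-1-2 and a separate edge 3-4.
toy_weights = [
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]
print(mcl(toy_weights))
```

Larger inflation values sharpen the flow more aggressively in each iteration and therefore tend to produce more, smaller clusters.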


In order to evaluate the performance of the proposed methods, we utilized the F-measure, precision and recall metrics. These metrics are widely used in the evaluation of classification methods. To adapt these metrics to the assessment of the clustering task, we followed the formulation explained by Bernardes and co-workers.


For a data set of $N$ proteins, let us assume $n_i$ represents the number of proteins that belong to the $i$-th family or class, $n_j$ is the number of proteins that are placed in the $j$-th cluster, and $n_{ij}$ represents the number of proteins that belong to the $i$-th family and are placed in the $j$-th cluster. The precision of cluster $j$ with respect to family $i$ is computed as $P_{ij} = n_{ij}/n_j$, whereas recall is defined as $R_{ij} = n_{ij}/n_i$. Finally, we can define the F-measure as in Equation 13:

$$\text{F-measure} = \frac{1}{N}\sum_{i} n_i \max_{j} \frac{2\,P_{ij}\,R_{ij}}{P_{ij} + R_{ij}} \qquad (13)$$

$\max_j$ indicates that for each family $i$, we compute precision and recall values for each corresponding cluster, and choose the maximum score.
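The clustering F-measure can be sketched as follows. This is an illustrative implementation under one stated assumption: the per-family best scores are combined as an average weighted by family size over all proteins, which is the usual form of this measure; the function name and the toy labels are hypothetical.

```python
from collections import Counter


def clustering_f_measure(families, clusters):
    """F-measure of a clustering against reference family labels.

    families[k] and clusters[k] are the family and cluster of protein k.
    """
    total = len(families)
    family_sizes = Counter(families)
    cluster_sizes = Counter(clusters)
    overlaps = Counter(zip(families, clusters))

    best = {family: 0.0 for family in family_sizes}
    for (family, cluster), shared in overlaps.items():
        precision = shared / cluster_sizes[cluster]
        recall = shared / family_sizes[family]
        score = 2 * precision * recall / (precision + recall)
        best[family] = max(best[family], score)

    # assumption: families are weighted by their size
    return sum(family_sizes[f] * best[f] for f in family_sizes) / total


toy_families = ["a", "a", "b", "b"]
toy_clusters = [0, 0, 0, 1]
print(round(clustering_f_measure(toy_families, toy_clusters), 4))
```

In the toy example, family "a" is best matched by cluster 0 (precision 2/3, recall 1) and family "b" by cluster 1 (precision 1, recall 1/2), so neither family is recovered perfectly.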


We evaluated the performance of five different protein similarity computation approaches in clustering of the A-50 dataset. The similarity approaches were BLAST, ProtVec, SMILESVec, MACCS, and Extended Fingerprint, the first two of which are protein sequence based similarity methods, whereas the latter three utilize the ligands to which proteins bind. We took word-frequency based protein similarity methods that use protein sequences and compound SMILES strings, respectively, as the baseline. Average (avg) and minimum/maximum (min/max) of the vectors were taken to build combined vectors for ProtVec and SMILESVec from their subsequence vectors.

We performed our experiments on the A-50 data set using two different clustering algorithms, TransClust and MCL. The ligand-based (SMILESVec, MACCS and Extended Fingerprint) protein representation approaches require a protein to bind to at least one ligand in order to define a ligand-based vector for that protein. Therefore, we removed the proteins with no ligand binding information from the data set. Table 1 provides a summary of the A-50 data set before and after filtering.

Data set   Filtering          Num. sequences   Super-families   Families
A-50       Before filtering   10816            1080             2109
A-50       After filtering    1639             425              652
Table 1: Distribution of families and super-families in the A-50 data set before and after filtering.

Table 2 summarizes the top-10 most frequent families and super-families before and after filtering. We can observe that fewer than half of the frequent families and super-families remained in the top-10 list, such as the Immunoglobulin (b.1.1) and Fibronectin type III (b.1.2) super-families and their descendants, the Immunoglobulin I set (b.1.1.4) and Fibronectin type III (b.1.2.1) families, respectively. Super-families and families that were not initially in the top-10 list, such as the Protein kinase-like (d.144.1) super-family and the Nuclear receptor ligand-binding domain (a.123.1) super-family and their respective descendant families, also entered the set of frequent proteins when ligand interactions were taken into account.

In the filtered data set, in which all proteins have an interacting ligand, there were 1057 proteins with fewer than 200 ligands (about 64% of all proteins), 101 of which were proteins with a single ligand (about 6% of all proteins). There were 67 proteins with more than 10000 interacting ligands (about 4%), which raised the average number of interacting ligands to 1791. The protein with the highest number of interacting ligands was d2dpia2 (DNA polymerase iota), a protein involved in DNA repair [20] and implicated in esophageal squamous cell cancer [43] and breast cancer [42], with 115018 ligands.

Before filtering After filtering
Super-family # prots. Family # prots. Super-family # prots. Family # prots.
P-loop containing nucleoside triphosphate hydrolases
Fibronectin type III
Protein kinase-like
Protein kinases, catalytic subunit
Rossmann-fold domain
Tyrosine-dependent oxidoreductases
P-loop containing nucleoside triphosphate hydrolases
Fibronectin type III
Canonical RNA-binding domain
Eukaryotic proteases
“Winged helix” DNA-binding domain
Immunoglobulin I set
Rossmann-fold domain
EGF-type module
G proteins
Trypsin-like serine proteases
Immunoglobulin I set
Immunoglobulin V set
Fibronectin type III
SH2 domain
Nucleic acid-binding proteins
Classic zinc finger, C2H2
Nuclear receptor ligand-binding domain
Phosphate binding protein-like
SH2 domain
Fibronectin type III
PDZ domain
Cysteine proteinases
Pleckstrin-homology domain
N-acetyl transferase, NAT
Nuclear receptor ligand-binding domain
Tyrosine-dependent oxidoreductases
Table 2: Summary of the top-10 frequent families and super-families in the A-50 data set of SCOPe, given with the family name and the number of proteins that belong to them.

Finally, we assessed the performance of the clustering algorithms with F-measure values for two different clustering scenarios, family and super-family clustering. TransClust requires a user-defined threshold to identify clusters. Therefore, in order to choose the best threshold value, we computed the F-measure values over the similarity threshold range [0, 1] with a step size of 0.001 for the similarity computation methods whose output lies in the range 0-1. For BLAST, the range [0, 100] with a step size of 0.05 was tested for the similarity threshold. We chose the similarity thresholds that gave the best F-measure for super-family and family clustering to decide the final clusters.

Table 3 reports the F-measure values for family and super-family clustering and the number of clusters that are detected with TransClust and MCL algorithms, respectively.

Between TransClust and MCL, TransClust produced better F-measure values in all representation methods on A-50 data set. The results obtained by both clustering algorithms were better in family clustering than in super-family clustering, which was an expected outcome since detection of distantly related proteins is a much harder task.

Both clustering algorithms relied on similarity scores in order to group proteins. Among the protein sequence-based similarity methods, the poorest clustering performance in super-family/family clustering (0.350/0.500) belonged to BLAST with e-value, the baseline. Protein word frequency (0.686/0.744) obtained the best performance on the A-50 data set in super-family and family clustering, respectively. The performances of ProtVec Avg (0.681/0.739) and the ligand-based protein representation methods followed the best result closely. Although ProtVec-based similarity (avg and minmax) brings in a semantic aspect through learning with the Word2Vec model, it was outperformed by the straightforward word-frequency based approach.

The results also showed that the average-based combination method (ProtVec avg) was better than the min/max-based combination method (ProtVec minmax) for building a single protein vector from subsequence vectors in the protein clustering task. Since the min/max-based combination method did not perform well in sequence-based protein similarity, we did not test the technique for the SMILES-based protein similarity approaches.

                                      TransClust                                        MCL
                                      Super-family             Family                   Super-family             Family
Method                      Data set  No. clusters  F-measure  No. clusters  F-measure  No. clusters  F-measure  No. clusters  F-measure
Protein sequence based
BLAST (e-val)               A-50      1596          0.350      1636          0.500      728           0.290      728           0.379
BLAST (identity)            A-50      606           0.595      660           0.631      783           0.540      783           0.592
Protein word frequency      A-50      708           0.686      688           0.744      411           0.590      411           0.606
ProtVec Avg (word)          A-50      655           0.681      704           0.739      1001          0.596      1001          0.665
ProtVec Avg (char)          A-50      707           0.674      707           0.729      1017          0.590      1017          0.662
ProtVec MinMax (word)       A-50      586           0.667      704           0.718      1014          0.590      1014          0.662
Ligand based
SMILES word frequency       A-50      801           0.624      957           0.704      312           0.470      312           0.475
SMILESVec (word, chembl)    A-50      621           0.677      730           0.735      867           0.608      867           0.667
SMILESVec (word, pubchem)   A-50      573           0.668      692           0.730      857           0.604      857           0.664
SMILESVec (word, combined)  A-50      617           0.675      764           0.735      894           0.607      894           0.668
SMILESVec (char, chembl)    A-50      636           0.678      710           0.729      999           0.596      999           0.668
SMILESVec (char, pubchem)   A-50      714           0.671      715           0.729      977           0.595      977           0.667
SMILESVec (char, combined)  A-50      712           0.675      712           0.739      1006          0.595      1006          0.669
MACCS                       A-50      589           0.679      683           0.736      874           0.606      874           0.667
Extended Fingerprint        A-50      607           0.680      756           0.732      744           0.609      744           0.655
Table 3: Performances of the TransClust and MCL algorithms in super-family and family clustering for all protein similarity computation methods, with F-measure values.

Among the ligand-based representation methods, we examined the performance of the word-based embeddings and character-based embeddings as well as the effect of the source of the training data set on the embeddings. We collected canonical SMILES from both the ChEMBL (1.7M) and Pubchem (2.3M) databases. The SMILES strings of the interacting ligands were only collected from ChEMBL, as explained in Section Collection of Protein-Ligand Interactions. The main difference between these two databases is that ChEMBL allows the isomeric information of the molecule to be encoded within SMILES. The results indicated that the choice of the training set for embedding learning matters where SMILES is concerned. In our case, since the SMILES of the interacting ligands of the A-50 data set were collected from the ChEMBL database, SMILESVec performed better when the embeddings were learned from ChEMBL SMILES rather than Pubchem SMILES (for example, 0.677 versus 0.668 in super-family clustering with word-based embeddings and TransClust).

We also investigated whether using the combination of the ChEMBL and Pubchem SMILES corpora can improve the performance of the SMILESVec embeddings. We observed an improvement for character-based embeddings in family clustering (0.739), whereas the word-based embeddings produced F-measure values higher than Pubchem-based learning and no higher than ChEMBL-based learning. We can suggest that the increase in the performance of character-based learning with the combination of two different SMILES corpora might be positively correlated with the increase in SMILES samples, while the number of unique characters that appear in the SMILES did not significantly change between databases (e.g., absence/presence of the few characters that represent isomeric information). However, with word-based learning, we observed a significant increase in the variety of the chemical words, so the combined SMILES corpus model did not work as well as it did in character-based learning. This result suggests that the size of the learning corpus may affect the quality of the embeddings, and that a larger SMILES corpus could lead to better character-based embeddings for SMILESVec.

Considering only ChEMBL-trained SMILESVec, we observed that, although the two approaches produced comparable scores, the word-based approach was better than the character-based approach in terms of F-measure in family clustering. In super-family clustering, however, the character-based approach performed as well as word-based SMILESVec. Similarly, ProtVec also performed better at the word level than at the character level.

The ligand-based protein representation methods, SMILESVec and the MACCS-based approach, performed almost as well as ProtVec in family and super-family clustering with the TransClust algorithm, even though no protein sequence information was used. With MCL, a lower clustering performance was obtained compared to TransClust, and both SMILESVec and the MACCS-based method produced slightly better F-measure values than ProtVec Avg in both super-family and family clustering. We suggest that, since ligand-based protein representation methods capture indirect function information through ligand binding, they were recognizably better at detecting super-families than families compared to sequence-based ProtVec on a relatively distant data set. Furthermore, SMILESVec, a text-based unsupervised learning model, produced F-measure scores comparable to those of MACCS and Extended fingerprints, which are binary vectors based on human-engineered feature descriptions.
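For readers who want to reproduce this kind of comparison, the following is a minimal sketch of one common formulation of a clustering F-measure: for each reference family, take the best F1 score over all predicted clusters and weight it by family size. The exact definition used in this study is given in an earlier section and may differ; the function name and toy labels are hypothetical.

```python
from collections import defaultdict


def clustering_f_measure(true_labels, cluster_labels):
    """Size-weighted best-match F1 between reference classes and clusters."""
    classes = defaultdict(set)
    clusters = defaultdict(set)
    for idx, (true, pred) in enumerate(zip(true_labels, cluster_labels)):
        classes[true].add(idx)
        clusters[pred].add(idx)

    n = len(true_labels)
    total = 0.0
    for members in classes.values():
        best = 0.0
        for cluster in clusters.values():
            overlap = len(members & cluster)
            if overlap == 0:
                continue
            precision = overlap / len(cluster)
            recall = overlap / len(members)
            best = max(best, 2 * precision * recall / (precision + recall))
        total += len(members) / n * best
    return total


# Toy example: four proteins from two families, split across three clusters.
families = ["fam1", "fam1", "fam2", "fam2"]
predicted = [0, 0, 1, 2]
print(clustering_f_measure(families, predicted))
```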

Table 4 reports the Pearson correlations [31] among the protein similarity computation methods. Comparison with BLAST e-value resulted in negative correlation, as expected, since e-values closer to zero indicate a closer match (higher similarity). Ligand-based protein representation methods had higher correlation values with BLAST e-value than protein-sequence based methods. We also observed strong correlation among the ligand-based protein representation methods, suggesting that, regardless of the ligand representation approach, the use of interacting ligands to represent proteins provides similar information.

Method 1 | Method 2 | Pearson correlation
--- | --- | ---
BLAST (e-value) | BLAST (identity) | -0.109
BLAST (e-value) | Protein word frequency | -0.250
BLAST (e-value) | ProtVec (avg) | -0.291
BLAST (e-value) | SMILESVec (word, chembl) | -0.335
BLAST (e-value) | SMILESVec (char, chembl) | -0.207
BLAST (e-value) | MACCS | -0.336
SMILESVec (word, chembl) | MACCS | 0.895
SMILESVec (char, pubchem) | MACCS | 0.590
SMILESVec (word, chembl) | SMILESVec (char, pubchem) | 0.682
SMILESVec (word, chembl) | Extended Fp. | 0.938
Extended Fp. | MACCS | 0.937

Table 4: Pearson correlation between protein similarity methods
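As an illustration of how such a correlation can be computed, the sketch below evaluates the Pearson coefficient between two lists of scores for the same protein pairs. The function name and the toy scores are hypothetical and are not taken from the data used for Table 4.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (norm_x * norm_y)


# Toy scores for the same five protein pairs under two methods (made-up values).
e_values = [0.001, 0.5, 2.0, 8.0, 10.0]             # lower means more similar
ligand_similarity = [0.91, 0.74, 0.55, 0.32, 0.40]  # higher means more similar
print(pearson(e_values, ligand_similarity))         # negative for these toy values
```

Because a low e-value indicates a close match while a high similarity score does, a negative coefficient is the expected sign when the two methods agree.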

We further investigated a case in which similar super-family clusters were produced with SMILESVec-based protein similarity and ProtVec protein similarity using the TransClust algorithm. We observed that the Fibronectin Type III proteins (7 proteins) were clustered together when SMILESVec was used, whereas ProtVec placed them into four different clusters: one cluster contained four of those proteins, another cluster contained a single protein, and the other two proteins were part of other clusters. The protein that was clustered by itself (SCOPe ID: d1n26a3, Human Interleukin-6 Receptor alpha chain) had two interacting ligands (CHEMBL81; Raloxifene and CHEMBL46740; Bazedoxifene) that were also shared by a protein (SCOPe ID: d1bqua2, Cytokine-binding region of GP130) clustered separately with ProtVec. Thus, we suggest that, by using information on common interacting ligands, SMILESVec succeeded in combining these seven proteins into a single cluster, while ProtVec failed to do so with a sequence-based approach.


In this study, we first propose a ligand-representation method, SMILESVec, which uses a word embeddings model. Then, we represent proteins using their interacting ligands. In this approach, the interacting ligands of each protein in the data set are collected. Then, the SMILES string of each ligand is divided into fixed-length overlapping substrings. These created substrings are then used to build real-valued vectors with the Word2vec model and then the vectors are combined into a single vector to represent the whole SMILES string. Finally, protein vectors are constructed by taking the average of the vectors of their ligands. The effectiveness of the proposed method in describing the proteins was measured by performing clustering on ASTRAL 50 (A-50) dataset from the SCOPe database using two different clustering algorithms, TransClust and MCL. Both of these clustering algorithms use protein similarity scores to identify cliques. SMILESVec based protein representation was compared with other protein representation methods, namely BLAST and ProtVec, both of which depend on protein sequence to measure protein similarity, and the MACCS and Extended Fingerprint binary fingerprint based ligand-centric protein representation approaches. The performance of the clustering algorithms, as reported by F-measure, showed that protein word-frequency based similarity model was a better alternative to BLAST e-value or sequence identity to measure protein similarity. Furthermore, ligand-based protein representation methods also produced comparable F-measure scores to ProtVec.
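The pipeline summarised above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used in the study: `toy_embedding` is a hypothetical stand-in for a lookup into a trained Word2vec model, the word length of 8 and the dimensionality of 4 are illustrative, and the word vectors of a SMILES string are combined here by averaging.

```python
import random

DIM = 4  # toy embedding dimensionality (assumption for illustration)


def chemical_words(smiles, k=8):
    """Split a SMILES string into fixed-length overlapping substrings."""
    if len(smiles) <= k:
        return [smiles]
    return [smiles[i:i + k] for i in range(len(smiles) - k + 1)]


def toy_embedding(word, dim=DIM):
    """Hypothetical stand-in for a trained Word2vec lookup.

    Returns a deterministic pseudo-random vector seeded by the word itself.
    """
    rng = random.Random(word)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def mean_vector(vectors):
    """Element-wise average of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def smiles_vector(smiles, k=8):
    """Ligand vector: average of the vectors of its chemical words (assumed)."""
    return mean_vector([toy_embedding(w) for w in chemical_words(smiles, k)])


def protein_vector(ligand_smiles, k=8):
    """Protein vector: average of the vectors of its interacting ligands."""
    return mean_vector([smiles_vector(s, k) for s in ligand_smiles])


# Hypothetical protein with two interacting ligands.
ligands = ["CC(=O)Oc1ccccc1C(=O)O", "CC(C)Cc1ccc(cc1)C(C)C(=O)O"]
print(protein_vector(ligands))
```

In practice the embedding lookup would come from a model trained on a large SMILES corpus, and pairwise similarities between the resulting protein vectors would be passed to the clustering algorithm.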

Using SMILESVec, we were able to define proteins based on their interacting ligands even in the absence of sequence or structure information. SMILESVec-based protein representation had better clustering performance than BLAST and comparable clustering performance to protein word-frequency based method, both of which use protein sequences. We should emphasize that SCOPe data sets were constructed based on protein similarity, thus high performance with the protein sequence-based models in family/super-family clustering is no surprise. However, having ligand-based protein representation methods, either learning from SMILES or represented with binary compound features, performing as well as protein sequence-based models is quite intriguing and promising.

SMILESVec and MACCS representation performed similarly in the task of protein clustering and better than Extended Fingerprint representation, suggesting that the word-embeddings approach that learns representations from a large SMILES corpus in an unsupervised manner is as accurate as a knowledge-based fingerprint model. We propose that the ligand-based representation of proteins might reveal important clues especially in protein-ligand interaction related tasks like drug specificity or identification of proteins for drug targeting. The similarity between a candidate ligand and the SMILESVec for a protein can be used as an indicator for a possible interaction.
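A minimal sketch of this idea follows. It assumes cosine similarity as the similarity measure and uses made-up three-dimensional vectors; the protein names and the function `rank_proteins` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_proteins(ligand_vec, protein_vecs):
    """Sort proteins by decreasing similarity to a candidate ligand vector."""
    scores = {name: cosine(ligand_vec, vec) for name, vec in protein_vecs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up ligand-based protein vectors and a made-up candidate ligand vector.
protein_vecs = {
    "protein_A": [0.9, 0.1, 0.0],
    "protein_B": [0.0, 0.8, 0.6],
}
candidate = [1.0, 0.2, 0.1]

for name, score in rank_proteins(candidate, protein_vecs):
    print(name, round(score, 3))
```

A high score would only flag a protein as a candidate for follow-up; it is an indicator of a possible interaction rather than evidence of binding.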

We would like to mention that the ASTRAL data sets contain domains rather than full-length proteins, while ChEMBL collects protein-ligand interaction information based on the whole protein sequence from UniProt. A multidomain protein may have multiple and diverse chemotypes of ligands binding to each domain, and retrieving ligand information based on the full-length protein may lump this disparate information together, leading to loss of information on domain-specific ligand interactions. Domain sequence based methods are therefore at an advantage, because family/super-family assignment in SCOPe is also based on domain sequence, whereas the ligand-based approach we use in SMILESVec relies on noisier data. Despite this disadvantage, the ligand-based approach performs as well as the sequence-based approaches. We hypothesize that if domain-ligand interactions are taken into account, ligand-based approaches would achieve higher performance.

The study we conducted here also showed that the SMILES description is sensitive to database definition conventions; therefore, SMILES strings require careful consideration. Since we collected the protein-ligand interaction and ligand SMILES information from the ChEMBL database to represent proteins, building SMILESVec vectors from chemical words trained on the ChEMBL SMILES corpus yielded a better F-measure than the model in which the Pubchem SMILES corpus was used for training the chemical words.

We showed that ligand-centric protein representation performed at least as well as protein sequence based representations in the clustering task, even in the absence of sequence information. Ligand-centric protein representation is only available for proteins with at least one known ligand interaction, while a sequence-based approach can miss key functional/mechanistic properties of the protein. The orthogonal information that can be obtained from the two approaches has been previously observed [27]. As future work, we will investigate combining both sequence and ligand information in protein representation. We believe that this approach will provide a deeper understanding of protein function and mechanism toward the use of these representations in clustering and other bioinformatics tasks such as function annotation and prediction of novel protein-drug interactions.


TUBITAK-BIDEB 2211-E Scholarship Program (to HO) and BAGEP Award of the Science Academy (to AO) are gratefully acknowledged. We thank Prof. Kutlu O. Ulgen and Mehmet Aziz Yirik for helpful discussions.


This work is funded by Bogazici University Research Fund (BAP) Grant Number 12304.


  •  1. S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. Basic local alignment search tool. Journal of molecular biology, 215(3):403–410, 1990.
  •  2. E. Asgari and M. R. Mofrad. Continuous distributed representation of biological sequences for deep proteomics and genomics. PloS one, 10(11):e0141287, 2015.
  •  3. K. V. Balakin. Pharmaceutical data mining: approaches and applications for drug discovery, volume 6. John Wiley & Sons, 2009.
  •  4. J. S. Bernardes, F. R. Vieira, L. M. Costa, and G. Zaverucha. Evaluation and improvements of clustering algorithms for detecting remote homologous protein families. BMC bioinformatics, 16(1):34, 2015.
  •  5. C. Cai, L. Han, Z. L. Ji, X. Chen, and Y. Z. Chen. Svm-prot: web-based support vector machine software for functional classification of a protein from its primary sequence. Nucleic acids research, 31(13):3692–3697, 2003.
  •  6. D.-S. Cao, J.-C. Zhao, Y.-N. Yang, C.-X. Zhao, J. Yan, S. Liu, Q.-N. Hu, Q.-S. Xu, and Y.-Z. Liang. In silico toxicity prediction by support vector machine and smiles representation-based string kernel. SAR and QSAR in Environmental Research, 23(1-2):141–153, 2012.
  •  7. R. Cao and J. Cheng. Integrated protein function prediction by mining function associations, sequences, and protein–protein and gene–gene interaction networks. Methods, 93:84–91, 2016.
  •  8. J.-M. Chandonia, N. K. Fox, and S. E. Brenner. Scope: Manual curation and artifact removal in the structural classification of proteins–extended database. Journal of molecular biology, 429(3):348–355, 2017.
  •  9. Y.-Y. Chiu, J.-H. Tseng, K.-H. Liu, C.-T. Lin, K.-C. Hsu, and J.-M. Yang. Homopharma: A new concept for exploring the molecular binding mechanisms and drug repurposing. BMC genomics, 15(9):S8, 2014.
  •  10. K.-C. Chou. Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins: Structure, Function, and Bioinformatics, 43(3):246–255, 2001.
  •  11. T. Cokelaer, D. Pultz, L. M. Harder, J. Serra-Musach, and J. Saez-Rodriguez. Bioservices: a common python package to access biological web services programmatically. Bioinformatics, 29(24):3241–3242, 2013.
  •  12. M. Davies, M. Nowotka, G. Papadatos, N. Dedman, A. Gaulton, F. Atkinson, L. Bellis, and J. P. Overington. Chembl web services: streamlining access to drug discovery data and utilities. Nucleic acids research, 43(W1):W612–W620, 2015.
  •  13. C. De Boom, S. Van Canneyt, T. Demeester, and B. Dhoedt. Representation learning for very short texts using weighted word embedding aggregation. Pattern Recognition Letters, 80:150–156, 2016.
  •  14. A. J. Enright, S. Van Dongen, and C. A. Ouzounis. An efficient algorithm for large-scale detection of protein families. Nucleic acids research, 30(7):1575–1584, 2002.
  •  15. N. K. Fox, S. E. Brenner, and J.-M. Chandonia. Scope: Structural classification of proteins—extended, integrating scop and astral data and classification of new structures. Nucleic acids research, 42(D1):D304–D309, 2013.
  •  16. M. Frasca and N. Cesa-Bianchi. Multitask protein function prediction through task dissimilarity. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2017.
  •  17. G. B. Goh, N. O. Hodas, C. Siegel, and A. Vishnu. Smiles2vec: An interpretable general-purpose deep neural network for predicting chemical properties. arXiv preprint arXiv:1712.02034, 2017.
  •  18. J. X. Hu, C. E. Thomas, and S. Brunak. Network biology concepts in complex disease comorbidities. Nature Reviews Genetics, 2016.
  •  19. M. J. Iqbal, I. Faye, A. M. Said, and B. B. Samir. A distance-based feature-encoding technique for protein sequence classification in bioinformatics. In Computational Intelligence and Cybernetics (CYBERNETICSCOM), 2013 IEEE International Conference on, pages 1–5. IEEE, 2013.
  •  20. R. Jain, J. R. Choudhury, A. Buku, R. E. Johnson, L. Prakash, S. Prakash, and A. K. Aggarwal. Mechanism of error-free dna synthesis across n1-methyl-deoxyadenosine by human dna polymerase-ι. Scientific reports, 7:43904, 2017.
  •  21. S. Jastrzębski, D. Leśniak, and W. M. Czarnecki. Learning to smile(s). arXiv preprint arXiv:1602.06289, 2016.
  •  22. M. J. Keiser, B. L. Roth, B. N. Armbruster, P. Ernsberger, J. J. Irwin, and B. K. Shoichet. Relating protein pharmacology by ligand chemistry. Nature biotechnology, 25(2):197, 2007.
  •  23. A. C. Martin, C. A. Orengo, E. G. Hutchinson, S. Jones, M. Karmirantzou, R. A. Laskowski, J. B. Mitchell, C. Taroni, and J. M. Thornton. Protein folds and functions. Structure, 6(7):875–884, 1998.
  •  24. T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.
  •  25. A. C. Nascimento, R. B. Prudêncio, and I. G. Costa. A multiple kernel learning algorithm for drug-target interaction prediction. BMC bioinformatics, 17(1):46, 2016.
  •  26. T. OEChem. Openeye scientific software. Inc., Santa Fe, NM, USA, 2012.
  •  27. M. J. O’Meara, S. Ballouz, B. K. Shoichet, and J. Gillis. Ligand similarity complements sequence, physical interaction, and co-expression for gene function prediction. PloS one, 11(7):e0160098, 2016.
  •  28. H. Öztürk, E. Ozkirimli, and A. Özgür. Classification of beta-lactamases and penicillin binding proteins using ligand-centric network models. PloS one, 10(2):e0117874, 2015.
  •  29. H. Öztürk, E. Ozkirimli, and A. Özgür. A comparative study of smiles-based compound similarity functions for drug-target interaction prediction. BMC bioinformatics, 17(1):128, 2016.
  •  30. G. Papadatos and J. P. Overington. The chembl database: a taster for medicinal chemists. Future, 6(4):361–364, 2014.
  •  31. K. Pearson. Note on regression and inheritance in the case of two parents. Proceedings of the Royal Society of London, 58:240–242, 1895.
  •  32. A. Peón, C. C. Dang, and P. J. Ballester. How reliable are ligand-centric methods for target fishing? Frontiers in chemistry, 4, 2016.
  •  33. P. Poornima, J. D. Kumar, Q. Zhao, M. Blunder, and T. Efferth. Network pharmacology of cancer: From understanding of complex interactomes to the design of multi-target specific therapeutics from nature. Pharmacological research, 111:290–302, 2016.
  •  34. J. A. Santiago and J. A. Potashkin. A network approach to clinical intervention in neurodegenerative diseases. Trends in molecular medicine, 20(12):694–703, 2014.
  •  35. M. Schenone, V. Dančík, B. K. Wagner, and P. A. Clemons. Target identification and mechanism of action in chemical biology and drug discovery. Nature chemical biology, 9(4):232–240, 2013.
  •  36. J. Schwartz, M. Awale, and J.-L. Reymond. Smifp (smiles fingerprint) chemical space for virtual screening and visualization of large databases of organic molecules. Journal of chemical information and modeling, 53(8):1979–1989, 2013.
  •  37. J.-Y. Shi, S.-M. Yiu, Y. Li, H. C. Leung, and F. Y. Chin. Predicting drug–target interaction for new drugs using enhanced similarity measures and super-target clustering. Methods, 83:98–104, 2015.
  •  38. D. Vidal, M. Thormann, and M. Pons. Lingo, an efficient holographic text based method to calculate biophysical properties and intermolecular similarities. Journal of chemical information and modeling, 45(2):386–393, 2005.
  •  39. Y. Wang, S. H. Bryant, T. Cheng, J. Wang, A. Gindulyte, B. A. Shoemaker, P. A. Thiessen, S. He, and J. Zhang. Pubchem bioassay: 2017 update. Nucleic acids research, 45(D1):D955–D963, 2016.
  •  40. E. L. Willighagen, J. W. Mayfield, J. Alvarsson, A. Berg, L. Carlsson, N. Jeliazkova, S. Kuhn, T. Pluskal, M. Rojas-Chertó, O. Spjuth, et al. The chemistry development kit (cdk) v2. 0: atom typing, depiction, molecular formulas, and substructure searching. Journal of Cheminformatics, 9(1):33, 2017.
  •  41. T. Wittkop, D. Emig, S. Lange, S. Rahmann, M. Albrecht, J. H. Morris, S. Böcker, J. Stoye, and J. Baumbach. Partitioning biological data with transitivity clustering. Nature methods, 7(6):419–420, 2010.
  •  42. J. Yang, Z. Chen, Y. Liu, R. J. Hickey, and L. H. Malkas. Altered dna polymerase expression in breast cancer cells leads to a reduction in dna replication fidelity and a higher rate of mutagenesis. Cancer research, 64(16):5597–5607, 2004.
  •  43. S. Zou, Z.-F. Shang, B. Liu, S. Zhang, J. Wu, M. Huang, W.-Q. Ding, and J. Zhou. Dna polymerase iota (pol ι) promotes invasion and metastasis of esophageal squamous cell carcinoma. Oncotarget, 7(22):32274, 2016.