
Learning to SMILE(S)

This paper shows how natural language processing (NLP) methods can be applied directly to classification problems in cheminformatics. The connection between these seemingly separate fields is made through SMILES, the standard textual representation of a compound. We consider the problem of predicting activity against a target protein, which is a crucial part of the computer-aided drug design process. The experiments show that this approach not only outperforms state-of-the-art results obtained with hand-crafted representations but also gives direct structural insight into how decisions are made.





1 Introduction

Computer-aided drug design has become a very popular technique for speeding up the search for new biologically active compounds by drastically reducing the number of compounds that have to be tested in the laboratory. A crucial part of this process is virtual screening, where one considers a set of molecules and predicts whether they will bind to a given protein. This research focuses on ligand-based virtual screening, where the problem is modelled as a supervised binary classification task using only knowledge about the ligands (drug candidates) rather than information about the target (protein).

Cheminformatics is believed to be one of the most underrepresented application areas of deep learning (DL) (Unterthiner et al., 2014; Bengio et al., 2012), mostly due to the fact that the data is naturally represented as graphs and there are few direct ways of applying DL in such a setting (Henaff et al., 2015). Notable examples of DL successes in this domain are the winning entry to the Merck competition in 2012 (Dahl et al., 2014) and a Convolutional Neural Network (CNN) used for improving data representation (Duvenaud et al., 2015). To the authors' best knowledge, all of the above methods either use hand-crafted representations (called fingerprints) or use DL methods in a limited fashion. The main contribution of the paper is showing that one can directly apply DL methods (without any customization) to the textual representation of a compound (where characters are atoms and bonds). This is analogous to recent work showing that state-of-the-art performance in language modelling can be achieved with a character-level representation of text (Kim et al., 2015; Jozefowicz et al., 2016).

1.1 Representing molecules

The standard way of representing a compound in any chemical database is SMILES, which is simply a string of the atoms and bonds constituting the molecule (see Fig. 1), produced by a specific walk over the molecular graph. Quite surprisingly, this representation is rarely used as a basis for machine learning (ML) methods (Worachartcheewan et al., 2014; Toropov et al., 2010).

Most of the classical ML models used in cheminformatics (such as Support Vector Machines or Random Forest) work with a constant-size vector representation obtained through some predefined embedding (called a fingerprint). As a result, many such fingerprints have been proposed over the years (Hall and Kier, 1995; Steinbeck et al., 2003). Among the most common are the substructural ones, an analogue of the bag-of-words representation in NLP: a fingerprint is defined as a set of graph templates (SMARTS), which are matched against the molecule to produce a binary (set of words) or count (bag of words) representation. One could ask whether this is really necessary, given that DL methods of feature learning are at one's disposal.
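The set-of-words versus bag-of-words distinction can be illustrated with a minimal sketch. It is only a toy: real substructural fingerprints match SMARTS graph templates against the molecular graph, whereas this version counts plain substrings of the SMILES string, and the template list `PATTERNS` is a hypothetical choice made for the example.

```python
# Toy stand-in for a substructural fingerprint. Assumption: plain substring
# matching on the SMILES string replaces real SMARTS graph matching.
PATTERNS = ["C(=O)O", "N", "c1ccccc1", "Cl"]  # hypothetical templates


def count_occurrences(text, pattern):
    """Count (possibly overlapping) occurrences of pattern in text."""
    count, start = 0, text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def count_fingerprint(smiles, patterns=PATTERNS):
    """Bag-of-words analogue: how many times each template occurs."""
    return [count_occurrences(smiles, p) for p in patterns]


def binary_fingerprint(smiles, patterns=PATTERNS):
    """Set-of-words analogue: whether each template occurs at all."""
    return [int(c > 0) for c in count_fingerprint(smiles, patterns)]


smiles = "NCC(=O)OCCN"
counts = count_fingerprint(smiles)   # [1, 2, 0, 0]
bits = binary_fingerprint(smiles)    # [1, 1, 0, 0]
```

Either vector has a fixed length equal to the number of templates, which is what allows classifiers such as SVM or Random Forest to consume molecules of different sizes.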

1.2 Analogy to sentiment analysis

The main contribution of this paper is identifying an analogy to NLP, and specifically to sentiment analysis, which is tested by applying state-of-the-art methods (Mesnil et al., 2014) directly to the SMILES representation. The analogy is motivated by two facts. First, small local changes to the structure can imply a large change in overall activity (see Fig. 2), just as the sentiment of a sentence is a function of the sentiments of its clauses and their connections, which is the main argument for the effectiveness of DL methods in this task (Socher et al., 2013). Second, perhaps surprisingly, a compound graph is almost always nearly a tree. To confirm this claim we calculate molecule diameters, where the diameter is defined as the maximum over all atoms of the minimum distance between a given atom and the longest carbon chain in the molecule. In practice, the analyzed molecules have a diameter between 1 and 6, with a mean of 4. Similarly, despite the way people write down text, human thoughts are not linear, and sentences can have complex clauses. In conclusion, in organic chemistry one can draw an analogy between the longest carbon chain and a sentence, where the branches stemming from the longest chain play the role of clauses in NLP.
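The diameter defined above can be sketched in a few lines. This is an illustration under stated assumptions rather than the procedure used in the experiments: the molecule is given as a hand-written adjacency list (no SMILES parsing), the longest carbon chain is found by exhaustive search, which is feasible only for small molecules, ties between equally long chains are broken arbitrarily, and distance is counted in bonds. The names `longest_carbon_chain` and `molecule_diameter` are hypothetical.

```python
from collections import deque


def longest_carbon_chain(elements, bonds):
    """Longest simple path visiting only carbon atoms (exhaustive DFS)."""
    best = []

    def extend(path, seen):
        nonlocal best
        if len(path) > len(best):
            best = list(path)
        for nxt in bonds[path[-1]]:
            if nxt not in seen and elements[nxt] == "C":
                seen.add(nxt)
                path.append(nxt)
                extend(path, seen)
                path.pop()
                seen.remove(nxt)

    for atom, element in elements.items():
        if element == "C":
            extend([atom], {atom})
    return best


def molecule_diameter(elements, bonds):
    """Max over atoms of the bond distance to the nearest chain atom."""
    chain = longest_carbon_chain(elements, bonds)
    dist = {atom: 0 for atom in chain}
    queue = deque(chain)
    while queue:  # multi-source breadth-first search from the chain
        atom = queue.popleft()
        for nxt in bonds[atom]:
            if nxt not in dist:
                dist[nxt] = dist[atom] + 1
                queue.append(nxt)
    return max(dist.values(), default=0)


# Hand-built graph for CC(NO)C: a three-carbon chain with an N-O branch.
elements = {0: "C", 1: "C", 2: "C", 3: "N", 4: "O"}
bonds = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1, 4], 4: [3]}
diameter = molecule_diameter(elements, bonds)  # oxygen is 2 bonds away
```

A small diameter means that every atom sits close to the main chain, which is the sense in which a molecule resembles a sentence with short clauses attached to it.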

Figure 1: SMILES produced for the compound in the figure is N(c1)ccc1N.
Figure 2: Substituting highlighted carbon atom with nitrogen renders compound inactive.
Figure 3: Visualization of CNN filters of size 5 for active (top row) and inactive molecules.

2 Experiments

Five datasets are considered. Besides SMILES, two baseline fingerprint representations of compounds are used, namely MACCS (Ewing et al., 2006) and Klekota–Roth (Klekota and Roth, 2008), abbreviated KR and considered state of the art in substructural representation (Czarnecki et al., 2015). Each dataset is fairly small (the mean size is 3000) and most of the datasets are slightly imbalanced (with a mean class ratio of around 1:2). It is worth noting that chemical databases are usually fairly big (ChEMBL contains 1.5M compounds), which hints at possible gains from semi-supervised learning techniques.

The tested models include both traditional classifiers, namely Support Vector Machine (SVM) with the Jaccard kernel, Naive Bayes (NB) and Random Forest (RF), and neural network models: Recurrent Neural Network Language Model (RNNLM) (Mikolov et al., 2011b), Recurrent Neural Network (RNN) many-to-one classifier, Convolutional Neural Network (CNN) and Feed Forward Neural Network with ReLU activation. The models were selected to meet two criteria: to span state-of-the-art models in single-target virtual screening (Czarnecki et al., 2015; Smusz et al., 2013) and to cover state-of-the-art models in sentiment analysis. For CNN and RNN a form of data augmentation is used: for each molecule, random SMILES walks are computed and the predictions are averaged (not doing so strongly degrades performance, mostly due to overfitting). For methods that are not designed to work on a string representation (such as SVM, NB and RF), SMILES are embedded as n-gram models with simple tokenization ([Na+] becomes a single token). For all the remaining ones, SMILES are treated as strings composed of 2-character symbols (thus capturing an atom and its relation to the next one).
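The two string encodings described above might look as follows. This is a minimal sketch, not the preprocessing actually used: the text only states that `[Na+]` becomes a single token, so treating the two-letter halogens `Cl` and `Br` as single tokens is an added assumption, and reading "2-character symbols" as overlapping windows over the raw string is one possible interpretation. The function names are hypothetical.

```python
import re
from collections import Counter

# Assumption: bracket atoms such as [Na+] and the two-letter halogens
# Cl and Br are single tokens; every other character is its own token.
TOKEN_RE = re.compile(r"\[[^\]]+\]|Br|Cl|.")


def tokenize(smiles):
    """Split a SMILES string into tokens."""
    return TOKEN_RE.findall(smiles)


def ngram_counts(smiles, n=2):
    """Bag of token n-grams, usable as input for SVM, NB or RF."""
    tokens = tokenize(smiles)
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def two_char_symbols(smiles):
    """Overlapping 2-character windows over the raw string, so that each
    symbol pairs a character with the one that follows it."""
    return [smiles[i:i + 2] for i in range(len(smiles) - 1)]


tokens = tokenize("[Na+].CCl")         # ['[Na+]', '.', 'C', 'Cl']
bigrams = ngram_counts("CCCO", n=2)    # ('C', 'C') twice, ('C', 'O') once
symbols = two_char_symbols("CC=O")     # ['CC', 'C=', '=O']
```

The n-gram counts give a fixed vocabulary of features for the traditional classifiers, while the symbol sequence keeps the order information that the recurrent and convolutional models rely on.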

Using RNNLM, p(compound | active) and p(compound | inactive) are modelled separately and classification is done through logistic regression fitted on top. For CNN, a purely supervised version of ConText, the current state of the art in sentiment analysis (Johnson and Zhang, 2015), is used. A notable feature of this model is that it works directly on a one-hot representation of the data. Each model is evaluated using 5-fold stratified cross-validation. An internal 5-fold grid search is used for fitting hyperparameters (truncated in the case of deep models). We use log loss as the evaluation metric in order to account for both the classification results and the uncertainty measure provided by the models. Similar conclusions hold for accuracy.
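To show why log loss captures uncertainty in addition to classification results, here is a minimal sketch of the metric for binary labels. The function name, the clipping constant `eps` and the toy predictions are assumptions made for the illustration.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of binary labels under the predicted
    probabilities of the positive (active) class."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


labels = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.9, 0.1]
hesitant = [0.6, 0.4, 0.6, 0.4]  # same decisions at a 0.5 threshold

loss_confident = log_loss(labels, confident)
loss_hesitant = log_loss(labels, hesitant)
```

Both sets of predictions yield the same accuracy at a 0.5 threshold, yet the confident and correct model obtains the lower log loss, so the metric distinguishes models that accuracy alone would rank equally.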

2.1 Results

Results are presented in Table 1. First, the performance of simple n-gram models (SVM, RF) is close to that of the hand-crafted state-of-the-art representation, which suggests that potentially any NLP classifier working on an n-gram representation might be applicable. Perhaps even more interestingly, CNN, the current state-of-the-art model for sentiment analysis, outperforms the traditional models (albeit by a small margin) despite the small dataset size.

Table 1: Log-loss (± std) of each model for a given protein and representation. [Table body not recovered; column headers: model, 5-HT, 5-HT, 5-HT, H1, SERT.]

The hyperparameters selected for CNN (ConText) are similar to those reported by Johnson and Zhang (2015). In particular, maximum pooling (as opposed to average pooling) and moderately sized regions (5 and 3) performed best (see Fig. 3). In NLP this effect is strongly correlated with the fact that a small portion of a sentence can contribute strongly to the overall sentiment, which supports the claimed analogy between molecules and sentiment.
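The difference between maximum and average pooling can be seen in a small sketch. It is an illustration only: the filter below is set by hand rather than learned, uses region size 3 for brevity, and responds to the hypothetical motif `C=O`; none of these choices come from the experiments.

```python
def filter_responses(smiles, filt):
    """Slide a filter over the string. With one-hot inputs, the dot
    product reduces to summing the weight of the character seen at
    each position of the window."""
    width = len(filt)
    return [
        sum(filt[j].get(smiles[i + j], 0.0) for j in range(width))
        for i in range(len(smiles) - width + 1)
    ]


# Hypothetical hand-set filter of region size 3 that responds to "C=O".
FILTER = [{"C": 1.0}, {"=": 1.0}, {"O": 1.0}]

responses = filter_responses("CCCCCCC=O", FILTER)
max_pooled = max(responses)                    # dominated by the motif
avg_pooled = sum(responses) / len(responses)   # diluted by the chain
```

With maximum pooling the single window containing the motif determines the feature value, whereas average pooling dilutes it across the long carbon chain; this mirrors the observation that one small fragment can decide the activity of a whole molecule.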

The low performance of the RNN classifier can be attributed to the small dataset sizes, as RNNs are commonly applied to significantly larger volumes of data (Mikolov et al., 2011a). One alternative is to consider a semi-supervised version of RNN (Dai and Le, 2015). Another problem is that compound activity prediction requires remembering very long-range interactions, especially since atoms that are neighbours in a SMILES walk are often not connected in the original molecule.

3 Conclusions

This work focuses on the problem of compound activity prediction without hand-crafted features for representing complex molecules. The presented analogies with NLP problems, in particular sentiment analysis, together with experiments using state-of-the-art methods from both NLP and cheminformatics, seem to confirm that one can learn directly from the raw SMILES string instead of the currently used embeddings. In particular, the experiments show that, despite being trained on relatively small datasets, a CNN-based solution can outperform state-of-the-art methods based on structural fingerprints in the ligand-based virtual screening task. At the same time, it makes it easy to incorporate unsupervised and semi-supervised techniques into the models, making use of huge databases of chemical compounds. It appears that cheminformatics can benefit strongly from NLP, and further research in this direction should be conducted.


The first author was supported by Grant No. DI 2014/016644 from the Ministry of Science and Higher Education, Poland.