
Stepping Back to SMILES Transformers for Fast Molecular Representation Inference

At the intersection of molecular science and deep learning, tasks like virtual screening have driven the need for a high-throughput molecular representation generator for large chemical databases. However, as SMILES strings are the most common storage format for molecules, using deep graph models to extract molecular features from raw SMILES data requires a SMILES-to-graph conversion, which significantly slows down the whole process. Directly deriving molecular representations from SMILES is feasible, yet there is a performance gap between existing unpretrained SMILES-based models and graph-based models on large-scale benchmarks, while pretrained models are resource-demanding to train. To address this issue, we propose ST-KD, an end-to-end SMILES Transformer for molecular representation learning boosted by Knowledge Distillation. To enable knowledge transfer from graph Transformers to ST-KD, we redesign the attention layers and introduce a pre-transformation step that tokenizes the SMILES strings and injects structure-based positional embeddings. Without expensive pretraining, ST-KD shows competitive results on the recent standard molecular datasets PCQM4M-LSC and QM9, with 3-14× faster inference than existing graph models.





1 Introduction

Recent years have witnessed remarkable progress in the combination of deep learning and molecular science, on tasks including molecular property prediction (Xiong et al., 2019; Li et al., 2021), conformation generation (Xu et al., 2021), and virtual screening for drug discovery (Gentile et al., 2020; Stevenson et al., 2021). Meanwhile, encoding molecules into numerical vectors, or molecular representation, serves as the very first yet important step. As the 2D structures of molecules, i.e. atom-bond connectivity, can be naturally represented as graphs, neural networks for graph data have become one of the predominant choices. Exploiting network architectures from graph message-passing networks (Gilmer et al., 2017) to Transformers (Rong et al., 2020; Ying et al., 2021), these graph-based methods have reached state-of-the-art performance on standard benchmark datasets.

However, real applications such as virtual screening require operations on very large datasets, so high-throughput molecular encoders are indispensable. In most existing chemical libraries such as ZINC (Irwin and Shoichet, 2005), molecules are stored as SMILES (Simplified Molecular-Input Line-Entry System; Weininger, 1988), a line notation format with strict generating rules that encodes complete 2D structural information. To apply graph-based models to SMILES inputs, an additional SMILES-to-graph conversion is needed. Because it relies on an unparallelizable deterministic algorithm (Weininger, 1990) with a high time cost, as our experiments show, this conversion has become a serious efficiency bottleneck that keeps graph-based models from achieving high feature extraction throughput on SMILES datasets. Researchers have also been exploring ways to generate molecular representations (or fingerprints) from SMILES strings in an end-to-end fashion, such as the topological ECFP (Rogers and Hahn, 2010) and Transformer-based SMILES language models (Honda et al., 2019; Wang et al., 2019; Chithrananda et al., 2020). However, without pretraining, their overall benchmark results on standard datasets are outperformed by basic graph-based models like GCN (Kipf and Welling, 2016). For competitive pretrained models like MoLFormer (Ross et al., 2021), the inference speed is largely restricted by the huge model size, and pretraining requires massive SMILES data and computational resources.

Based on the discussion above, we propose ST-KD, an end-to-end SMILES Transformer architecture incorporating Knowledge Distillation (KD) from graph Transformers. Considering that SMILES implicitly encodes the complete 2D molecular structure, we believe the key reason for the low performance of current SMILES-based models is that, without structure-related supervision, the hidden atom-bond connectivity within SMILES is not well captured in training. During KD from the graph Transformer (teacher) to the SMILES Transformer (student), the student is forced to mimic the teacher while learning structural knowledge of molecules, since the distilled hidden representations and attention weights are concentrated knowledge that the teacher has learned from molecular graphs. This knowledge transfer process brings a substantial performance boost to the SMILES Transformer with little loss of efficiency, as shown in Figure 1.


As SMILES is generated from strict grammar rules and contains different types of tokens (for atoms, bonds, ring-breaks, etc.), applying standard NLP preprocessing methods leads to faulty tokenization and ambiguous input embeddings. Hence, we build an efficient module that transforms SMILES strings into unified token sequences with clear chemical definitions, and then attaches structure-based positional encodings to them. Besides, the graph Transformer and the SMILES Transformer follow different procedures to update token embeddings and compute pairwise attention interactions, so most standard Transformer distillation methods are infeasible. To solve this problem, in addition to feature distillation, we add parameterized attention biases to ST-KD’s self-attention layers and partially supervise the biases to transfer knowledge in attention blocks. Ablation studies show that our redesigned distillation techniques are essential to the teaching results.

By directly taking SMILES as input, ST-KD avoids the SMILES-to-graph bottleneck and runs faster than graph-based models at inference. Compared to pretrained SMILES models, ST-KD is much more lightweight and efficient to train, without compromising performance on most tasks. Boosted by the solid network structure and knowledge distillation from a cutting-edge graph Transformer (Ying et al., 2021), ST-KD delivers competitive, even state-of-the-art results on public molecular benchmarks like MoleculeNet (Wu et al., 2018) and OGB-LSC (Hu et al., 2020). Code for reproducing our results is provided in the supplementary material.

Figure 1: Inference time vs. accuracy (validation MAE, inversely ordered) on PCQM4M-LSC. ST-KD is faster than existing graph-based models while remaining competitive in accuracy. Explanations of the dataset and baseline models are in Section 4.

2 Related Work

SMILES and SMILES-based Fingerprint Generators

SMILES (simplified molecular-input line-entry system), first proposed in (Weininger, 1988; Weininger et al., 1989; Weininger, 1990), is a universal line notation format for representing 2D molecular structure. For example, the SMILES string c1ccccc1C specifies the structure of toluene. SMILES has a complex grammar and every symbol has a specific chemical definition, e.g. C and N represent a carbon atom and a nitrogen atom, - and = represent a single bond and a double bond, and 1 and ( represent (part of) a ring-break and a branching bond. An important fact is that the original SMILES system does not define a bijective mapping between the set of all possible SMILES sequences and molecules. A legal SMILES sequence translates into exactly one molecule, but a molecule can have multiple SMILES representations, e.g. both CCO and OCC specify ethanol.
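To make the symbol classes above concrete, the following is a minimal sketch of a regex-based SMILES tokenizer. It is not the parsing module used by ST-KD: the pattern `SMILES_TOKEN` and the helpers `tokenize` and `classify` are hypothetical names introduced here, and the pattern covers only common symbols (bracket atoms, the organic subset, ring labels, bonds and branches).

```python
import re

# Hypothetical pattern for illustration; it covers common SMILES symbols only.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]"      # bracket atoms such as [NH4+] or [C@@H]
    r"|Br|Cl"           # two-letter atoms of the organic subset
    r"|[BCNOPSFI]"      # one-letter aliphatic atoms
    r"|[bcnops]"        # aromatic atoms
    r"|%\d{2}|\d"       # ring-break labels
    r"|[-=#:/\\]"       # bond symbols
    r"|[().])"          # branches and the disconnection dot
)


def tokenize(smiles):
    """Split a SMILES string into symbols, rejecting unknown characters."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in {smiles!r}")
    return tokens


def classify(token):
    """Assign a coarse chemical role to a single SMILES symbol."""
    if token[0] == "[" or token[0].isalpha():
        return "atom"
    if token in "-=#:/\\":
        return "bond"
    if token[0].isdigit() or token[0] == "%":
        return "ring"
    if token == ".":
        return "dot"
    return "branch"  # "(" or ")"


if __name__ == "__main__":
    for smiles in ("c1ccccc1C", "CCO", "OCC", "CC(=O)Cl"):
        print(smiles, [(t, classify(t)) for t in tokenize(smiles)])
```

Note that CCO and OCC produce different token sequences although both denote ethanol, which illustrates why the mapping from SMILES strings to molecules is not bijective.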

Before the age of deep learning, researchers developed multiple SMILES-based molecular fingerprint generators, including hash-based methods (Glen et al., 2006; Hu et al., 2009; Rogers and Hahn, 2010) and task-driven methods (O’Boyle et al., 2011). Extended-Connectivity FingerPrint (ECFP) (Rogers and Hahn, 2010) is a widely used method for efficiently generating topological structure fingerprints from SMILES. On the other hand, the recent surge of deep models has inspired researchers to build SMILES fingerprint generators by training from data without expert knowledge (Xu et al., 2017; Honda et al., 2019; Zhang et al., 2018; Wang et al., 2019; Chithrananda et al., 2020). Treating SMILES as a formal language, most of these methods take NLP models like Transformer (Vaswani et al., 2017) and BERT (Devlin et al., 2018) as the backbone, then train the model with the pretrain-finetune paradigm for downstream tasks. These methods have shown promising results in specific tasks like molecular property prediction (Ross et al., 2021), chemical reaction prediction (Schwaller et al., 2019) and drug discovery (Honda et al., 2019). For molecular property prediction, current SMILES-based models only reach competitive performance against graph-based models with extensive pretraining, as in (Ross et al., 2021; Irwin et al., 2021). Compared with large-scale pretraining on massive data, the distillation strategy we employ requires no big data and is much more effective and efficient, as discussed in Section 4.2.

Graph-based Molecular Representation Learning

Graph-based deep models (Duvenaud et al., 2015; Gilmer et al., 2017; Xiong et al., 2019; Rong et al., 2020; Ying et al., 2021) have taken the leading role in molecular representation learning. Some studies (Cho and Choi, 2019; Klicpera et al., 2020; Song et al., 2020; Li et al., 2021) have also begun to enhance molecular representations with 3D conformation. Graphormer (Ying et al., 2021) is a recently proposed graph Transformer model for graph representation learning, which attains state-of-the-art performance on many public datasets and won the OGB Large-Scale Challenge (Hu et al., 2021) on the quantum chemistry regression dataset. In our experiments, we pick Graphormer as the teacher model, considering its outstanding performance and Transformer-based architecture.

Transformer Distillation and Cross-modal Knowledge Transfer

Transformer (Vaswani et al., 2017) has become one of the most popular building blocks in deep models thanks to its powerful self-attention mechanism, which can capture long-term dependencies within sequences of tokens. Transformer-based pretrained language models like BERT (Devlin et al., 2018) exhibit excellent performance but are also computationally expensive. Many works attempt to compress Transformer-based models with knowledge distillation (Jiao et al., 2019; Sun et al., 2019a, 2020; Wang et al., 2020). The distilled knowledge may be soft target probabilities, embedding outputs, hidden representations or attention weight distributions. Existing works have also explored cross-modal knowledge transfer, including contrastive cross-modal distillation (Tian et al., 2019) and graph-to-text distillation (Dong et al., 2020). ST-KD is built upon the Transformer architecture and borrows some insights from the distillation methods above. However, since SMILES differs fundamentally from other types of structured knowledge, the distillation techniques we apply are completely redesigned to bridge the knowledge transfer between graph-based models and SMILES-based models.

3 Method

Figure 2: Overview of the proposed ST-KD architecture.

3.1 Preliminaries

Problem Formulation

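We formulate molecular representation learning as a supervised learning task that takes notations of molecules as inputs and properties of molecules as supervision. A molecule $\mathcal{M}$ can be denoted as an attributed graph $\mathcal{G} = (\mathcal{V}, \mathcal{E}, X_{\mathcal{V}}, X_{\mathcal{E}})$, or as a SMILES line notation $S = (s_1, \dots, s_l)$, where $\mathcal{V}$ is the set of atoms, $\mathcal{E}$ is the set of bonds, $X_{\mathcal{V}}$ is the atom feature matrix, $X_{\mathcal{E}}$ is the bond feature matrix, and $s_1, \dots, s_l$ are SMILES symbols.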


Figure 2 illustrates the proposed ST-KD architecture. SMILES strings are first transformed to generate initial input embeddings, which are then sent to the stacked layers. ST-KD is built from standard Transformer layers and the biased Transformer layers we introduce in Section 3.3, which are designed to enable knowledge transfer from the graph Transformer attention blocks to the SMILES Transformer.

Transformer and Multi-head Attention

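A Transformer encoder layer is composed of a multi-head attention (MHA) block and feed-forward networks (FFN) with residual connections. Let $H \in \mathbb{R}^{n \times d}$ be the input embeddings; the calculation of multi-head attention can be formulated as:

$$Q_i = H W_i^Q, \quad K_i = H W_i^K, \quad V_i = H W_i^V,$$

$$\mathrm{head}_i = \mathrm{softmax}\!\left(\frac{Q_i K_i^\top}{\sqrt{d_k}}\right) V_i, \quad i = 1, \dots, h,$$

$$\mathrm{MHA}(H) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O,$$

where $n$ is the length of the input sequence, $h$ is the number of attention heads, $W_i^Q, W_i^K \in \mathbb{R}^{d \times d_k}$, $W_i^V \in \mathbb{R}^{d \times d_v}$ and $W^O \in \mathbb{R}^{h d_v \times d}$ are projection parameter matrices, and $d$, $d_k$, $d_v$ are the dimensions of hidden layers, keys and values.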

Graph Transformer

Graph Transformers (Ying et al., 2021; Dwivedi and Bresson, 2020) generally treat graphs as node sequences, update the node representations using the attention mechanism and linear layers, and incorporate structural information into the Transformer architecture through positional encodings and attention biases. A given molecule is fed into the graph Transformer model as a featurized graph whose initial token representations consist of one token per atom plus a virtual token that stands for the molecule embedding. These token representations are then updated by stacked graph Transformer layers to produce the final result.

3.2 SMILES Pre-Transformation and Input Embeddings

In this section, we transform irregular SMILES sequences into a unified representation form, in which all atom and bond tokens (some of which may be omitted in the original SMILES) are parsed out and embedded with structural information.

SMILES Pre-Transformation

In the SMILES pre-transformation, we convert a general SMILES line notation into a sequence of SMILES symbols in which every token is either an atom symbol or a bond symbol. Atom-bond connectivity is also decoded in this process for computing positional embeddings. We build a SMILES parsing module to perform this step; because the SMILES grammar is extremely complex, we omit its detailed description here and refer readers to the code for implementation details.

Input Representations

A virtual token [V] is first added to the processed sequence; it represents the molecule embedding, and its representation is updated like that of a normal token, similar to the [CLS] token in BERT. Then, for a given token, its input representation is constructed by summing its token embedding and its positional embedding. We use a SMILES symbol vocabulary to generate learnable token embeddings, and positional embeddings are calculated from the molecular structure decoded in the previous step. We first initialize a learnable parameter tensor

, where is maximum number of atoms of molecules in the dataset, is the hidden size. Then in the processed SMILES token sequence with special token , for atom token with token embedding , its input embedding is computed by


for bond token with token embedding and links atom and , its input embedding is generated using


as we assign no position embeddings to the special token [V], its input embedding equals its token embedding. The positional embeddings we inject into the input representations are flexible and able to encode complete structural information. Finally, the input embeddings can be formulated as


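From here on, we assume that atom tokens are indexed consistently across the two models, so that the $i$-th atom token in the SMILES token sequence and the $i$-th atom token in the graph token sequence represent the same atom of the molecule. Under this assumption, for every pair $(i, j)$, the attention interaction from the $i$-th token to the $j$-th token stands for the same atom-atom correlation in the SMILES Transformer and the graph Transformer.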

3.3 SMILES Transformer with Knowledge Distillation

In this section we discuss the knowledge distillation from graph Transformer (GT) to SMILES Transformer (ST), a knowledge transfer process that bridges models with distinct input representations.


For distillation between Transformer layers in language models, features like intermediate states and self-attention distributions can be used as transferred knowledge (Sun et al., 2019b; Jiao et al., 2019; Wang et al., 2020). However, most existing methods cannot be applied to distillation from GT to ST because the token sequences do not match: an input token sequence of ST contains atom tokens, bond tokens and the virtual token, while that of GT contains only atom tokens and the virtual token. For instance, if we simply distill the GT atom token embeddings or transfer the atom-atom attention weights, our experiments show that this unbalanced teaching (no bond token representations or atom-bond token interactions in ST are supervised) yields only marginal improvements.

Global View of the Distillation Method

As illustrated in Figure 2, our proposed method for distilling knowledge from a GT layer to an ST layer has two parts. First, both models use a virtual token for updating molecular embeddings, so we perform a straightforward feature distillation on the virtual token embeddings to transfer the teacher’s learned molecular features while avoiding the mismatch that arises when distilling on other tokens. Second, to transfer the GT attention weights without disrupting the attention interactions in ST, we introduce a supportive attention bias module to ST layers. This attention bias module learns the atom-atom attention distributions of the teacher model and highlights the learned pairwise interactions by adding scalar biases in the calculation of attention weights. We describe the detailed distillation scheme in the following paragraphs.
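The second part can be sketched for a single attention head as follows. This is only an illustration under stated assumptions: the bias matrix is passed in as a given input (how ST-KD parameterizes it is not reproduced here), the mask keeps pairs in which both tokens are atoms or the virtual token, and the names `atom_pair_mask` and `biased_attention_weights` are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def atom_pair_mask(kinds):
    """1.0 where both tokens are atoms or the virtual token, else 0.0.

    Assumption: virtual-virtual pairs are kept as well.
    """
    keep = [k in ("atom", "virtual") for k in kinds]
    return [[1.0 if a and b else 0.0 for b in keep] for a in keep]


def biased_attention_weights(Q, K, bias, mask):
    """Compute softmax(Q K^T / sqrt(d_k) + mask * bias) for one head."""
    d_k = len(Q[0])
    weights = []
    for p, q_vec in enumerate(Q):
        logits = []
        for r, k_vec in enumerate(K):
            score = sum(a * b for a, b in zip(q_vec, k_vec)) / math.sqrt(d_k)
            # The scalar bias only contributes where the mask allows it.
            logits.append(score + mask[p][r] * bias[p][r])
        weights.append(softmax(logits))
    return weights


if __name__ == "__main__":
    kinds = ["virtual", "atom", "bond"]
    Q = [[0.0, 0.0] for _ in kinds]
    K = [[0.0, 0.0] for _ in kinds]
    bias = [[0.0, 5.0, 5.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
    for row in biased_attention_weights(Q, K, bias, atom_pair_mask(kinds)):
        print([round(w, 3) for w in row])
```

In this toy input, the bias on the virtual-atom pair raises the corresponding attention weight, while the equally large bias on the virtual-bond pair is masked out and has no effect.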

Molecular Feature Distillation

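Consider the distillation from layer $i$ of GT to layer $j$ of ST, with corresponding output molecule embeddings $h^{T}_{i}$ and $h^{S}_{j}$. Then the feature distillation loss can be defined as

$$\mathcal{L}_{\mathrm{feat}}^{(i,j)} = \mathrm{MSE}\left(h^{S}_{j} W_f,\; h^{T}_{i}\right),$$

where $\mathrm{MSE}$ specifies the mean square loss function and $W_f$ is a learnable linear transformation matrix that transforms the features of the student into the same space as the teacher’s features; it can be removed if the student and the teacher have the same hidden dimension.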
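As a small numeric sketch of this loss, the snippet below computes the mean squared error between a projected student embedding and a teacher embedding in plain Python. The function names and the toy vectors are hypothetical, and in real training the projection matrix would be a learnable parameter updated by gradient descent rather than a fixed list.

```python
def project(vec, W):
    """Multiply a row vector (length d_s) by a matrix W (d_s x d_t)."""
    return [sum(v * W[i][j] for i, v in enumerate(vec)) for j in range(len(W[0]))]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def feature_distillation_loss(student_vn, teacher_vn, W=None):
    """MSE between the (optionally projected) student and teacher embeddings.

    W may be omitted when both embeddings already share one dimension.
    """
    projected = student_vn if W is None else project(student_vn, W)
    if len(projected) != len(teacher_vn):
        raise ValueError("student and teacher dimensions do not match")
    return mse(projected, teacher_vn)


if __name__ == "__main__":
    student = [1.0, 2.0]          # toy virtual-token embedding of the student
    teacher = [1.0, 2.0, 3.0]     # toy virtual-token embedding of the teacher
    W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    print(feature_distillation_loss(student, teacher, W))
```

Only the virtual-token embeddings enter this loss, which is why the different token sequence lengths of the two models do not matter here.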


Biased Multi-head Attention

We have introduced the multi-head attention mechanism (MHA) in Section 3.1. Here we define the biased Transformer layer by adding learnable attention biases to MHA, which serve as a side module that learns and emphasizes knowledge from the teacher attention weights. Following the notations, for every attention head , the attention bias matrix is calculated by


where are learnable attention bias projections not shared among heads. With an associated attention bias mask , the computing step of biased multi-head attention block can be defined as

Attention Weight Distillation via Biases

The proposed ST-KD consists of standard Transformer layers and biased Transformer layers. We perform the distillation from layer $i$ of GT to biased Transformer layer $j$ of ST-KD. Suppose layer $i$ outputs the attention weight tensor $A^{T} \in \mathbb{R}^{h_T \times m \times m}$, where $h_T$ is the number of attention heads and $m$ is the number of GT tokens. Our goal is to transfer the knowledge in $A^{T}$ to layer $j$ via attention biases. We first stack the bias matrices of all attention heads in the MHA of layer $j$ into the attention bias tensor $B \in \mathbb{R}^{h_S \times l \times l}$, where $l$ is the number of ST tokens, and we introduce an attention mask matrix $M \in \{0, 1\}^{l \times l}$ to mask out pairwise attention interactions except atom-atom or atom-vn (virtual node) ones. Specifically, the definition is

$$M_{pq} = \begin{cases} 1, & \text{if } (p, q) \text{ is an atom-atom or atom-vn interaction}, \\ 0, & \text{otherwise}. \end{cases}$$

Notably, $M$ filters out attention interactions that have no counterpart in $A^{T}$, and also serves as the attention bias mask in the following attention calculations. Then we multiply $B$ by $M$ and reduce its size to fit $A^{T}$ (the removed elements are all zeros, according to the definition of $M$) in

$$\tilde{B} = \mathrm{Reduce}(B \odot M) \in \mathbb{R}^{h_S \times m \times m},$$

where $M$ is broadcast in the element-wise product. Finally, by applying the softmax function on the last dimension of $\tilde{B}$, a weight distribution that matches $A^{T}$ is predicted from the attention biases. By penalizing the distance between the computed distribution and the masked teacher weights $\tilde{A}^{T}$, the attention bias module is able to learn the knowledge inside the teacher attention weights. We formulate the attention weight distillation loss as

$$\mathcal{L}_{\mathrm{attn}}^{(i,j)} = \mathrm{MSE}\left(W_a\, \mathrm{softmax}(\tilde{B}),\; \tilde{A}^{T}\right),$$

where $W_a$ is an optional learnable linear projection over the head dimension, used when the teacher and student have different numbers of attention heads. Unexpectedly, in experiments we find that the simple mean square loss function is the best choice for model convergence, rather than cross entropy or KL-divergence.
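The steps above can be sketched in plain Python as follows. This is a simplified illustration, not the authors' implementation: it assumes the teacher and student have the same number of heads (so the optional head projection is omitted), that the kept student tokens appear in the same order as the teacher tokens, and it drops masked rows and columns directly instead of multiplying by the mask first. All function names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def reduce_bias(bias, kinds):
    """Keep only rows and columns of atom and virtual tokens (drop bonds)."""
    keep = [p for p, k in enumerate(kinds) if k in ("virtual", "atom")]
    return [[bias[p][q] for q in keep] for p in keep]


def attention_distillation_loss(student_bias, teacher_attn, kinds):
    """MSE between softmax(reduced student biases) and teacher attention.

    student_bias: one (l x l) bias matrix per head.
    teacher_attn: one (m x m) attention matrix per head.
    kinds:        token kind ("virtual", "atom" or "bond") per student token.
    """
    total, count = 0.0, 0
    for bias, attn in zip(student_bias, teacher_attn):
        predicted = [softmax(row) for row in reduce_bias(bias, kinds)]
        for p_row, t_row in zip(predicted, attn):
            for p, t in zip(p_row, t_row):
                total += (p - t) ** 2
                count += 1
    return total / count


if __name__ == "__main__":
    kinds = ["virtual", "atom", "bond", "atom"]
    bias = [[[0.0] * 4 for _ in range(4)]]      # one head, all-zero biases
    teacher = [[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]]
    print(attention_distillation_loss(bias, teacher, kinds))
```

The bond token is removed before the softmax, so the predicted distribution has the same shape as the teacher's attention matrix over atoms and the virtual node.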

Final Loss Function

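A task-related loss is also needed during supervised training. Using the above objectives, we can unify the loss function of ST-KD as

$$\mathcal{L} = \mathcal{L}_{\mathrm{task}} + \alpha \sum_{(i,j) \in \mathcal{P}_f} \mathcal{L}_{\mathrm{feat}}^{(i,j)} + \beta \sum_{(i,j) \in \mathcal{P}_a} \mathcal{L}_{\mathrm{attn}}^{(i,j)},$$

where $\mathcal{L}_{\mathrm{task}}$ is the task-related loss, $\mathcal{L}_{\mathrm{feat}}^{(i,j)}$ and $\mathcal{L}_{\mathrm{attn}}^{(i,j)}$ are the feature and attention weight distillation losses for the GT-ST layer pair $(i, j)$, $\mathcal{P}_f$ and $\mathcal{P}_a$ are the sets of all GT-ST layer pairs for feature and attention weight distillation, and $\alpha$ and $\beta$ are the weights of the feature distillation loss and the attention weight distillation loss.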

4 Experiments

We first conduct knowledge distillation experiments on the recent PCQM4M-LSC dataset (Hu et al., 2021) for quantum chemistry regression, which is the latest large-scale standard dataset for molecular property prediction and contains more than 3.8M molecules. Then we investigate the transfer learning capacity of ST-KD on several molecular property prediction tasks, with models pretrained by knowledge distillation in the previous step. Finally, we perform tests to support our claim on the efficiency superiority of SMILES-based models over graph-based models, and run ablation studies to further inspect each component of ST-KD. A detailed description of the datasets can be found in the Appendix.


4.1 Knowledge Distillation on PCQM4M-LSC


For graph-based methods, we compare ST-KD with GCN (Kipf and Welling, 2016) and GIN (Xu et al., 2018), together with their variants that attach a virtual node to improve graph property prediction performance (Gilmer et al., 2017). They are standard baselines and achieve good performance on the official leaderboard. In terms of SMILES-based models, we test MLP models over the ECFP fingerprint, which is derived using a variant of the Morgan algorithm (Morgan, 1965), and a SMILES Transformer, ST-base, that uses a standard SMILES tokenizer and sinusoidal positional encodings. We also apply data augmentation to ST-base by adding 5 random equivalent SMILES for every molecule in the training set.

ST-KD Settings

We pick Graphormer (Ying et al., 2021) as the teacher model since it is built upon the Transformer architecture and achieves the current state-of-the-art performance on the PCQM4M dataset. We have reproduced the training process of Graphormer on the PCQM4M dataset with the official code and report our results (training is performed on 2 NVIDIA RTX 3090 GPUs for 2 days with batch size 1024). Graphormer has 12 Transformer layers with hidden dimension 768 and 32 attention heads.

We build the SMILES Transformer with 3 Transformer layers followed by 3 biased Transformer layers. We set the hidden dimension to 512, number of attention heads to 16, dimension of FFN layers to . For distillation layer mappings, we perform feature distillation and attention weight distillation simultaneously from the last three layers of Graphormer to the three biased Transformer layers, i.e. the set of all GT-ST distillation layer pairs is . For loss weights, we set and . All models are trained on 1 or 2 NVIDIA RTX 3090 GPUs for up to 2 days. A complete description of training details is available in Appendix A.2.1.

| Category | Model | #Params | Train | Validation |
|---|---|---|---|---|
| Graph-based | GCN | 2.0M | 0.1522 | 0.1702 (0.1684*) |
| Graph-based | GIN | 3.8M | 0.1379 | 0.1533 (0.1510*) |
| Graph-based | GCN-virtual | 4.9M | 0.1317 | 0.1542 (0.1536*) |
| Graph-based | GIN-virtual | 6.7M | 0.1206 | 0.1421 (0.1396*) |
| Graph-based | Graphormer | 47.1M | 0.0590 (0.0582*) | 0.1274 (0.1234*) |
| SMILES-based | MLP-ECFP | 15.8M | 0.1032 | 0.2113 |
| SMILES-based | ST-base | 19.1M | 0.1483 | 0.2031 |
| SMILES-based | ST-KD | 20.4M | 0.0877 | 0.1379 |

Table 1: Results on PCQM4M-LSC measured by MAE. * indicates results cited from the OGB-LSC paper (Hu et al., 2021) and the Graphormer paper (Ying et al., 2021).

Table 1 summarizes model performance on the PCQM4M-LSC dataset. ST-KD surpasses all previous SMILES-based models by a very large margin and performs better than the best GNN baseline, GIN-virtual, which shows that the performance gap between SMILES-based models and graph-based models can be overcome with the proposed strategies. Moreover, the validation MAE of the SMILES Transformer improves substantially, from 0.2062 to 0.1379, when augmented with the knowledge distillation and SMILES preprocessing techniques we propose. Ablation studies can be found in Section 4.4. ST-KD still has a long way to go before catching up with its Graphormer teacher, but the progress made so far shows the potential of SMILES-based models.

4.2 Molecular Property Prediction

In this section, we evaluate performances of ST-KD on popular molecular property prediction datasets QM9, QM8, QM7, FreeSolv and HIV in MoleculeNet (Wu et al., 2018). We mainly explore the transfer learning capacity of ST-KD by fine-tuning the model trained by knowledge distillation on PCQM4M-LSC in the previous step.


In this section, we benchmark ST-KD with GCN-virtual, GIN-virtual, Graphormer, MLP-ECFP and ST-base mentioned above in Section 4.1, along with MoleculeNet (best performances collected in (Wu et al., 2018)), MoLFormer-XL (Ross et al., 2021) and AttentiveFP (Xiong et al., 2019), which is a graph-based deep model that generates molecular fingerprints with local and global attentive layers. For Graphormer, we initialize its weights with the final checkpoint from PCQM4M-LSC training in Section 4.1.


For the dataset split, we split each dataset randomly into training, validation and test sets with an 8:1:1 ratio if no predefined split is available. We train every model 3 times with different random seeds, and report the means and standard deviations of the results. For ST-KD, on all datasets, we use the same model hyperparameters as in Section 4.1, and model weights are initialized with the checkpoint that achieved the best validation performance during distillation on PCQM4M-LSC. Appendix A.2.2 presents detailed training procedures. Results are reported in Table 2.
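The evaluation protocol described above can be sketched with the standard library. This is an illustrative sketch only: the helper names are hypothetical, the scores in the usage example are made-up placeholders rather than results from the paper, and the use of the sample standard deviation is an assumption, since the text does not say which estimator is reported.

```python
import random
import statistics


def random_split(n_items, seed):
    """Shuffle indices with a fixed seed and split them 8:1:1."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_train = n_items * 8 // 10
    n_valid = n_items // 10
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]
    return train, valid, test


def summarize(scores):
    """Mean and sample standard deviation over runs (assumed estimator)."""
    return statistics.mean(scores), statistics.stdev(scores)


if __name__ == "__main__":
    train, valid, test = random_split(1000, seed=0)
    print(len(train), len(valid), len(test))
    # Placeholder scores from three hypothetical runs with different seeds.
    print(summarize([1.0, 2.0, 3.0]))
```

Fixing the seed of the shuffle keeps the split reproducible, while the training seeds are varied across the three runs.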

| Model | QM9 | QM8 | QM7 | FreeSolv | HIV |
|---|---|---|---|---|---|
| GCN-virtual | 1.3682 ± 0.0482 | 0.0123 ± 0.0004 | 74.081 ± 2.182 | 1.013 ± 0.069 | 75.99 ± 1.0 |
| GIN-virtual | 1.2248 ± 0.0554 | 0.0111 ± 0.0004 | 69.682 ± 1.718 | 0.852 ± 0.075 | 77.07 ± 0.8 |
| MoleculeNet | 2.350* | 0.0150 ± 0.0020* | 94.7 ± 2.7* | 1.150* | 79.20* |
| AttentiveFP | 1.292 | 0.0130 ± 0.0006 | 66.2 ± 2.713 | 0.962 ± 0.197 | 79.10 ± 1.2 |
| Graphormer | 0.9168 ± 0.0301 | 0.0091 ± 0.0002 | 44.342 ± 1.419 | 0.860 ± 0.082 | 80.51* |
| MLP-ECFP | 1.2021 ± 0.0573 | 0.0129 ± 0.0004 | 65.845 ± 1.321 | 0.995 ± 0.076 | 74.15 ± 1.1 |
| MoLFormer-XL | 0.6804* | 0.0102* | - | 0.2308* | 82.2* |
| ST-base | 1.1862 ± 0.0510 | 0.0169 ± 0.0006 | 52.348 ± 3.211 | 1.071 ± 0.089 | 72.10 ± 1.4 |
| ST-KD | 0.6215 ± 0.0117 | 0.0080 ± 0.0002 | 43.373 ± 1.770 | 0.895 ± 0.065 | 80.32 ± 0.7 |

Table 2: Results on molecular property prediction datasets (mean ± standard deviation). * indicates results cited from (Wu et al., 2018; Ross et al., 2021; Ying et al., 2021). Higher is better for HIV; lower is better for the other datasets. Additional information on ST-KD’s performance on QM9 can be found in Appendix A.2.3.

ST-KD shows outstanding performance on the QM datasets, where it is state-of-the-art and outperforms both Graphormer and the pretrained MoLFormer-XL. On FreeSolv and HIV, MoLFormer-XL shows leading performance due to large-scale pretraining, while ST-KD still delivers results competitive with strong graph-based baselines. Compared to ST-base, ST-KD gains a significant performance improvement on all downstream molecular property prediction tasks, which confirms the effectiveness of our proposed distillation method. Despite being outperformed by the Graphormer teacher on PCQM4M-LSC, on downstream tasks ST-KD performs better on the QM datasets and slightly worse on FreeSolv and HIV, indicating that SMILES-based models can fit certain tasks better than graph-based models. MoLFormer-XL (Ross et al., 2021) requires pretraining on 16 NVIDIA V100 GPUs for 208 hours with a parameter size of at least 81M (no MoLFormer-XL code is available; we estimate this given that MoLFormer-XL has 12 layers with hidden dimension 768, and we assume parameters are stored as 32-bit floats). In comparison, ST-KD is lightweight (20.4M, 32-bit float), efficient to train (2 NVIDIA RTX 3090 GPUs for 80 hours, including training Graphormer), and achieves better results on QM8 and QM9. This demonstrates that knowledge distillation is an effective and efficient way to raise the performance of SMILES-based models to the state-of-the-art level, in contrast to large-scale pretraining.

4.3 Efficiency Tests


We follow the OGB-LSC official guide for measuring total SMILES-to-target inference time and apply all related parameters. To replicate a real-world scenario, we run tests on PCQM4M-LSC and on a ZINC subset containing 1M molecules sampled from the public ZINC15 database (Sterling and Irwin, 2015), which represents real molecular data in drug discovery. All models take raw SMILES strings as input, and their inference time is evaluated with warm-up on a single NVIDIA GeForce RTX 3090 GPU and an Intel(R) Xeon(R) Platinum 8260C CPU @ 2.30GHz with 256G RAM. All models are taken directly from Section 4.1. The inference time is measured in milliseconds (ms) per molecule. We report the mean of 3 runs in Table 3.
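A measurement of this kind (warm-up, several timed passes, milliseconds per molecule) can be sketched as below. This is not the OGB-LSC script: `ms_per_molecule` and its parameters are hypothetical, `predict` stands for any callable that maps one SMILES string to a prediction, and a real GPU benchmark would additionally batch the inputs and synchronize the device before reading the clock.

```python
import time


def ms_per_molecule(predict, smiles_list, warmup=100, runs=3):
    """Mean wall-clock time per molecule in milliseconds over several runs."""
    # Warm-up passes are excluded from timing (caches, lazy initialization).
    for smiles in smiles_list[:warmup]:
        predict(smiles)

    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        for smiles in smiles_list:
            predict(smiles)
        elapsed = time.perf_counter() - start
        timings.append(1000.0 * elapsed / len(smiles_list))
    return sum(timings) / len(timings)


if __name__ == "__main__":
    # Stand-in "model": the length of the string, used only to exercise the timer.
    print(ms_per_molecule(len, ["CCO", "c1ccccc1C"] * 500))
```

Timing the whole SMILES-to-target path, rather than the forward pass alone, is what exposes any preprocessing cost such as a SMILES-to-graph conversion.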

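| Category | Model | PCQM4M-LSC, graph input | PCQM4M-LSC, SMILES input | ZINC subset, graph input | ZINC subset, SMILES input |
|---|---|---|---|---|---|
| Graph-based | GCN-virtual | 0.121 | 0.627 | 0.204 | 0.906 |
| Graph-based | GIN-virtual | 0.118 | 0.627 | 0.209 | 0.913 |
| Graph-based | Graphormer | 0.399 | 2.795 | 0.573 | 4.035 |
| SMILES-based | MLP-ECFP | - | 0.352 | - | 0.467 |
| SMILES-based | ST-base | - | 0.149 | - | 0.230 |
| SMILES-based | ST-KD | - | 0.192 | - | 0.271 |

Table 3: Results of inference time tests, measured in ms per molecule.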

Table 3 summarizes the inference time of all models. ST-KD reaches a higher inference speed from raw SMILES strings than graph-based models, which supports our claim. Compared to ST-base, ST-KD runs somewhat slower because of the additional SMILES preprocessing step and attention bias computations, but it performs significantly better. The gap between the graph-input and SMILES-input times of the graph-based models in Table 3 indicates that the SMILES-to-graph conversion is a serious efficiency bottleneck for them, as mentioned in Section 1.
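The speed-ups implied by Table 3 can be recomputed directly from its entries. The sketch below uses only the numbers reported in the table; the helper names are hypothetical, and attributing the whole difference between the graph-input and SMILES-input times to the conversion step is a simplifying assumption.

```python
# Inference time in ms per molecule, copied from Table 3.
smiles_input_ms = {
    "PCQM4M-LSC": {"GCN-virtual": 0.627, "GIN-virtual": 0.627,
                   "Graphormer": 2.795, "ST-KD": 0.192},
    "ZINC subset": {"GCN-virtual": 0.906, "GIN-virtual": 0.913,
                    "Graphormer": 4.035, "ST-KD": 0.271},
}
graph_input_ms = {
    "PCQM4M-LSC": {"GCN-virtual": 0.121, "GIN-virtual": 0.118, "Graphormer": 0.399},
    "ZINC subset": {"GCN-virtual": 0.204, "GIN-virtual": 0.209, "Graphormer": 0.573},
}


def speedup_over(dataset, model):
    """How many times faster ST-KD is than `model` from raw SMILES."""
    times = smiles_input_ms[dataset]
    return times[model] / times["ST-KD"]


def conversion_share(dataset, model):
    """Fraction of SMILES-input time not explained by graph-input time.

    Assumption: the difference is dominated by SMILES-to-graph conversion.
    """
    return 1.0 - graph_input_ms[dataset][model] / smiles_input_ms[dataset][model]


if __name__ == "__main__":
    for dataset in smiles_input_ms:
        for model in graph_input_ms[dataset]:
            print(dataset, model,
                  f"speed-up {speedup_over(dataset, model):.2f}x,",
                  f"conversion share {conversion_share(dataset, model):.0%}")
```

On PCQM4M-LSC this gives speed-ups of roughly 3.3× over the GNN baselines and 14.6× over Graphormer, in line with the 3-14× range quoted in the abstract, and under the stated assumption the conversion accounts for about 80% or more of the graph-based models' SMILES-input time.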

Comparison with SMILES Pretrain Models

We are unable to run inference time tests for most pretrained SMILES models since their code is not publicly available, but we can make reasonable conjectures about their inference speed from the model hyperparameters. Considering MoLFormer-XL, which has 12 attention layers with hidden dimension 768, we can reasonably expect it to run at least 2-3 times slower than ST-KD (6 layers with hidden dimension 512) at inference. This suggests that ST-KD is also more efficient than pretrained SMILES models.

4.4 Ablation Studies

We run a series of ablation studies on the knowledge distillation experiments with the PCQM4M-LSC dataset to examine the importance of each component of the ST-KD architecture. Other model hyperparameter settings and training procedures stay the same as in Section 4.1. Table 4 shows that all techniques we have plugged into ST-KD are necessary to raise the performance of the SMILES Transformer to its best level. With the SMILES pre-transformation, the Transformer gains performance, since the input sequences now consist of well-structured tokens and meaningful input embeddings. As for knowledge distillation, both types of distilled knowledge contribute to the teaching process. The performance of ST-KD is best only with the combination of feature distillation and attention weight distillation.

SMILES Pre-transformation Distillation Validation Perf.
Feature Transfer Attention Transfer MAE
- - - 0.2062
- - 0.1808
- 0.1672
- 0.1589
Table 4: Ablation studies of ST-KD on PCQM4M-LSC dataset with different components removed.

5 Conclusion

We have proposed ST-KD, a novel SMILES Transformer architecture empowered by knowledge distillation, and shown its competitive results on benchmarks. As high-throughput and high-performance molecular feature extractors, SMILES-based models exhibit significant practical potential in handling large-scale molecular data. Although the initial results are encouraging, challenges remain. Our distillation approach cannot be applied without a well-trained graph model as the teacher. In the distillation experiments, Graphormer still outperforms its student by a clear margin, so the results of ST-KD can be further improved with a more effective knowledge transfer strategy. We leave these directions for future work.


  • S. Chithrananda, G. Grand, and B. Ramsundar (2020) Chemberta: large-scale self-supervised pretraining for molecular property prediction. arXiv preprint arXiv:2010.09885. Cited by: §1, §2.
  • H. Cho and I. S. Choi (2019) Enhanced deep-learning prediction of molecular properties via augmentation of bond topology. ChemMedChem 14 (17), pp. 1604–1609. Cited by: §2.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §2, §2.
  • J. Dong, M. Rondeau, and W. L. Hamilton (2020) Distilling structured knowledge for text-based relational reasoning. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 6782–6791. Cited by: §2.
  • D. Duvenaud, D. Maclaurin, J. Aguilera-Iparraguirre, R. Gómez-Bombarelli, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams (2015) Convolutional networks on graphs for learning molecular fingerprints. arXiv preprint arXiv:1509.09292. Cited by: §2.
  • V. P. Dwivedi and X. Bresson (2020) A generalization of transformer networks to graphs. arXiv preprint arXiv:2012.09699. Cited by: §3.1.
  • F. Gentile, V. Agrawal, M. Hsing, A. Ton, F. Ban, U. Norinder, M. E. Gleave, and A. Cherkasov (2020) Deep docking: a deep learning platform for augmentation of structure based drug discovery. ACS central science 6 (6), pp. 939–949. Cited by: §1.
  • J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl (2017) Neural message passing for quantum chemistry. In International conference on machine learning, pp. 1263–1272. Cited by: §1, §2, §4.1.
  • R. C. Glen, A. Bender, C. H. Arnby, L. Carlsson, S. Boyer, and J. Smith (2006) Circular fingerprints: flexible molecular descriptors with applications from physical chemistry to adme. IDrugs 9 (3), pp. 199. Cited by: §2.
  • S. Honda, S. Shi, and H. R. Ueda (2019) Smiles transformer: pre-trained molecular fingerprint for low data drug discovery. arXiv preprint arXiv:1911.04738. Cited by: §1, §2.
  • W. Hu, M. Fey, H. Ren, M. Nakata, Y. Dong, and J. Leskovec (2021) Ogb-lsc: a large-scale challenge for machine learning on graphs. arXiv preprint arXiv:2103.09430. Cited by: §A.1, §2, Table 1, §4.
  • W. Hu, M. Fey, M. Zitnik, Y. Dong, H. Ren, B. Liu, M. Catasta, and J. Leskovec (2020) Open graph benchmark: datasets for machine learning on graphs. arXiv preprint arXiv:2005.00687. Cited by: §1.
  • Y. Hu, E. Lounkine, and J. Bajorath (2009) Improving the search performance of extended connectivity fingerprints through activity-oriented feature filtering and application of a bit-density-dependent similarity function. ChemMedChem: Chemistry Enabling Drug Discovery 4 (4), pp. 540–548. Cited by: §2.
  • J. J. Irwin and B. K. Shoichet (2005) ZINC- a free database of commercially available compounds for virtual screening. Journal of chemical information and modeling 45 (1), pp. 177–182. Cited by: §1.
  • R. Irwin, S. Dimitriadis, J. He, and E. Bjerrum (2021) Chemformer: a pre-trained transformer for computational chemistry. Cited by: §2.
  • X. Jiao, Y. Yin, L. Shang, X. Jiang, X. Chen, L. Li, F. Wang, and Q. Liu (2019) Tinybert: distilling bert for natural language understanding. arXiv preprint arXiv:1909.10351. Cited by: §2, §3.3.
  • T. N. Kipf and M. Welling (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: §1, §4.1.
  • J. Klicpera, J. Groß, and S. Günnemann (2020) Directional message passing for molecular graphs. arXiv preprint arXiv:2003.03123. Cited by: §2.
  • Z. Li, S. Yang, G. Song, and L. Cai (2021) HamNet: conformation-guided molecular representation with hamiltonian neural networks. arXiv preprint arXiv:2105.03688. Cited by: §1, §2.
  • H. L. Morgan (1965) The generation of a unique machine description for chemical structures-a technique developed at chemical abstracts service.. Journal of Chemical Documentation 5 (2), pp. 107–113. Cited by: §4.1.
  • M. Nakata and T. Shimazaki (2017) PubChemQC project: a large-scale first-principles electronic structure database for data-driven chemistry. Journal of chemical information and modeling 57 (6), pp. 1300–1308. Cited by: §A.1.
  • N. M. O’Boyle, C. M. Campbell, and G. R. Hutchison (2011) Computational design and selection of optimal organic photovoltaic materials. The Journal of Physical Chemistry C 115 (32), pp. 16200–16210. Cited by: §2.
  • B. Ramsundar, P. Eastman, P. Walters, V. Pande, K. Leswing, and Z. Wu (2019) Deep learning for the life sciences. O’Reilly Media. Cited by: §A.1.
  • D. Rogers and M. Hahn (2010) Extended-connectivity fingerprints. Journal of chemical information and modeling 50 (5), pp. 742–754. Cited by: §1, §2.
  • Y. Rong, Y. Bian, T. Xu, W. Xie, Y. Wei, W. Huang, and J. Huang (2020) Self-supervised graph transformer on large-scale molecular data. arXiv preprint arXiv:2007.02835. Cited by: §1, §2.
  • J. Ross, B. Belgodere, V. Chenthamarakshan, I. Padhi, Y. Mroueh, and P. Das (2021) Do large scale molecular language representations capture important structural information?. arXiv preprint arXiv:2106.09553. Cited by: §1, §2, §4.2, §4.2, Table 2.
  • P. Schwaller, T. Laino, T. Gaudin, P. Bolgar, C. A. Hunter, C. Bekas, and A. A. Lee (2019) Molecular transformer: a model for uncertainty-calibrated chemical reaction prediction. ACS central science 5 (9), pp. 1572–1583. Cited by: §2.
  • Y. Song, S. Zheng, Z. Niu, Z. Fu, Y. Lu, and Y. Yang (2020) Communicative representation learning on attributed molecular graphs.. In IJCAI, pp. 2831–2838. Cited by: §2.
  • T. Sterling and J. J. Irwin (2015) ZINC 15–ligand discovery for everyone. Journal of chemical information and modeling 55 (11), pp. 2324–2337. Cited by: §4.3.
  • G. A. Stevenson, D. Jones, H. Kim, W. Bennett, B. J. Bennion, M. Borucki, F. Bourguet, A. Epstein, M. Franco, B. Harmon, et al. (2021) High-throughput virtual screening of small molecule inhibitors for sars-cov-2 protein targets with deep fusion models. arXiv preprint arXiv:2104.04547. Cited by: §1.
  • S. Sun, Y. Cheng, Z. Gan, and J. Liu (2019a) Patient knowledge distillation for bert model compression. arXiv preprint arXiv:1908.09355. Cited by: §2.
  • Z. Sun, H. Yu, X. Song, R. Liu, Y. Yang, and D. Zhou (2019b) Mobilebert: task-agnostic compression of bert by progressive knowledge transfer. Cited by: §3.3.
  • Z. Sun, H. Yu, X. Song, R. Liu, Y. Yang, and D. Zhou (2020) Mobilebert: a compact task-agnostic bert for resource-limited devices. arXiv preprint arXiv:2004.02984. Cited by: §2.
  • Y. Tian, D. Krishnan, and P. Isola (2019) Contrastive representation distillation. arXiv preprint arXiv:1910.10699. Cited by: §2.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §2, §2.
  • S. Wang, Y. Guo, Y. Wang, H. Sun, and J. Huang (2019) SMILES-bert: large scale unsupervised pre-training for molecular property prediction. In Proceedings of the 10th ACM international conference on bioinformatics, computational biology and health informatics, pp. 429–436. Cited by: §1, §2.
  • W. Wang, F. Wei, L. Dong, H. Bao, N. Yang, and M. Zhou (2020) Minilm: deep self-attention distillation for task-agnostic compression of pre-trained transformers. arXiv preprint arXiv:2002.10957. Cited by: §2, §3.3.
  • D. Weininger, A. Weininger, and J. L. Weininger (1989) SMILES. 2. algorithm for generation of unique smiles notation. Journal of chemical information and computer sciences 29 (2), pp. 97–101. Cited by: §2.
  • D. Weininger (1988) SMILES, a chemical language and information system. 1. introduction to methodology and encoding rules. Journal of chemical information and computer sciences 28 (1), pp. 31–36. Cited by: §1, §2.
  • D. Weininger (1990) SMILES. 3. depict. graphical depiction of chemical structures. Journal of chemical information and computer sciences 30 (3), pp. 237–243. Cited by: §1, §2.
  • Z. Wu, B. Ramsundar, E. N. Feinberg, J. Gomes, C. Geniesse, A. S. Pappu, K. Leswing, and V. Pande (2018) MoleculeNet: a benchmark for molecular machine learning. Chemical science 9 (2), pp. 513–530. Cited by: §1, §4.2, §4.2, Table 2.
  • Z. Xiong, D. Wang, X. Liu, F. Zhong, X. Wan, X. Li, Z. Li, X. Luo, K. Chen, H. Jiang, et al. (2019) Pushing the boundaries of molecular representation for drug discovery with the graph attention mechanism. Journal of medicinal chemistry 63 (16), pp. 8749–8760. Cited by: §1, §2, §4.2.
  • K. Xu, W. Hu, J. Leskovec, and S. Jegelka (2018) How powerful are graph neural networks?. arXiv preprint arXiv:1810.00826. Cited by: §4.1.
  • M. Xu, S. Luo, Y. Bengio, J. Peng, and J. Tang (2021) Learning neural generative dynamics for molecular conformation generation. arXiv preprint arXiv:2102.10240. Cited by: §1.
  • Z. Xu, S. Wang, F. Zhu, and J. Huang (2017) Seq2seq fingerprint: an unsupervised deep molecular embedding for drug discovery. In Proceedings of the 8th ACM international conference on bioinformatics, computational biology, and health informatics, pp. 285–294. Cited by: §2.
  • C. Ying, T. Cai, S. Luo, S. Zheng, G. Ke, D. He, Y. Shen, and T. Liu (2021) Do transformers really perform bad for graph representation?. arXiv preprint arXiv:2106.05234. Cited by: §1, §1, §2, §3.1, §4.1, Table 1, Table 2.
  • X. Zhang, S. Wang, F. Zhu, Z. Xu, Y. Wang, and J. Huang (2018) Seq3seq fingerprint: towards end-to-end semi-supervised deep drug discovery. In Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, pp. 404–413. Cited by: §2.

Appendix A Appendix

A.1 Datasets

Statistics of the datasets used in this paper are shown in Table 5. PCQM4M-LSC is a large-scale graph property prediction dataset from the recent OGB Large-Scale Challenge (Hu et al., 2021), originally curated under the PubChemQC project (Nakata and Shimazaki, 2017). The task of PCQM4M-LSC is to predict the HOMO-LUMO gap of molecules, calculated by DFT (Density Functional Theory), from their 2D structure. Test labels of PCQM4M-LSC are currently unavailable, so we use the validation set to evaluate model performance. QM9, QM8 and QM7 are datasets containing organic molecules with up to 9, 8 and 7 heavy atoms (any atom that is not hydrogen), respectively, together with their quantum-chemical properties. For QM7, we use a sanitized version provided by DeepChem (Ramsundar et al., 2019). FreeSolv contains molecules with their corresponding hydration free energies. HIV contains molecules labeled as active or inactive against HIV, with a predefined scaffold data split.

Datasets PCQM4M-LSC QM9 QM8 QM7 FreeSolv HIV
#molecules 3803453 133885 21786 6834 642 41127
#tasks 1 12 16 1 1 1
Task Type Regression Regression Regression Regression Regression Classification
Table 5: Statistics of Datasets.

A.2 Experiment Details

A.2.1 Knowledge Distillation on PCQM4M-LSC

All parameters involved are listed in Table 6. In our experiments we find that adding an initial stage without task-related supervision leads to faster convergence. The parameters of this initial stage are also shown.

Parameter Value
#Transformer Layers 3
#Biased Transformer Layers 3
Hidden Dimension 512
Key Dimension of Attention Block 32
Feed-Forward Network Dimension 2048
#Attention Heads 16
Dropout Value 0.1
Feature Distillation GT-ST Layer Pairs {(9,3), (10,4), (11,5)}
Attention Distillation GT-ST Layer Pairs {(9,3), (10,4), (11,5)}
Task Loss Weight 1.0
Feature Distillation Loss Weight 0.5
Attention Distillation Loss Weight 2.0
Initial Stage Epochs
Initial Stage Task Loss Weight 0.0
Initial Stage Feature Distillation Loss Weight 1.0
Initial Stage Attention Distillation Loss Weight 4.0
Optimizer AdamW
Max Epochs 200
Batch Size 256
Learning Rate 1e-4
Weight Decay 1e-5
ε of AdamW 1e-8
(β1, β2) of AdamW (0.9, 0.999)
Table 6: Parameter settings of knowledge distillation experiment on PCQM4M-LSC.
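The two-stage weighting in Table 6 can be sketched as follows. This is a minimal illustration that assumes the three loss terms are combined as a plain weighted sum and that the initial stage covers the first epochs of training. The names `LossWeights`, `weights_for_epoch`, and `total_loss` are hypothetical, and the number of initial-stage epochs is passed in as a parameter because its value is not given above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LossWeights:
    task: float
    feature: float
    attention: float


# Loss weights taken from Table 6.
INITIAL_STAGE = LossWeights(task=0.0, feature=1.0, attention=4.0)
MAIN_STAGE = LossWeights(task=1.0, feature=0.5, attention=2.0)


def weights_for_epoch(epoch: int, initial_stage_epochs: int) -> LossWeights:
    """Select the loss weights for a zero-based epoch index."""
    return INITIAL_STAGE if epoch < initial_stage_epochs else MAIN_STAGE


def total_loss(task_loss: float, feature_loss: float, attention_loss: float,
               weights: LossWeights) -> float:
    """Weighted sum of the task, feature and attention distillation losses (assumed form)."""
    return (weights.task * task_loss
            + weights.feature * feature_loss
            + weights.attention * attention_loss)


# Placeholder value for illustration only; the actual epoch count is not stated here.
HYPOTHETICAL_INITIAL_EPOCHS = 5
warmup = total_loss(1.0, 2.0, 3.0, weights_for_epoch(0, HYPOTHETICAL_INITIAL_EPOCHS))
main = total_loss(1.0, 2.0, 3.0, weights_for_epoch(5, HYPOTHETICAL_INITIAL_EPOCHS))
print(warmup, main)
```

With a task weight of 0.0, the initial stage trains the student on the teacher's features and attention maps only, which matches the description of a stage without task-related supervision.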

A.2.2 Molecular Property Prediction Experiments

For ST-KD, we load the checkpoint with the best validation performance saved during distillation as the initial weights for the molecular property prediction experiments in this section. In our experiments we find that results can benefit from changing the model dropout value during fine-tuning. All related parameters are listed in Tables 7 and 8.

Parameter Value
Optimizer AdamW
Max Epochs 100
Batch Size 80
Dropout Value 0.0
Learning Rate 1e-4
Weight Decay 1e-5
ε of AdamW 1e-8
(β1, β2) of AdamW (0.9, 0.999)
Table 7: Parameter settings of molecular property prediction experiments on QM9, QM8 and QM7.

Parameter Value
Optimizer AdamW
Max Epochs 100
Batch Size 50
Dropout Value 0.1
Learning Rate 1e-4
Weight Decay 1e-5
ε of AdamW 1e-8
(β1, β2) of AdamW (0.9, 0.999)
Table 8: Parameter settings of molecular property prediction experiments on FreeSolv and HIV.

A.2.3 Additional Information on Performances of ST-KD on QM9 Dataset

QM9 is a dataset with 12 regression targets. Following convention, we standardize the targets and fit them simultaneously. Here we report the test set performance of ST-KD on the 12 tasks separately.

Task MAE (Standardized) MAE
mu 0.2319±6.421e-3 0.3548±9.827e-3
alpha 0.03130±1.208e-3 0.2563±9.893e-3
homo 0.1082±8.654e-4 2.390e-3±1.912e-5
lumo 0.05173±7.760e-4 2.426e-3±3.640e-5
gap 0.06803±1.167e-3 3.232e-3±5.544e-5
r2 0.06301±1.061e-3 15.73±0.2651
zpve 0.01357±1.634e-3 4.518e-4±5.440e-5
u0 6.300e-3±2.160e-4 0.2534±8.654e-3
u298 6.233e-3±1.886e-4 0.2497±7.554e-3
h298 6.402e-3±5.657e-4 0.2564±2.266e-2
g298 6.567e-3±2.494e-4 0.2631±9.993e-3
cv 0.02820±7.118e-4 0.1146±2.892e-3
Table 9: Performances of ST-KD on QM9, reported by separate tasks.
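The two MAE columns in Table 9 are related by the scale of each target: if a target is standardized as z = (y - mean) / std, then the MAE in original units equals the standardized MAE multiplied by std. The sketch below illustrates this on toy numbers. The toy data and helper names are made up for illustration, and the use of the population standard deviation is an assumption, since the exact statistics used for standardization are not specified here.

```python
from statistics import mean, pstdev


def standardize(values):
    """Return z-scores together with the mean and standard deviation used."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values], mu, sigma


def mae(pred, target):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)


# Toy targets and predictions in original units (not data from the paper).
targets = [1.0, 2.0, 4.0, 7.0]
preds = [1.5, 1.5, 4.5, 6.0]

z_targets, mu, sigma = standardize(targets)
z_preds = [(p - mu) / sigma for p in preds]  # predictions in standardized space

mae_standardized = mae(z_preds, z_targets)
mae_original = mae(preds, targets)

# The two errors differ exactly by the target's standard deviation.
print(mae_original, mae_standardized * sigma)

# Scale implied by the "mu" row of Table 9 (ratio of the two reported MAEs).
implied_scale_mu = 0.3548 / 0.2319
print(f"implied scale for mu: {implied_scale_mu:.3f}")
```

The same ratio can be read off any row of Table 9, which explains why targets with small numerical ranges, such as homo and lumo, have original-unit MAEs far below their standardized MAEs.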