Targeting SARS-CoV-2 with AI- and HPC-enabled Lead Generation: A First Data Release

by Yadu Babuji, et al.

Researchers across the globe are seeking to rapidly repurpose existing drugs or discover new drugs to counter the novel coronavirus disease (COVID-19) caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). One promising approach is to train machine learning (ML) and artificial intelligence (AI) tools to screen large numbers of small molecules. As a contribution to that effort, we are aggregating numerous small molecules from a variety of sources, using high-performance computing (HPC) to compute diverse properties of those molecules, using the computed properties to train ML/AI models, and then using the resulting models for screening. In this first data release, we make available 23 datasets collected from community sources, representing over 4.2 B molecules and enriched with pre-computed: 1) molecular fingerprints to aid similarity searches, 2) 2D images of molecules to enable exploration and application of image-based deep learning methods, and 3) 2D and 3D molecular descriptors to speed development of machine learning models. This data release encompasses structural information on the 4.2 B molecules and 60 TB of pre-computed data. Future releases will expand the data to include more detailed molecular simulations, computed models, and other products.


1 Introduction

The Coronavirus Disease (COVID-19) pandemic, caused by transmissible infection of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) virus [zhou2020network, sheahan2020orally, Heiser2020.04.21.054387, Gordon2020], has resulted in millions of diagnosed cases and a large number of deaths worldwide [jhu-map], straining healthcare systems and disrupting key aspects of society and the wider economy. To save lives and reduce societal effects, it is important to rapidly find effective treatments through drug discovery and repurposing efforts.

Here, we describe a public data release of 23 molecular datasets collected from community sources or created internally, representing over 4.2 B molecules. In addition to collecting the datasets from heterogeneous locations and making them available through a unified interface, we have enriched the datasets with additional context that would be difficult for many researchers to compute without access to significant HPC resources. For example, these data now include the 2D and 3D molecular descriptors, computed molecular fingerprints, 2D images representing the molecule, and canonical simplified molecular-input line-entry system (SMILES) [weininger1989smiles] structural representations to speed development of machine learning models.

This data release encompasses information on the 4.2 B molecules and 60 TB of additional data. We intend to supplement this dataset in future releases with more datasets, further enrichments, tools to extract potential drugs from natural language text, and machine learning models to sift the best candidates for protein docking simulations from the billions of available molecules. In the following, we first describe the datasets collected, then the methodology used to generate the enriched datasets, and finally discuss future directions.

2 Collected Datasets

We have collected molecules from the datasets listed in Table 1, each of which has either been made available online by others or generated by our group. The collected datasets include some specifically collected for drug design (e.g., Enamine), known drug databases (e.g., DrugBank [DBK-article, DBK-web], DrugCentral [DCL-article, DCL-web], CureFFI [FFI-web]), antiviral collections (e.g., CAS COVID-19 Antiviral Candidate Compounds [CAS-web] and the Lit COVID-19 dataset [lit-db]), others that provide known decoys (DUDE database of useful decoys), and further counterexamples including molecules used in other domains (e.g., QM9 [QM9-article, QM9-web], Harvard Organic Photovoltaic Dataset [HOP-article, HOP-web]). By aggregating these diverse datasets, including the decoys and counterexamples, we aim to allow researchers maximal freedom to create training sets for specific use cases. Future releases will include additional data relevant to SARS-CoV-2 research.

Key Name # Molecules
BDB The Binding Database [BDB-article, BDB-web]
CAS CAS COVID-19 Antiviral Candidate Compounds [CAS-web]
CHM ChEMBL database of bioactive molecules with drug-like properties
DBK DrugBank [DBK-article, DBK-web]
DCL DrugCentral Online Drug Compendium [DCL-article, DCL-web]
DUD DUDE database of useful decoys [DUD-article, DUD-web]
E15 Diverse REAL drug-like subset of ENA
EDB DrugBank plus Enamine Hit Locator Library 2018 [EDB-web]
EMO eMolecules [EMO-web]
ENA Enamine REAL Database [ENA-article, ENA-web]
FFI CureFFI FDA-approved drugs and CNS drugs [FFI-web]
G13 GDB-13 small organic molecules up to 13 atoms [G13-article, G13-web]
G17 GDB-17-Set up to 17 atom extension of GDB-13 [G17-article, G17-web]
HOP Harvard Organic Photovoltaic Dataset [HOP-article, HOP-web] 350
LIT COVID-relevant small mols extracted from literature [lit-db] 803
MOS Molecular Sets (MOSES) [MOS-article, MOS-web]
MCU MCULE compound database
PCH PubChem [PCH-article, PCH-web]
QM9 QM9 subset of GDB-17 [QM9-article, QM9-web]
REP Repurposing-related drug/tool compounds [REP-article, REP-web]
SAV Synthetically Accessible Virtual Inventory (SAVI) [SAV-article, SAV-web]
SUR SureChEMBL dataset of molecules from patents [SUR-article, SUR-web]
ZIN ZINC15 [ZIN-article, ZIN-web]
Table 1: The datasets included in the first data release. For each dataset, we give a key, a brief description with references to the original location, and the number of molecules. Some datasets are provided as decoys or as examples of molecules used in other domains.

3 Methodology and Data Processing Pipeline

The data processing pipeline is used to compute different types of features and representations of billions of small molecules. The pipeline is first used to convert the SMILES representation for each molecule to a canonical SMILES to allow for de-duplication and consistency across data sources. Next, for each molecule, three different types of features are computed: 1) molecular fingerprints that encode the structure of molecules; 2) 2D and 3D molecular descriptors; and 3) 2D images of the molecular structure. These features are being used as input to various machine learning and deep learning models that will be used to predict important characteristics of candidate molecules including docking scores, toxicity, and more.

Figure 1: The computational pipeline that is used to enrich the data collected from included datasets. After collection, each molecule in each dataset has canonical SMILES, 2D and 3D molecular features, fingerprints, and images computed. These enrichments simplify molecule disambiguation, ML-guided compound screening, similarity searching, and neural network training, respectively.

Term Description
SOURCE-KEY Identifies the source dataset: see the three-letter “Keys” in Table 1
IDENTIFIER A per-molecule identifier either obtained from the source dataset or, if none such is available, defined internally
SMILES A canonical SMILES for a molecule, as produced by Open Babel
Table 2: Definitions for terms used in the methodology section to describe key aspects of the collected datasets and computed properties.

3.1 Canonical Molecule Structures

We use Open Babel v3.0 [o2011open] to convert the SMILES specifications of chemical species obtained from various sources into a consistent canonical SMILES representation. We organize the resulting molecule specifications in one directory per source dataset, each containing one CSV file with columns SOURCE-KEY, IDENTIFIER, SMILES, as defined in Table 2: SOURCE-KEY identifies the source dataset; IDENTIFIER is an identifier either obtained from the source dataset or, if none is available, defined internally; and SMILES is a canonical SMILES as produced by Open Babel. Identifiers are unique within a dataset but may not be unique across datasets. Thus, the combination (SOURCE-KEY, IDENTIFIER) is needed to identify a molecule uniquely. We obtain the canonical SMILES by using the following Open Babel command:

obabel {input_filename} -O {output_filename} -ocan -e
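
For batch processing, the command above can be wrapped in a short script. The following is a minimal sketch and not the pipeline code used for this release: the helper names build_obabel_command and canonicalize_file are hypothetical, and the sketch assumes that the obabel executable is available on the PATH.

```python
import shutil
import subprocess
from typing import List


def build_obabel_command(input_filename: str, output_filename: str) -> List[str]:
    """Return the argument list for the canonical-SMILES conversion.

    -O names the output file, -ocan selects canonical SMILES output,
    and -e asks Open Babel to continue after errors where possible.
    """
    return ["obabel", input_filename, "-O", output_filename, "-ocan", "-e"]


def canonicalize_file(input_filename: str, output_filename: str) -> int:
    """Run the conversion and return the process exit code.

    Hypothetical helper: raises FileNotFoundError if obabel is not installed.
    """
    if shutil.which("obabel") is None:
        raise FileNotFoundError("obabel executable not found on PATH")
    completed = subprocess.run(
        build_obabel_command(input_filename, output_filename),
        capture_output=True,
        text=True,
        check=False,  # inspect the return code instead of raising
    )
    return completed.returncode


# Building the command does not require Open Babel to be installed.
command = build_obabel_command("molecules.smi", "molecules.can")
print(" ".join(command))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when file names contain spaces.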

3.2 Molecular Fingerprints

We use RDKit [landrum2013rdkit] (version 2019.09.3) to compute a 2048-bit fingerprint for each molecule. We organize these fingerprints in CSV files in which each row has columns SOURCE-KEY, IDENTIFIER, SMILES, FINGERPRINT, where SOURCE-KEY, IDENTIFIER, and SMILES are as defined in Table 2, and FINGERPRINT is a Base64-encoded representation of the fingerprint. In Figure 2, we show an example of how to load the fingerprint data from a batch file within an individual dataset using Python 3. Further examples of how to use fingerprints are available in the accompanying GitHub repository [covid-analyses-repo].

Figure 2: A simple Python code example showing how to load data from a fingerprint file. (This and other examples are accessible on GitHub [covid-analyses-repo].)
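
As a rough illustration of the loading step described above, the sketch below reads fingerprint rows, indexes them by the (SOURCE-KEY, IDENTIFIER) pair, and compares two fingerprints. It rests on assumptions that are not stated in this paper: the rows are taken to be header-less with exactly four columns, and the decoded Base64 payload is treated as raw packed bits (256 bytes for 2048 bits). The released files may use a different serialization, in which case only the decoding step would change. The function names and the synthetic demo rows are hypothetical.

```python
import base64
import csv
import io
from typing import Dict, Iterable, Tuple

Key = Tuple[str, str]  # (SOURCE-KEY, IDENTIFIER)


def load_fingerprints(lines: Iterable[str]) -> Dict[Key, bytes]:
    """Map (SOURCE-KEY, IDENTIFIER) to the Base64-decoded fingerprint bytes."""
    fingerprints = {}
    for source_key, identifier, _smiles, encoded in csv.reader(lines):
        fingerprints[(source_key, identifier)] = base64.b64decode(encoded)
    return fingerprints


def tanimoto(a: bytes, b: bytes) -> float:
    """Tanimoto similarity, ASSUMING both payloads are raw packed bits."""
    common = sum(bin(x & y).count("1") for x, y in zip(a, b))
    total = sum(bin(x | y).count("1") for x, y in zip(a, b))
    return common / total if total else 0.0


def _fake_fingerprint(on_bits) -> str:
    """Build a synthetic 2048-bit fingerprint for the demo only."""
    bits = bytearray(256)
    for i in on_bits:
        bits[i // 8] |= 1 << (i % 8)
    return base64.b64encode(bytes(bits)).decode("ascii")


# Synthetic two-row file standing in for a real fingerprint batch.
demo = io.StringIO()
writer = csv.writer(demo)
writer.writerow(["FFI", "mol-1", "CCO", _fake_fingerprint([1, 5, 9])])
writer.writerow(["FFI", "mol-2", "CCN", _fake_fingerprint([1, 5, 700])])
demo.seek(0)

fps = load_fingerprints(demo)
similarity = tanimoto(fps[("FFI", "mol-1")], fps[("FFI", "mol-2")])
print(len(fps), similarity)
```

Keying the dictionary on the pair rather than on IDENTIFIER alone reflects the fact that identifiers are unique only within a dataset.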

3.3 Molecular Descriptors

We generate molecular descriptors using Mordred [moriwaki2018mordred] (version 1.2.0). The collected descriptors (1800 for each molecule) include descriptors for both 2D and 3D molecular features. We organize these descriptors in one directory per source dataset, each containing one or more CSV files. Each row in the CSV file has columns SOURCE-KEY, IDENTIFIER, SMILES, DESCRIPTOR … DESCRIPTOR. In Figure 3, we show how to load the data for an individual dataset (e.g., FFI) using Python 3 and explore its shape (Figure 3-left), and create a TSNE embedding [maaten2008visualizing] to explore the molecular descriptor space (Figure 3-right).

Figure 3: Molecular descriptor examples: (left) load descriptor data and (right) create a simple TSNE projection of the FFI dataset.
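
The loading step can be sketched with the Python standard library alone. This is an illustrative sketch under stated assumptions, not the code shown in Figure 3: it assumes header-less rows in which the first three columns are SOURCE-KEY, IDENTIFIER, and SMILES, and it assumes that descriptors that could not be computed appear as empty or non-numeric cells, which are mapped to NaN. The function names and demo values are hypothetical.

```python
import csv
import io
import math
from typing import Iterable, List, Tuple


def parse_value(cell: str) -> float:
    """Convert one descriptor cell to float, using NaN for unusable cells."""
    try:
        return float(cell)
    except ValueError:
        return math.nan


def load_descriptors(
    lines: Iterable[str],
) -> Tuple[List[Tuple[str, str]], List[List[float]]]:
    """Return the (SOURCE-KEY, IDENTIFIER) keys and the descriptor matrix."""
    keys, matrix = [], []
    for row in csv.reader(lines):
        source_key, identifier, _smiles, *values = row
        keys.append((source_key, identifier))
        matrix.append([parse_value(v) for v in values])
    return keys, matrix


# Synthetic rows with three descriptor columns instead of the full set.
demo = io.StringIO(
    "FFI,mol-1,CCO,46.07,0.5,\n"
    "FFI,mol-2,CCN,45.08,error,1.0\n"
)
keys, matrix = load_descriptors(demo)
shape = (len(matrix), len(matrix[0]))
missing = sum(math.isnan(v) for row in matrix for v in row)
print(shape, missing)
```

Before the matrix is passed to an embedding method such as TSNE, columns or rows containing NaN values would typically need to be dropped or imputed.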

3.4 Molecular Images

Images for each molecule were generated using a custom script [covid-analyses-repo] to read the canonical SMILES structure with RDKit, kekulize the structure, handle conformers, draw the molecule with rdkit.Chem.Draw, and save the file as a PNG-format image with size 128×128 pixels. For each dataset, individual pickle files are saved containing batches of images for ease of use, with entries in the format (SOURCE, IDENTIFIER, SMILES, image in PIL format). In Figure 4, we show an example of loading and displaying image data from a batch of files from the FFI dataset.

Figure 4: Molecular image examples. The examples show how to (top) load the data and (bottom) display a subset of the images using matplotlib.
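
Reading one image batch might look like the sketch below. It assumes, without confirmation from this paper, that each pickle file holds a single pickled list of (SOURCE, IDENTIFIER, SMILES, image) tuples; the ImageEntry record and load_image_batch helper are hypothetical names, and the demo uses byte strings in place of PIL images so that it runs without extra dependencies. Because unpickling can execute arbitrary code, pickle files should be loaded only from a trusted source.

```python
import io
import pickle
from collections import namedtuple

# Hypothetical record type mirroring the documented entry format.
ImageEntry = namedtuple("ImageEntry", ["source", "identifier", "smiles", "image"])


def load_image_batch(fileobj):
    """Unpickle one batch and wrap each tuple in a named record.

    ASSUMES the file contains one pickled list of 4-tuples.
    """
    return [ImageEntry(*entry) for entry in pickle.load(fileobj)]


# Synthetic batch: byte strings stand in for the PIL image objects.
buffer = io.BytesIO()
pickle.dump(
    [
        ("FFI", "mol-1", "CCO", b"image-placeholder-1"),
        ("FFI", "mol-2", "CCN", b"image-placeholder-2"),
    ],
    buffer,
)
buffer.seek(0)

batch = load_image_batch(buffer)
print(len(batch), batch[0].smiles)
```

With real data, the image field of each entry would be a PIL image that can be passed directly to matplotlib for display.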

4 Data Access

Providing access to such a large quantity of heterogeneous data (currently 60 TB) is challenging. We use Globus [chard2016globus] to handle authentication and authorization, and to enable high-speed, reliable access to the data stored on the Petrel file server at the Argonne Leadership Computing Facility’s (ALCF) Joint Laboratory for System Evaluation (JLSE). The data are available to anyone following authentication via institutional credentials, an ORCID profile, a Google account, or many other common identities. Users can access the data through the web user interface shown in Figure 5, which supports easy browsing, direct download of smaller files via HTTPS, and high-speed, reliable transfer of larger data files to a laptop or a computing cluster via Globus Connect Personal or an instance of Globus Connect Server. Many active Globus endpoints are distributed around the world. Users may also access the data with a full-featured Python SDK. More details can be found on the Globus website.

Figure 5: Data access with Globus. All data are stored on Globus endpoints, allowing users to access, move, and share the data through a web interface (pictured above), a REST API, or with a Python client. The user here has just transferred the first three files of descriptors associated with the E15 dataset to an endpoint at UChicago.

5 Conclusion and Future Directions

We have released to the community an open resource of molecular structures (as canonical SMILES), descriptors, 2D images, and fingerprints. We hope these data will contribute to the discovery of small molecules to combat the SARS-CoV-2 virus.

We expect forthcoming data releases to extend to molecular conformers; incorporate the results of natural language processing extractions of drugs from COVID-related literature; provide the results of molecular docking simulations against SARS-CoV-2 viral and host proteins; and include the trained machine learning models that the team is building to identify top candidates for running various, more expensive calculations.

6 Data and Code Availability

All data and code links are collected on a single web page. Subsequent updates will be made available through the same web page, and further release papers will be issued as necessary. The code for the examples used in this paper can be found in the accompanying GitHub repository [covid-analyses-repo].

7 Acknowledgements

The data generated have been prepared as part of the nCov-Group Collaboration, a group of over 200 researchers working to use computational techniques to address various challenges associated with COVID-19. We would like to thank all the researchers who helped to assemble the original datasets, and who provided permission for redistribution.

This research was supported by the DOE Office of Science through the National Virtual Biotechnology Laboratory, a consortium of DOE national laboratories focused on response to COVID-19, with funding provided by the Coronavirus CARES Act. This research used resources of the Argonne Leadership Computing Facility, a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357. Additional data storage and computational support for this research project have been generously provided by the following resources: Petrel Data Service at the Argonne Leadership Computing Facility (ALCF), Frontera at the Texas Advanced Computing Center (TACC), and Comet at the San Diego Supercomputer Center (SDSC).

The work leveraged data and computing infrastructure produced in other projects, including: ExaLearn and the Exascale Computing Project [alexander2020exascale] (DOE Contract DE-AC02-06CH11357); Parsl: parallel scripting library [babuji2019parsl] (NSF 1550588); funcX: distributed function as a service platform [chard2019serverless] (NSF 2004894); Globus: data services for science (authentication, transfer, users, and groups); CHiMaD: Materials Data Facility [blaiszik2019data, blaiszik2016materials] and Polymer Property Predictor Database [tchoua2019creating] (NIST 70NANB19H005 and NIST 70NANB14H012).

8 Disclaimer

For All Information. Unless otherwise indicated, this information has been authored by an employee or employees of UChicago Argonne, LLC, operator of Argonne National Laboratory with the U.S. Department of Energy. The U.S. Government has rights to use, reproduce, and distribute this information. The public may copy and use this information without charge, provided that this Notice and any statement of authorship are reproduced on all copies.

While every effort has been made to produce valid data, by using this data, User acknowledges that neither the Government nor UChicago Argonne, LLC, makes any warranty, express or implied, of either the accuracy or completeness of this information or assumes any liability or responsibility for the use of this information. Additionally, this information is provided solely for research purposes and is not provided for purposes of offering medical advice. Accordingly, the U.S. Government and UChicago Argonne, LLC, are not to be liable to any user for any loss or damage, whether in contract, tort (including negligence), breach of statutory duty, or otherwise, even if foreseeable, arising under or in connection with use of or reliance on the content displayed on this site.