OMXWare, A Cloud-Based Platform for Studying Microbial Life at Scale

11/05/2019
by   Edward E. Seabolt, et al.
0

The rapid growth in biological sequence data is revolutionizing our understanding of genotypic diversity and challenging conventional approaches to informatics. Due to increasing availability of genomic data, traditional bioinformatic tools require substantial computational time and creation of ever larger indices each time a researcher seeks to gain insight from the data. To address these challenges, we pre-compute important relationships between biological entities and capture this information in a relational database.The database can be queried across millions of entities and returns results in a fraction of the time required by traditional methods. In this paper, we describeOMXWare, a comprehensive database relating genotype to phenotype for bacterial life. Continually updated,OMXWare today contains data derived from 200,000 curated, self-consistently assembled genomes. The database stores functional data for over 68 million genes, 52 million proteins, and 239 million domains with associated biological activity annotations from GeneOntology, KEGG, MetaCyc, and Reactome. OMXWare maps connections between each biological entity including the originating genome, gene, protein, and protein domain. Various microbial studies, from infectious disease to environmental health, can benefit from the rich data and relationships within OMXWare. We describe the data selection, the pipeline to create and update OMXWare, and developer tools (Python SDK and Rest APIs) which allow researchers to efficiently study microbial life at scale.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

08/17/2021

Modeling Protein Using Large-scale Pretrain Language Model

Protein is linked to almost every life process. Therefore, analyzing the...
11/10/2020

Biomedical Information Extraction for Disease Gene Prioritization

We introduce a biomedical information extraction (IE) pipeline that extr...
01/17/2019

BioSEAL: In-Memory Biological Sequence Alignment Accelerator for Large-Scale Genomic Data

Genome sequences contain hundreds of millions of DNA base pairs. Finding...
08/20/2020

Assigning function to protein-protein interactions: a weakly supervised BioBERT based approach using PubMed abstracts

Motivation: Protein-protein interactions (PPI) are critical to the funct...
10/16/2021

DIPS-Plus: The Enhanced Database of Interacting Protein Structures for Interface Prediction

How and where proteins interface with one another can ultimately impact ...
03/16/2020

Health State Estimation

Life's most valuable asset is health. Continuously understanding the sta...
10/06/2020

Gene Regulatory Network Inference with Latent Force Models

Delays in protein synthesis cause a confounding effect when constructing...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

The invention of next generation sequencing (NGS) [30], a fast and efficient technology that measures DNA or RNA sequence based on parallel nucleotide synthesis and optical detection, has made it inexpensive and routine to sequence multi-omics data. With the exponential growth of this data, the field of bioinformatics has emerged, enabling us to parse and derive knowledge from vast amounts of sequence data [26, 32, 20]. Bioinformatic tools are applied to NGS data which deliver the DNA or RNA sequences from a sample as short strings called reads. These reads, often 50-200 bp in length, must be informatically filtered and then assembled or mapped to known references [21]

. Libraries of open source tools are available to process sequence data for a diverse set of goals. For common microbial applications, these goals include genome assembly, alignment to references, naming organisms based on conserved sequences, and identifying genes (annotation).

[6, 22, 5, 10, 19]

. From the assembled DNA bit sequence, scientists can classify the ’genotype’ of an organism

[18, 37]. From the assembled or aligned RNA bit sequence, scientists can learn which genes are turned on or "expressed." However, genotypic information does not by itself indicate an organism’s biological activity or phenotype [27, 2].

Many of the expressed RNA transcripts can be translated into proteins which deliver biological activity to the system. Within these proteins are sub-regions, structural and functional protein domains, that carry out the actual enzymatic or constructional function(s) of that protein. Libraries of software have been developed to identify protein domains[28]. These applications also associate standardized functional codes and ontologies with specific domains to link them directly to molecular function, cellular structure, biological processes, biochemical pathways, etc. [4, 14, 28]. Large databases have been created to catalog domains and their functional codes, but work still needs to be done to achieve greater inter-operability and to design convenient application interfaces [33, 3].

Using existing tools, a researcher seeking to link genotype to phenotype for a particular genome must manage a large number of text files created by existing and often disparate software artifacts. These include files containing raw sequence reads, artifacts from assemblies, annotated genes, proteins, etc. The data represented is also likely to have identifiers that are unique to the source repository and do not necessarily connect data types across references. Additionally, since important genes with identical sequence may exist more than once in a genome or collection of genomes, these files also contain redundant data. Even for a small bioinformatic study, this traditional approach can consume vast amounts of storage. Thus, current bioinformatic methods need to evolve in order to address the demands of processing ever-larger collections of genomic data.

In this paper, we report an application of big data techniques and relational database technology to demonstrate an alternative approach to classical bioinformatics. We call this system OMXWare. It delivers a database that is magnitudes larger than public references. The data contained within it better captures true natural sequence diversity from genotype to phenotype, and provides a powerful framework for querying data to address questions of microbial life at scale. Leveraging existing annotation tools, like Prokka [31] and InterProScan [28], we identified biological entities from public bacterial whole genome shotgun sequence data. The entity relations are based on the actual relationships between biological objects and their properties. The entities and relations are stored in a relational database to optimize storage and query time. Construction of the database itself is a compute and memory intensive process that runs on the cloud. Over 10 million compute hours went into the discovery of biological entities and their relations. Here we detail the pipeline used for the annotation process, the database itself, and indices created to optimize performance. In addition, a developer toolkit (Python SDK and APIs) has been implemented to provide a collaborative platform that facilitates querying the database and to support scientific discovery and development of custom applications.

2 Data Description

2.1 NCBI Sequence Data

The National Center for Biotechnology Information (NCBI) maintains a large, public domain repository of raw sequence data sets in the NCBI Sequence Read Archive (SRA) [20]. From this data, as we describe in Section 3.1, we initially selected 159,628 BioSample accessions of raw Illumina paired-end bacterial isolate whole genome sequence (WGS) data. We augmented this data with 6,781 high quality assembled genomes maintained in the NCBI Reference Sequence (RefSeq) Complete genome collection. This selection provided us with 166,409 bacterial genome data sets, representing 517 genera of bacteria. These numbers have grown to over 200,000 genomes as our NCBI monitor (described in Section 3.5) continuously adds new genomes to the OMXWare database. The quality control rules and curation process are described in the Methods Section 3. Genome accessions for the initial data sets are available in the supplement.

2.2 NCBI Metadata

From the NCBI BioSample database, we retrieve descriptive information about the genomic data stored within NCBI. We download all metadata related to bacterial genomes within NCBI, including BioSample and BioProject, among others. For a single bacterial genome data set, there are metadata fields which describe characteristics such as isolation source, collection geography, contributing wet lab, etc. We ingest into the database and index metadata from each of the NCBI reference databases, including BioProject, BioSample, GenBank, Pathogen, RefSeq, SRA, and Taxonomy. Due to the undefined naming conventions (e.g. over 1500 attributes from only BioSample), we did not impose a structure to the metadata, but rather ingest and provide access to all fields from the seven sources.

The NCBI metadata is important as it allows for experimental classification of the sequence data. The metadata artifacts are related to the biological entities, so they can be essential sources of ground truth data in downstream analysis. For example, by indexing BioSample, specifically antimicrobial assay data, in relation to the genomic accession, we can annotate the genome, the assay data and the antimicrobial resistance phenotype. Such information can then be used for analysis when investigating genomic signatures contributing to antimicrobial resistance at a macro scale. Therefore, it is crucial to ingest not only the sequence data, but also the corresponding metadata, in order to allow experimentally-informed analysis.

2.3 Breadth and Depth of Biological Data

Next generation sequencing has been applied to the characterization of microbial life for over a decade, yet, we have only characterized a fraction of the total microbial diversity present on Earth. Projects such as the Earth Microbiome Project [34], Human Microbiome Project [12], 100K Pathogen Project [35], and others aim to expand the available sequence data to describe microorganisms. However, many researchers have adopted usage of current standard references such as NCBI RefSeq Complete. As of Oct 2019, NCBI RefSeq "Assembly Level" Complete contained 14,800 bacterial genomes.[26] This does not sufficiently capture microbial biodiversity in a comprehensive way. Without expansion, references of this nature will result in false negatives during microbial analysis. In contrast, OMXWare, containing over 200,000 genomes, also retrieves data from other NCBI databases including the Sequence Read Archive (SRA) which is a significant expansion in the breadth of microbial sequence data sets. We discuss our curation and selection process in section 3 to highlight the quality control methods applied in OMXWare. Incorporating a breadth of microbial sequences is necessary and essential to support accurate genomic analysis[15, 16].

Additionally, microbial genomes are highly dynamic. Microbes evolve function when under environmental stress e.g. when exposed to heat, pH extremes, antibiotics, etc. and will readily exchange DNA with other microorganisms in their niche. They do this, for example, by exchange of plasmids and conjugative transposons. This results in an ever expanding set of genotypic information; therefore, large numbers of genomic sequences are required for any genus to comprehensively capture the true underlying diversity[16]. OMXWare has an increased depth of data describing a single microbe compared to common references. For a given organism, OMXWare houses tens of thousands of sequences, providing more accurate representation of genomic variation. Table I shows a comparison genome data availability for a subset of genera (common and more rare) in OMXWare versus other curated data sources, specifically Joint Genome Institute (JGI) Integrated Microbial Genomics (IMG) and Microbiomes (IMG/M)[24], Genome Taxonomy Database (GTDB) [25], and Ensembl Bacteria [1].

Genus JGI GTDB Ensembl OMXWare
Salmonella 3,099 7,876 3,986 36,242
Weissella 30 41 19 20
Camplylobacter 734 3,524 358 15,111
TABLE I: Comparison of Quantity of Genomes (per Genus)

Similar comparisons can be done based on the number of annotated genes, proteins, and domains present in different repositories. Table II shows the number of gene sequences with a specific search substring identified and stored within the OMXWare database compared to NCBI Nucleotide (nt) and Ensembl Bacteria. Table III shows the corresponding data for protein and domain annotations compared to The M5nr database[36].

Gene NCBI nt Ensembl OMXWare
ftsZ 69,000 52,000 208,000
crispr 7,900 150,000 500,000
TABLE II: Comparison of Quantity of Genes (per search substring)
M5nr OMXWare
Total Protein Sequences 17,725,381 51,362,234
Total Functional Codes 6,626,200 233,418,698
TABLE III: Comparison of Quantity of Protein and Functional Codes

The numbers in the above tables continue to grow as OMXWare continually retrieves the latest sequence data contributed by researchers from around the world (see Section 3.5 NCBI Monitoring).

3 Methods

3.1 Genome Curation and Selection

The quality of WGS and accuracy of metadata maintained by NCBI varies dramatically. For instance, NCBI includes some data sets derived from contaminated samples (not true bacterial isolates). Other microbial data sets have been assigned an incorrect taxonomic identifier, which compromises identification accuracy in downstream analyses. Additionally, some sequenced isolates do not contain sufficient depth or quality of sequencing to adequately describe the genotype itself. To address these data inconsistencies, OMXWare systematically curates NCBI’s microbial WGS data using the methods outlined below.

Most bacterial genomes contain one circular chromosome and a small number of plasmids. Genome assemblies based on fewer, longer contiguously de novo assembled regions (contigs) more accurately represent the true underlying genome structure compared to assemblies comprised of larger numbers of short contigs. Furthermore, more contiguous genome assemblies yield more complete and more accurate gene, protein, and functional domain annotations. Therefore, ingestion of data into OMXWare is subject to a number of stringent quality conditions. To ensure good genome coverage, only raw datasets 100MB in size were processed. These are converted to FASTQ format using the NCBI SRA Toolkit Technology [32]. To maintain high quality genomes, we discared assemblies containing more than than 150 contigs of size > 500 bp or with an N50 less than 100kb. Genome assembly method details are included in Section 3.2). For one genus, Shigella, less stringent rules, excluding assemblies containing greater than 500 contigs (of size > 500 bp) and an N50 of less than 15,000 bp, were used to compensate for the known naturally occurring repetitive sequences in its genome. With these selection methods, we ensure that the genomes ingested into the database represent bacterial isolate genomes with high fidelity. From the original corpus of NCBI assemblies, 172,672 bacterial genomes met these curation rules.

OMXWare ingests only genomes with valid genus level classification. Therefore, our curation process included steps to ensure that no mislabeled, contaminated, or chimeric genomes were included. We created a Kraken reference populated from 6,781 RefSeq Complete genomes using the NCBI taxonomy [37, 26], where unique k-mers are constructed into a tree guided by this taxonomy. Using this reference, each of the 172,672 genomes was classified on a contig-by-contig basis to their lowest common ancestor (LCA) against the taxonomy tree. The resulting k-mer hits per node were rolled up to the genus level. This analysis provided a measure of both the k-mers matched by a genus and the k-mers not in the reference (N k-mers). For the majority of genomes, all the k-mers matched exclusively to one genus. However, some genomes contained k-mers ( N ) that matched to one or more genera which were different from the labeled genus which indicated potential contamination, chimerism, or mislabeling. Adding such a genome would degrade the performance and accurate use of the reference database. Therefore, we investigated the ratio of information gained to information lost by the addition of a genome to the reference. All genomes with a ratio of larger than 20:1 gained to lost k-mers were added to the original RefSeq Complete Kraken database, and all other genomes were set aside as indeterminant genomes. The indeterminate genomes were then reclassified on the now larger reference on a contig-by-contig basis, using the same ratio threshold to determine if the genome should be included. From this analysis, we found that 159,628 genomes were well classified representatives of their designated genus. In combination with the original 6,781 RefSeq Complete genomes, this yielded a high quality collection of 166,409 bacterial genomes. By curating genomes in this manner, 13,044 genomes were rejected which may have resulted in a loss of unusual biological samples. However, it would not be biologically consistent to add such samples to this reference for genus-level analysis, as there is not enough genotypic content to define that genus.

Through these curation steps, we ensure that the ingested genomes into OMXWare represent biological genomic structure and are classified at the genus level. This allows for more accurate downstream analysis as we are more closely representing true biological attributes than the unfiltered NCBI WGS sample data.

Fig. 1: Curation and annotation pipelines

3.2 Genome Assembly and Annotation

Given the quality thresholds applied in our selection process, we next describe the pipeline process of assembly. Figure 1 represents the individual stages executed for the assembly pipeline process. To ensure the highest quality data is processed for the assembly, several pre-assembly steps are performed such as removing low quality reads, contamination from adapters, or PhiX internal sequencing control. In order to reduce unassembled contigs and increase N50 value, we use FLASh [22] to merge paired-end reads and to improve the overall quality of the resulting assembly. Once pre-assembly steps are complete, the merged reads are assembled in an iterative genome assembly/quality evaluation process with SPAdes and QUAST to optimize for the most complete genome assemblies[5, 10]. This dynamic processing allows each isolate to use the most optimized parameters for the data type and yield the highest quality assembly. By adding this novel step to our pipeline, we are able to improve the assembly quality significantly from what is available in the public domain. For example, Figure 2 highlights an exemplar Acinetobacter genome that in the public domain contains 4,718 contigs. From that same raw sequence data, the OMXWare genome assembly pipeline was able to achieve a more continuous genome assembly of 131 contigs. Increased continuity more accurately represents bacterial genome structure and provides better starting data for gene, protein, and functional annotation. The genomes are further processed for taxonomic naming accuracy as described in Section 3.1.

Fig. 2: Genome Assembly Improvements Beyond Public References Bandage plot indicating regions within a genome assembly that can be mapped into contiguous regions (contigs). A more complete assembly (less contigs) is more representative of biological truth.

3.3 Gene and Protein Annotation

As shown in the second pipeline of Figure 1, after genome assembly, the genes and proteins are annotated. First, genes and proteins are discovered from the assembly using Prokka[31]. Next, the generated ".fna" and ".faa" files are parsed, resulting in the collation of the genome, gene, and protein data entities into CSV files. Finally, the CSV files are loaded into the appropriate tables within the database. The database schema is described in Section 4.3

3.4 Domain Annotation

In order to connect the genome to the function it contains, we must look at the domains of each protein. Protein domains are sub-strings of the protein which determine the enzymatic activity, and thus deliver the function. After gene and protein annotation, protein domains are identified using InterProScan [28]. The third pipeline in Figure 1 illustrates this phase. First, unique sequences for annotated proteins are scanned against the OMXWare database to determine if they were previously identified. The output of this scan produce a reduced set of new protein sequences, which greatly decreases the amount of time required to analyze the protein sequences by InterProScan. Next, InterProScan is run on the reduced protein sequence set. All 16 available analyses provided by InterProScan are run over all input sequences. Results are output in JSON format. To reduce the amount of time spent in this analysis step, we distribute each of the 16 algorithms to its own process via GNU parallel within the executing stage. We also leverage InterProScan’s ability to utilize a local network lookup service, which we have placed in a cluster and load balanced as shown in Figure 3. Next, for each of the 16 resulting JSON documents produced by InterProScan, we parse the annotated domain information into a set of CSV files. As in the gene and protein annotation step, these CSV files are loaded into the appropriate tables in DB2. By using these steps, we are able to efficiently identify the domains within the protein sequences, connecting the genotypic information to phenotype or function.

3.5 NCBI Monitoring

OMXWare is not just a static database, but increases as available bacterial information grows. NCBI reports an average annual growth rate of of bacterial genetic data[29]. Regulatory agencies such as the US Food and Drug Administration and Center for Disease Control submit sequenced isolates relating to infectious disease outbreaks from across the country on a daily basis. Additionally, it is typically required to submit raw sequence data for any biological publication. The genome sequence data may require a number of pre-processing steps before it can be used for analysis. Thus, we devised a new monitoring service allowing the database to be continually updated with the latest genome assemblies and sequence data from NCBI. The process allows the OMXWare database to contain the most current data and optimizes for high quality data meeting our previously stated curation thresholds.

OMXWare monitors 7 key databases from NCBI: BioProject, BioSample, GenBank, Pathogen, RefSeq, Sequence Read Archive (SRA), and Taxonomy. BioProject and BioSample are used to supplement and enrich biological entity search capabilities. Each genome in OMXWare is associated with its corresponding NCBI metadata record from these sources to provide additional information and to aid our users in discovering relevant data. GenBank, Pathogen, RefSeq, and SRA are used to continually update OMXWare’s bacterial genomic sequence database and yield their downstream genes, proteins and functional domains, using the methods described above. Microbial nomenclature and taxonomy can change as new genotypic and phenotypic evidence is discovered; therefore, we use the latest NCBI Taxonomy tree to ensure our genomic data is associated with the most accurate identifiers known to date.

To identify whether an assembled genome from GenBank, Pathogen, or RefSeq should be added or modified within OMXWare, the following conditions must be met:

  1. Using the taxonomic tree, we identify if the genome is of bacterial lineage.

  2. We identify if the genome has an assembly level of "Complete Genome". Per NCBI, complete genomes provide the highest quality data as all genomic data contained in the assembly is represented.

If these two conditions are satisfied, the monitor will schedule the annotation pipeline to run for the genome. This includes retrieving the assembled genome data from NCBI and annotating the genes, proteins, domains, pathways, gene ontology, etc.

For SRA or unassembled Pathogen genomes, the process is more complicated as the raw reads must go through an assembly process first. The conditions for adding or updating SRA data to OMXWare are as follows:

  1. Must be bacterial as defined by the data set’s taxonomic lineage.

  2. Library strategy must be Whole Genome Sequenced (WGS).

  3. Library source must be Genomic.

  4. Library descriptor must indicate pair-ended reads.

  5. Library platform must be Illumina (selecting short read sequencing only).

If these conditions are satisfied for SRA data, the monitor will schedule the genome assembly and annotation pipelines to run for the data set.

These monitoring processes are crucial in maintaining the relevance of data in OMXWare. It is important to update the database with the most current information and metadata, as biological knowledge grows with more sequencing data. Furthermore, with the expanding amount of sequence data, an updated relational database becomes more useful, as the methods explained above allow for optimized genome, gene, protein, and domain analysis that would be intractable otherwise.

4 Architecture

Due to the large scale of sequence data produced by this pipeline, we implemented a cloud-based architecture to effectively orchestrate the complex use of the bioinformatic tools described above across multiple servers. We created the OMXWare Distributed Pipeline Framework (ODPF) to orchestrate the execution of the open source components for genome assembly and annotation of the bacterial genomes originally pulled from NCBI and through the NCBI monitor.

4.1 System Details

We leveraged the IBM Cloud to provision and deploy essential bioinformatic tools, described in Section 3. These bioinformatic analyses require large compute time, so to optimize the runtime, we used a combination of bare metal and virtual machines totaling over 1468 CPUs, 6TB RAM, and 160TB of hard drive space. With such a large number of machines running concurrently, it was necessary to implement methods for cluster management and orchestration. Therefore, we utilized Apache Mesos for cluster resource management and scheduling [13] and Marathon for Docker container orchestration and health checking [11]. In order to efficiently monitor the execution and performance of the system, we used RabbitMQ as a broker [9] to implement a message-oriented methodology for executing pipelines. This also emits system events from all pipelines, captured in the time series database InfluxDB [23, 8]. In order to automate management and execution of the system we created the ODPF, to coordinate incoming messages from the broker, execute individual stages of a pipeline, record events as each stage progresses, and route messages to additional queues when requested. This was a crucial component to enable the computationally intensive task of assembling and annotating such a comprehensive set of genomes.

Fig. 3: OMXWare Distributed Pipeline Framework Architecture

4.2 Pipeline Execution Mechanics

A pipeline job can be started by creating a JSON document with the desired stages to execute and submit to the target queue. As a job is processed, the ODPF will emit events for the start, completion, and failure of stages. An event contains a number of fields including the context of the stage executed, environment settings from Mesos and Marathon, the host at which the process is executing, and many others. If an error is encountered during processing, ODPF will create a failure queue, denoting the name of the current queue that has failed. The failed portion of the job is sent to this queue for later inspection. Additionally, a failure event is also recorded in InfluxDB which will contain all the necessary information for an administrator to locate where in the cluster the failure occurred and why.

An instance of ODPF is encapsulated as a Docker container and deployed into a cluster through Marathon’s management console or API. Only a minimal initial configuration is required for this deployment including access to the Docker socket file, the name of the InfluxDB database to store events, and message broker host and input queue name. The Docker socket file is required as ODPF coordinates the execution of sibling containers based on the contents of the job received from the message broker. By allowing the job descriptor to describe queue routing behavior, we can fine-tune how a job is processed. For example, some bioinformatic processes tend to take longer to reach completion than others. Letting stages or invocations define their target process queue provides the flexibility for a said queue to be serviced by ODPF instances allocated with additional cluster resources. This allows jobs to be highly distributed without stalling the pipeline due to a slow stage.

4.3 Database Schema

OMXWare utilizes a relational database in order to efficiently store and retrieve the interconnected biological entities and sequence data. The relational database allows to index the data, perform quick queries, and creates high read/write access. This allows for fast operations and analysis of the annotated data at a large scale.

By using a relational database, we are able to optimize data storage, as each unique entity is stored exactly once. As shown in Figure 4, after the initial ingestion of genome accessions, there is an incremental growth rate from the NCBI Montinor (Section 3.5). For gene and protein entities, after the initial ingestion, the rate of growth is sublinear, as the rate of discovery of novel genes and proteins scales approximately as the square root of the number of genomes processed [17]. Figure 5 shows the actual rate at new entities were discovered as new genomes were added to the database. As we later introduced domain annotation to our pipeline, we see a similar trend beginning with a high discovery rate that reaches a plateau. Thus, we expect a slower rate of growth as new data is continually introduced. The long plateaus of no growth in Figures 4, 5 reflect periods in which ingestion of new data into the system was paused for testing.

Fig. 4: Number of Genomes in Database over Time. This figure shows rate of genome ingestion into the database.
Fig. 5: Number of Entities in Database over time. This figure shows the rate of discovery of new entities and ingestion into the database.

4.3.1 Entities

The biological entities within OMXWare are the central objects stored in the database. These include genomes, genes, proteins, and domains. OMXWare has a total of 5 entities and 36 tables to capture sequence data, metadata, and mapping data. Storing the entities in this manner allows for efficient storage and maintenance of relationships from the genome to the domains, with associated metadata.

  1. Genome: We store the genome accessions from NCBI, and use this as a unique id for the genome and associated assay information. We also store the validated genus and full NCBI taxonomic lineage. The genome entities are connected to the genes and proteins found within the particular genome.

  2. Gene: For each gene found within a genome, we store the annotated gene sequence and maintain this relationship between gene sequence and the genomes that contain it. Furthermore, we store the name of the gene assigned by Prokka, and associated short names, to allow users to query for genes in a traditional manner. We store the relative position of each gene on each source genome contig as reported by Prokka. The gene entities are connected to their downstream proteins.

  3. Protein: Similar to genes, for each protein, we store the annotated protein sequence and maintain the relationship to the originating genome and gene sequences. We store both the fullname and shortname for each protein sequence and the relative position of the protein sequence on each source genome contig, as given by Prokka. The protein entities are connected to their downstream domains.

  4. Domain: For each domain found within a protein, we store the name and description and a connection to the originating protein. Since domains are the distinct functional or structure units on a protein, the ability to easily query this information is crucial to connect genotype to phenotype. Thus, we represent domains using IPR codes and model these IPR codes as subentities within the database to improve performance.

  5. IPR: IPR codes are assigned to protein domains by Interpro and are a representation of domains and protein families. For each IPR code, we store its associated description, name, and type. The IPR codes are linked to their corresponding domain, and this relationship can be used to map back to the original protein, gene, and genome.

  6. Pathway: Pathways represent molecular interactions, reactions, and relation networks for biological functions. OMXWare contains pathway information from three databases: KEGG, MetaCyc, Reactome. We store the pathway code, the source database, and its description. The pathway codes are related to both domains and IPR codes to allow fast association between the domain entities and pathways. Pathways are commonly represented using gene ontology (GO) terms. We also model GO terms again as subentities.

  7. GO: The gene ontology terms are associated with the pathway information described above. We store the terms, their name, their description, and maintain a connection to the associated IPR codes. This reduces the number of joins necessary to traverse back to the originating sequence data.

4.3.2 Entity Relations

Once the entities and related data have been modeled and stored, it is important to connect these entities in the relational schema. The entity relations connect the genomes, genes, proteins and domains found through the annotation process. The structure of these relationships is described in Figure 6, which details the data stored for each main entity and the associated relationships. We can see that this relational model is crucial in reflecting the biological relationship between genotype and phenotype. In order to allow for genomic analysis at scale, it is important to maintain these connections for all annotated sequences, rather than repeatedly perform the annotation process as needed. Such relations can allow the user to draw generalizations from the data, such as identifying genera that commonly contain a particular gene. By maintaining relationships between this pre-computed data, we are able to facilitate analysis that leverages the large amount of data contained in the database.

Because of the scale of the data stored within OMXWare, traversing across the entity tables is a non-trivial task. We use mapping tables in order to maintain the relationships between entities and optimize performance. By using such mapping tables, in conjunction with DB2’s columnar table store, we are able to optimize the number of joins needed to traverse the biological entities within central dogma of molecular biology, resulting in faster queries.

Fig. 6: Simplified Entity Relationship Diagram. A simplified ER diagram is shown representing the most important entity relations in OMXWare. For performance reasons some of the tables shown above are split into primary object and object details tables.

5 User Interface

In order to allow quick and fast traversal across the "central dogma of molecular biology"[7] within the database, we surface the data to the user in a multiple ways to accommodate differing needs. We have designed an interface and developer tools with two types of users in mind, computational biologists and microbiolosts, each with varying development skills, requiring different options for access to the data. OMXWare provides a developer toolkit to allow users to programmatically access this magnitude of data while bypassing costly biological computation. In order to interact with the database, we have created a Python SDK package that utilizes Elastic Search to return the relevant information for the query. Additionally, we have developed APIs, REST service endpoints, and a browser interface to allow for non-programmatic access to the data.

5.1 Developer Tools

To perform analysis using the OMXWare data, we have created two methods to programmatically query the database. First, we have implemented a set of generalized Rest APIs, to allow a user to query for specific entities based on various attributes, relationships, and metadata. For example, a user can query for genes of a particular length, within a specific genus, or from a given host organism. These APIs are structured in this manner to allow users to specify the results to meet their scientific needs.

Furthermore, the same queries can be produced through our Python SDK. The functions within the SDK follow the same structure as the APIs, providing consistency across all platforms. Through the SDK, results can be returned as a JSON structure, dataframe, or FASTA format in order to support a variety of downstream use cases.

Users can find complete documentation for API services and SDK functions in the Develop section of the OMXWare webpage.

5.2 OMXWare Hub

We have developed a web portal to serve as the central hub for OMXWare resources. Here, a user can search for data and find summary statistics including counts of entity types within the database, community engagement posts including ongoing research findings in the News and Learn sections, and discussion about technical issues. Additionally, we have developed an Explore page that allows the user to experience the relationships captured in the database through a gallery of graphs and visualizations.

A key aspect of the OMXWare Hub is the capability to perform a key-word search across any biological entity within the database—literally over 300 million sequences. Because of the optimization of the database and the use of ElasticSearch, we are able to return thousands of matching results within milliseconds, along with all of their related entities. The various entities are also returned with their appropriate metadata, curated from NCBI, in order to provide necessary information for biological analysis.

5.3 OMXWare Apps

Collaborative development of applications on top of the OMXWare database is supported through both the user interface and GitHub. In the Develop section of OMXWare Hub, we showcase several different applications that have been built using the OMXWare data (vide infra). A user can also upload an application, with provisions to allow others to use their tool or release source code. This is meant to foster community development and reuse of novel biological tools that are built upon the OMXWare database and infrastructure. A larger scale application platform to support computationally complex applications created by users is part of ongoing work.

5.3.1 Example Application - Feature Discovery

In this section, we describe an approach to discovery of biological features relevant to a particular phenotype (e.g., virulence). Virulence phenotypes include co-occuring cellular structures such as pili, flagella, or secretions systems, as well as the ability to produce toxins or transport important ionic metals. In general, each phenotype requires several distinct proteins. We leverage the fact that bacterial genomes are gene dense and that genes expressed together often exist near each other in gene-clusters. We built an application on top of OMXWare to expand the scope of feature discovery beyond an initial gene set for a more comprehensive and robust phenotype signature. Beginning with a set of input protein domains, we first find all gene features with the known domain architecture and then perform a relative gene position analysis to find additional neighboring genes or proteins to enhance sensitivity and specificity for the phenotype.

To test feature discovery, we begin with a list of 34 unique gene names known to be relevant to a particular virulence phenotype. This gene set was introduced by a domain expert and it yielded 146 related domain codes and domain architectures (combination of codes) within the OMXWare system. This set of domain combinations became the initial set for analysis. We then found all proteins within OMXWare that contain any of these domain combinations, which resulted in 7,710,132 unique protein sequences. We call this protein set "candidate proteins", as they are the proteins that are candidates for analysis, after filtering steps. We found all "candidate genomes" containing one or more candidate proteins. In order to prioritize phenotype enrichment (i.e. genomes with the most virulence features), we include only candidate genomes that contain more than 10 of the original domains, called "in-group genomes". All the other candidate genomes form the "out-group genome" set. This is to avoid including genomes that may not display the virulence phenotype in the analysis. For all candidate proteins, we calculated the log-scale ratio of number of occurrences within an in-group genome to the number of occurrences in an out-group genome. We then select only those candidate proteins that are at a log-scale ratio of 0 or higher (i.e. selecting those that are present in the in-group more frequently than the out-group), and called this set "pivot proteins". Details about the count of each entity before and after filtering can be found in Table IV.

Entity Description Count
Input Genes 34
Domain Codes 146
Candidate Protein (unique name) 4,272
Candidate Protein (unique sequence) 7,710,132
Candidate Genome 193,223
In-Group Genome 3,805
Pivot Protein (unique name) 3,993
Pivot Protein (unique sequence) 7,691,624
TABLE IV: Counts of Entities during Data Selection

The pivot proteins are then used to find adjacent (nearest neighbor) proteins on the same genome(s). We retrieve the neighbor proteins from OMXWare using the Prokka index of the pivot protein on a given genome contig and return the two proteins at location and on the same contig. Prokka provides this relative index of proteins on contigs. As discussed in Section 3.1, the increased continuity of our genome assemblies allows for high accuracy when investigating the relative position of proteins. Proteins located on the end of a contig will yield some false neighbors, but for high quality assemblies with fewer long contigs, these false neighbors are infrequent and easily filtered based on frequency. Evidence for index consistency is provided by Figure 7, where we plot the counts for each putative neighbor of one input domain code. We can see that there are 10 neighbor proteins that are identified consistently across all genomes, with a long tail of proteins that are identified only once. Filtering out all neighbor proteins that only occur once, we can find a consistent set of neighbor proteins using Prokka’s relative index.

Fig. 7: Count of neighbor proteins for IPR17660. This figure shows the number of times each protein name occurs as a neighbor of IPR17660 across all genomes.

If the neighbor protein was also one of the original pivot proteins, we call that a co-pivot protein. If the neighbor protein was not included in the original set, this is a discovered protein. Using this discovered protein set, we can augment the known proteins that contribute to the particular virulence phenotype. Using this procedure, we found 114 uniquely named pivot proteins and 66 uniquely named discovered proteins for virulence, across all genomes in the database. The results are summarized in Table V.

Entity Description Name Count Sequence Count
Co-Pivot Protein 114 284
Discovered Proteins 66 84
TABLE V: Counts of Proteins found in Neighbor Identification

The structure of the OMXWare database is crucial to this analysis as it was possible to quickly traverse from domains to proteins to genomes to identifying adjacent proteins. Without a relational database to facilitate these connections, such analyses would not be possible on a large scale. Furthermore, the application has been contributed back to the community and deployed as a generalized micro-service to allow users to input any set of domains or proteins relevant to their study. The structure of the APIs created for this example application support feature discovery for any phenotype of interest. We have pre-computed example inputs for several phenotypes, including virulence, to allow for exploration of this application. The application can be accessed through the OMXWare Applications page and can be launched from the user interface for ease of use.

6 Conclusion

A relational database linking genotype to phenotype across over 1,000 genera of microbial life transforms the way in which bioinformatic questions can be asked and answered today. The database grows only at the rate new features are discovered in real-time as genome assemblies, annotations, and relations are continually updated and stored uniquely in a compute cloud. Previously observed biological entities and their relations need not be recomputed when identified again as new data is added to the system. This property is essential to keeping pace with the rate of newly sequenced samples in the age of big data. Most importantly, the computational work required to build the database itself means the answers to many biological questions have been pre-computed and can be retrieved by simply querying the database. The study and discovery of co-occurring virulence features described above is an exemplar of this property. For this research, there was no need to build static indices or workflows as required for traditional bioinformatic tools e.g. BLAST[18], Kraken[37], or Bowtie2[19] among others, but instead a new query simply had to be executed to map targeted proteins and then to find their neighbors. This shift in paradigm supports streaming of biological data as biological entities are updated in real-time instead of the traditional method of building references from static files.

The database and system reported here demonstrates how bioinformatics can scale in the age of big sequence data. At the time of this report, the database contains extensive metadata related to over 200,000 high quality bacterial genomes, over 68 million unique genes, over 52 million unique proteins, and over 239 million unique protein domains. Relationships between genomes, genes, proteins, and protein domains is central to connecting genotype to phenotype. The OMXWare APIs, SDK, and services endpoints allow one to build and test new applications for diverse research interests.

6.1 Ongoing work

In order to provide ease use as a research platform, we are taking key steps to augment the OMXWare system. Currently, we are in a limited release and plan to expand user access and capabilities in the coming year. We are developing an application framework to allow users to contribute applications to the system and to leverage a broad user community. As the application framework expands, our goal is to increase the benefit to the scientific community by allowing an individual user to make their applications accessible through the OMXWare platform. Additionally, in order to further support microbial analysis, we are developing capabilities to assist users in uploading and analyzing their own sequence data. It is important to allow a researchers to compare their own microbial data to the OMXWare reference data. To this end, we are developing a method for data upload into new OMXWare schema instances and to allow for interoperability between user generated data and OMXWare reference data.

Acknowledgment

The authors would like to acknowledge Dr. C. A. Elkins, Ph.D. and his team at US Centers for Disease Control and Prevention for inspiring the prototype Virulence Feature Discovery App and for providing the first features for testing. We would also like to acknowledge Dr. N. Haiminen, Dr. L. Parida, and Dr. M. Kunitomi at IBM Research for helpful discussion and insights into other new uses of OMXWare.

References

  • [1] B. L. Aken et al. (2017) Ensembl 2017. Nucleic Acids Res. 45, pp. D635–D642. Cited by: §2.3.
  • [2] B. Alberts, A. Johnson, J. Lewis, D. Morgan, M. Raff, K. Roberts, and P. Walter (2014) Molecular biology of the cell, 6th edn new york. NY: WW Norton & Company. Cited by: §1.
  • [3] R. Apweiler, A. Bairoch, C. H. Wu, W. C. Barker, B. Boeckmann, S. Ferro, E. Gasteiger, H. Huang, R. Lopez, M. Magrane, et al. (2016) UniProt: the universal protein knowledgebase. Nucleic acids research 45 (D1), pp. D158–D169. Cited by: §1.
  • [4] M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis, K. Dolinski, S. S. Dwight, J. T. Eppig, et al. (2000) Gene ontology: tool for the unification of biology. Nature genetics 25 (1), pp. 25. Cited by: §1.
  • [5] A. Bankevich, S. Nurk, D. Antipov, A. A. Gurevich, M. Dvorkin, A. S. Kulikov, V. M. Lesin, S. I. Nikolenko, S. Pham, A. D. Prjibelski, et al. (2012) SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. Journal of computational biology 19 (5), pp. 455–477. Cited by: §1, §3.2.
  • [6] A. M. Bolger, M. Lohse, and B. Usadel (2014) Trimmomatic: a flexible trimmer for illumina sequence data. Bioinformatics 30 (15), pp. 2114–2120. Cited by: §1.
  • [7] F. Crick (1970) Central dogma of molecular biology. Nature 227 (5258), pp. 561. Cited by: §5.
  • [8] C. J. Date and C. J. White (1984) A guide to db2: a user’s guide to the ibm product ibm database 2, a relational database management system for the mvs environment and its companion products qmf and dxt. Addison-Wesley Reading, MA. Cited by: §4.1.
  • [9] D. Dossot (2014) RabbitMQ essentials. Packt Publishing Ltd. Cited by: §4.1.
  • [10] A. Gurevich, V. Saveliev, N. Vyahhi, and G. Tesler (2013) QUAST: quality assessment tool for genome assemblies. Bioinformatics 29 (8), pp. 1072–1075. Cited by: §1, §3.2.
  • [11] S. Hoque, M. S. de Brito, A. Willner, O. Keil, and T. Magedanz (2017) Towards container orchestration in fog computing infrastructures. In 2017 IEEE 41st Annual Computer Software and Applications Conference (COMPSAC), Vol. 2, pp. 294–299. Cited by: §4.1.
  • [12] Human Microbiome Project Consortium (2012) A framework for human microbiome research. Nature 486, pp. 215–221. Cited by: §2.3.
  • [13] D. Kakadia (2015) Apache mesos essentials. Packt Publishing Ltd. Cited by: §4.1.
  • [14] M. Kanehisa and S. Goto (2000) KEGG: kyoto encyclopedia of genes and genomes. Nucleic acids research 28 (1), pp. 27–30. Cited by: §1.
  • [15] J.H. Kaufman et al. (2019/6) A big data analysis of integrative conjugative elements within and between bacterial genera. In ASM Microbe 2019 Session AAR01, Cited by: §2.3.
  • [16] J.H. Kaufman (2019) Insular microbiogeography: three pathogens as exemplars. Addison-Wesley Reading, MA. Cited by: §2.3, §2.3.
  • [17] J. Kaufman, C. Elkins, M. Davis, A. Weis, B. Huang, M. Mammel, I. Patel, K. Beck, S. Edlund, D. Chambliss, et al. (2019) Insular microbiogeography: three pathogens as exemplars.. Current issues in molecular biology 36, pp. 89–108. Cited by: §4.3.
  • [18] W. J. Kent (2002) BLAT—the blast-like alignment tool. Genome research 12 (4), pp. 656–664. Cited by: §1, §6.
  • [19] B. Langmead and S. L. Salzberg (2012) Fast gapped-read alignment with bowtie 2. Nature methods 9 (4), pp. 357. Cited by: §1, §6.
  • [20] R. Leinonen, H. Sugawara, M. Shumway, and I. N. S. D. Collaboration (2010) The sequence read archive. Nucleic acids research 39 (suppl_1), pp. D19–D21. Cited by: §1, §2.1.
  • [21] L. Liu, Y. Li, S. Li, N. Hu, Y. He, R. Pong, D. Lin, L. Lu, and M. Law (2012) Comparison of next-generation sequencing systems. BioMed Research International 2012. Cited by: §1.
  • [22] T. Magoč and S. L. Salzberg (2011) FLASH: fast length adjustment of short reads to improve genome assemblies. Bioinformatics 27 (21), pp. 2957–2963. Cited by: §1, §3.2.
  • [23] S. N. Z. Naqvi, S. Yfantidou, and E. Zimányi (2017) Time series databases and influxdb. Studienarbeit, Université Libre de Bruxelles. Cited by: §4.1.
  • [24] H. Nordberg, M. Cantor, S. Dusheyko, S. Hua, A. Poliakov, I. Shabalov, T. Smirnova, IV. Grigoriev, and D. I. (2014) The genome portal of the department of energy joint genome institute: 2014 updates.. Nucleic Acids Research 42, pp. D26–31. Cited by: §2.3.
  • [25] D. Parks et al. (2018) A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life. National Biotechnology. Cited by: §2.3.
  • [26] K. D. Pruitt, T. Tatusova, and D. R. Maglott (2006) NCBI reference sequences (refseq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic acids research 35 (suppl_1), pp. D61–D65. Cited by: §1, §2.3, §3.1.
  • [27] A. Purcell (2018) Basic biology: an introduction. Basic Biology Limited. External Links: ISBN 9780473440145, Link Cited by: §1.
  • [28] E. Quevillon, V. Silventoinen, S. Pillai, N. Harte, N. Mulder, R. Apweiler, and R. Lopez (2005) InterProScan: protein domains identifier. Nucleic acids research 33 (suppl_2), pp. W116–W120. Cited by: §1, §1, §3.4.
  • [29] E. W. Sayers, M. Cavanaugh, K. Clark, J. Ostell, K. D. Pruitt, and I. Karsch-Mizrachi (2019) GenBank. Nucleic acids research 47,D1, pp. D94–D99. Cited by: §3.5.
  • [30] S. C. Schuster (2007) Next-generation sequencing transforms today’s biology. Nature methods 5 (1), pp. 16. Cited by: §1.
  • [31] T. Seemann (2014) Prokka: rapid prokaryotic genome annotation. Bioinformatics 30 (14), pp. 2068–2069. Cited by: §1, §3.3.
  • [32] S. Sherry and C. Xiao (2012) Ncbi sra toolkit technology for next generation sequence data. In Plant and Animal Genome XX Conference (January 14-18, 2012). Plant and Animal Genome, pp. . Cited by: §1, §3.1.
  • [33] E. L. Sonnhammer, S. R. Eddy, and R. Durbin (1997) Pfam: a comprehensive database of protein domain families based on seed alignments. Proteins: Structure, Function, and Bioinformatics 28 (3), pp. 405–420. Cited by: §1.
  • [34] L. R. Thompson, J. G. Sanders, D. McDonald, A. Amir, J. K. Jansson, J. A. Gilbert, R. Knight, The Earth Microbiome Project Consortium, et al. (2017) A communal catalogue reveals earth’s multiscale microbial diversity. Nature 551, pp. 457–463. Cited by: §2.3.
  • [35] B. C. Weimer (2017) 100K pathogen genome project. Microbiology Resource Announcements 5 (28). External Links: Document, Link, https://mra.asm.org/content/5/28/e00594-17.full.pdf Cited by: §2.3.
  • [36] A. Wilke et al. (2012) The m5nr: a novel non-redundant database containing protein sequences and annotations from multiple sources and associated tools. BMC Bioinformatics 13 (141). Cited by: §2.3.
  • [37] D. E. Wood and S. L. Salzberg (2014) Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome biology 15 (3), pp. R46. Cited by: §1, §3.1, §6.