Genome Sequence Classification for Animal Diagnostics with Graph Representations and Deep Neural Networks

07/24/2020 ∙ by Sai Narayanan, et al. ∙ Oklahoma State University

Bovine Respiratory Disease Complex (BRDC) is a complex respiratory disease in cattle with multiple etiologies, including bacterial and viral. It is estimated that mortality, morbidity, therapy, and quarantine resulting from BRDC account for significant losses in the cattle industry. Early detection and management of BRDC are crucial in mitigating economic losses. Current animal disease diagnostics is based on traditional tests such as bacterial culture, serology, and Polymerase Chain Reaction (PCR) tests. Even though these tests are validated for several diseases, their main challenge is their limited ability to detect the presence of multiple pathogens simultaneously. Advancements in data analytics and machine learning applied to metagenome sequencing are setting trends in several applications. In this work, we demonstrate a machine learning approach to identify pathogen signatures present in bovine metagenome sequences using k-mer-based network embedding followed by a deep learning-based classification task. With experiments conducted on two different simulated datasets, we show that network-based machine learning approaches can detect pathogen signatures with up to 89.7% accuracy. We make the datasets available publicly upon request to tackle this important problem in a difficult domain.


I Introduction

Pathogen identification and disease diagnostics is an ever-evolving field in veterinary medicine. Early and accurate pathogen detection has always been the goal of infectious disease diagnosis. Traditional methods such as bacterial and viral cultures, even though considered gold standard tests, require days to weeks to identify pathogens. Testing protocols like Polymerase Chain Reaction (PCR) tests, based on the detection of unique DNA/RNA elements, are faster and considered to be very accurate for most infectious diseases in animals [24] and human beings [10]. Although well established, PCR poses several challenges. Conventional PCR protocols are target-specific with limited multiplexing capability. This creates the need to design and develop pathogen-specific probes and primers that can function under selected thermocyclic parameters to target multiple pathogens. Unless specifically targeted, these methods provide very little information regarding other co-infections or innate susceptibilities based on the genetic make-up of the host or alterations in commensal microbiomes. This necessitates the development and maintenance of multiple target-specific standard operating protocols (SOPs), which can be significantly time-consuming for National Animal Health Laboratory Network (NAHLN) laboratories developing these protocols and the member laboratories participating in testing services.

Genetic sequencing-based techniques have proven their importance in human [20] and veterinary [37] diagnostics. Decreasing hardware and testing costs over recent years make these techniques affordable for diagnostic applications. Specifically, shotgun metagenome sequencing aims to sequence the total genetic material from all sources in a clinical sample (i.e., from the host, pathogens, commensals, environmental components, etc.) without introducing bias [32]. Advancements in data analytics and machine learning have had a significant impact on genome sequencing across multiple domains. Developing machine learning models that can rapidly analyze large volumes of metagenome sequence data can immensely advance the field of animal disease diagnostics and help detect known and emerging pathogens in a single test.

Fig. 1: The overall framework for the proposed metagenome classification pipeline.

Large-scale graph/network-based machine learning tasks are continuously proving their importance in several domains like social science [3], biochemistry [41], and biology [4]. One such crucial task is to extract node and graph features, commonly known as network embedding, from their complex and unstructured organization in graph topologies. Network embedding can be extracted in multiple ways, ranging from graph kernels to feature extraction. Graphs or graph vertexes mapped to a low-dimensional feature vector space are utilized for machine learning applications like classification and clustering. Deep neural networks have shown a tremendous capacity to capture complex patterns in high-dimensional data, particularly in images [21], text [9], and even hardware security [30]. There have been emerging applications to learn robust representations from structured data such as graphs [31, 12]. We aim to build on these advances in a unified framework for genome sequence classification for animal diagnostics, particularly for detecting a single bacterial pathogen (Mannheimia haemolytica) associated with Bovine Respiratory Disease Complex (BRDC) in bovine (cattle) metagenomes. It is particularly important to identify this infection since it migrates to the lower respiratory tract, can compromise the immunity of the infected animal (as pneumonia), and can spread rapidly among crowded groups [18]. Hence, early diagnosis of BRDC plays a major role in mitigating losses.

In this work, we represent genome sequences as De-Bruijn graphs, experiment with multiple network representation learning techniques, including multiple graph kernels [35], node2vec [13], and graph2vec [27], to obtain representational features from the constructed De-Bruijn graphs, and use classic machine learning and deep learning methods to identify pathogens in animal genome sequences. In particular, we use the Bovine Respiratory Disease Complex (BRDC) to validate the use of machine learning-based animal diagnostics. In this initial phase of our research, we model the single bacterial pathogen detection problem as a classification task with existing graph-based machine learning approaches. Our overall approach is shown in Figure 1. We construct De-Bruijn graphs (Section III-B) to create a standard, structured representation of genome sequences. We then build vector representations of each graph through various existing embedding models (Section IV-A). The graph embedding is then used to distinguish between pathogen and host genome sequences using a deep neural network (Section IV), trained in a multi-task learning [6] setting.

Our contributions are three-fold: (i) we present the first large-scale, annotated genome sequence dataset for the diagnosis of a bacterial pathogen from bovine metagenome sequences, (ii) we show that De-Bruijn graphs can be extended to the diagnosis task using network embedding and deep neural networks, and (iii) to the best of our knowledge, our work obtains the state of the art in utilizing graph-based representations to classify pathogens from very large metagenome sequences.

II Related Work

II-A Genome Sequencing with Machine Learning

Diagnosis of pathogen genome sequences within larger animal metagenomes has traditionally been tackled through bioinformatics approaches such as k-mer frequency-based features [19]. The k-mer representation is a compositional feature representation in bioinformatics that captures the frequency of k-length subsequences within a larger genome sequence. Such representations have been shown to be effective in several metagenome diagnostic tasks such as chromatin accessibility prediction [25] and bacteria detection [8, 11], but have not had much success on longer sequence reads. Deep learning and machine learning models have provided automatic, higher-order feature learning beyond the pre-defined motif lengths used in k-mers. Convolutional neural networks (CNNs), primarily used in computer vision [21], have shown great success in genome sequence prediction and classification by capturing powerful, hierarchical feature representations [28]. Sequence-based models such as Recurrent Neural Networks and Long Short-Term Memory (LSTM) networks [15] have also been successfully applied to capture long-term dependencies in genome sequences [25].

II-B Genome Sequencing with Networks

Graphs or networks have been widely used structures for studying genomics across a variety of problems. Examples include genomic sub-compartment prediction [2], disease gene prediction [16], and genome assembly [22], to name a few. The most common approach to representing genome sequences as networks is with k-mers, where a long genome sequence is broken into shorter k-length sketches [23]. Over the years, the bioinformatics research community has introduced a variety of graph structures, built from k-mers, to study genome sequences. These include: De Bruijn graphs [29] and string graphs [26], which merge repeating genome patterns into one node in the network; linked De Bruijn graphs [38], which include metadata of genome sequences to store connectivity information; and variations of De Bruijn graphs like pufferfish [1], which was introduced for efficient query processing.

III Preliminaries

Dataset                           DS500 (k=6)   DS5000 (k=3)
# of positive/negative samples    500           5000
Total samples/graphs              1000          10000
k                                 6             3
Avg. no. of nodes                 -             -
Avg. no. of edges                 -             -
# of unique node labels           4089          64
Avg. clustering coeff.            -             -
TABLE I: Dataset summary and basic statistics of the De-Bruijn graphs used in the experiments

III-A Dataset Description

In all our experiments, we use simulated metagenomes generated from published bovine and pathogen reference genomes. Reference genomes were downloaded from the NCBI nucleotide database. Base-by-base simulation of Illumina (Illumina Inc., San Diego, CA, USA) reads was generated using the ART simulator, which simulates user-defined quality score distributions and error rates for the first and second Illumina sequencing reads. The sequence output, generally referred to as reads, was simulated with a minimum Q-score (Quality Score) of 28 and a maximum Q-score of 35. ART follows empirical error rate calculations based on the Q-score used for simulation. Following the simulation, a decreasing ratio of pathogen (Mannheimia haemolytica) genomic reads was added to the 5,000,000 reads of the bovine genome to simulate large metagenome datasets from bovine lung samples.

To keep our experiments simple, we prepare two randomly sampled datasets with an equal distribution of host and pathogen (the class variable) from a huge collection of metagenome sequences. We use one small dataset with 500 samples of each class and one large dataset with 5,000 samples of each class.

We make the dataset samples used in our experiments available to the public upon request.

III-B k-mer and De-Bruijn Graphs

A metagenome sequence of length $L$ in a genome dataset can be represented as $S = (n_1, n_2, \ldots, n_L)$, where each $n_i$ is a nucleotide represented by one of four characters: $\{A, C, G, T\}$. These complex biological sequences must be converted into numeric features to perform many machine learning tasks. We use k-mers to break each metagenome sequence of length $L$ into small sub-sequences of length $k$. The series of these small sub-sequences in each metagenome sequence is often treated like words in sentences and is used to extract numeric features from semantic representations in the sequence. Our aim in this work is to efficiently detect pathogen sequences with minimum algorithm run time from a large corpus of genome data. As we increase the value of $k$, the complexity of extracting sub-sequences and feature representations increases, as described in the following sections. We therefore experiment with only small $k$ values on our two data samples: for the small dataset (DS500) we use $k = 6$, and for the large dataset (DS5000) we use $k = 3$. However, we report machine learning model performance for multiple $k$ in Section V.

We use a classic method to assemble the series of k-mer sub-sequences into a graph/network structure called a De-Bruijn graph. We convert each series of metagenome sub-sequences into a De-Bruijn graph $G = (V, E)$, where $V$ is the set of nodes/vertexes, $E$ is the set of edges, and $E \subseteq V \times V$. Each vertex $v \in V$ represents a k-mer, and we label each vertex with its k-mer. Each edge $(u, v) \in E$ represents an ordered pair of k-mers $u$ and $v$, where $v$ immediately follows $u$ in the sequence (the two overlap in $k - 1$ characters). A small toy example converting a very short metagenome sequence into De-Bruijn graphs is given in Figure 2. Unlike in this example, the number of nodes in De-Bruijn graphs grows exponentially in real-world data as the k value increases. In Table I we give basic statistics of the De-Bruijn graphs constructed from our two data samples (DS500 and DS5000). As reported in Table I, the number of nodes in a graph and the total number of unique nodes (node labels) increase exponentially with the value of k, and because of this the amount of clustering varies significantly, as shown by the clustering coefficient.

Fig. 2: Toy example of mapping a sample metagenome read to a De-Bruijn graph with k-mers using k = (3, 6). For longer reads, the graphs become bigger and more connected.
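The construction described above is straightforward to prototype. The following is a minimal sketch, not the authors' pipeline: the read is a hypothetical toy string, and networkx is used only to hold the graph.

```python
# Minimal sketch of k-mer extraction and De-Bruijn graph construction.
# The read below is a hypothetical toy string, not from the datasets.
import networkx as nx

def kmers(sequence, k):
    """Yield all overlapping k-length sub-sequences of a read."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]

def debruijn_graph(sequence, k):
    """One node per unique k-mer; one directed edge per ordered pair of
    consecutive k-mers (which overlap in k-1 characters)."""
    g = nx.DiGraph()
    prev = None
    for kmer in kmers(sequence, k):
        g.add_node(kmer)            # the node label is the k-mer itself
        if prev is not None:
            g.add_edge(prev, kmer)
        prev = kmer
    return g

read = "ATGGCGTGCA"                 # toy read
g = debruijn_graph(read, k=3)
print(g.number_of_nodes(), g.number_of_edges())   # 8 nodes, 7 edges
```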

IV Methodology

IV-A Graph Embedding

Network/graph embedding can be of two types: mapping vertexes of a graph to a lower-dimensional feature space [13, 14], and mapping the given graph itself into a vector space [27, 40]. In this work, we consider the latter. Given a set of graphs $\mathcal{G} = \{G_1, G_2, \ldots, G_N\}$, where each graph is a set of vertexes and edges, $G_i = (V_i, E_i)$, graph embedding is a learning function $f : \mathcal{G} \rightarrow \mathbb{R}^d$ that translates a graph into a low-dimensional feature vector of size $d$. The learning function uses the graph topology, learns the organization of nodes and subgraphs, and embeds the learned information with a statistical or deep learning approach. The objective of these learned feature representations is that topologically similar graphs cluster together. In this work, we use a set of labeled De-Bruijn graphs to learn a graph embedding for each De-Bruijn graph. We exploit multiple graph embedding techniques to capture the best feature representation from our De-Bruijn graphs for the subsequent binary classification task. In particular, we use multiple graph kernel approaches and unsupervised network embedding approaches.

IV-A1 Graph Kernels

Graph kernels [39] are preliminary forms of obtaining graph features based on graph substructures like shortest paths [5], random walks [17], graphlets [34], and isomorphism [33]. The main objective of graph kernels is to measure the similarity between all graph pairs in the corpus based on the substructures mentioned above, which results in an $N \times N$ feature matrix given that there are $N$ graphs in the corpus.

IV-A2 Unsupervised Representation Learning

Unlike graph kernels, unsupervised network representation learning learns the feature space directly from the given graphs and their structural organization. We use two existing methodologies to obtain network embeddings.

  • node2vec [13] is a local, node-level embedding model that learns low-dimensional representations of each node in a graph by optimizing a neighborhood-preserving objective function. It uses iterations of random walks to gather neighborhood details and learn contextualized representations of nodes that preserve structural equivalence and homophily. For a De-Bruijn graph $G$, we use node2vec to construct local, node-level representations for each node in $V$. We obtain a global, graph-level representation of the De-Bruijn graph by averaging the representations of all nodes in $G$ (see the sketch after this list).

  • graph2vec [27] is a transductive approach that constructs graph embeddings from node labels given by the Weisfeiler-Lehman (WL) graph kernel [33] and random walks. graph2vec assigns the node labels given by WL graph kernels as initial representations. Similar to node2vec, node representations are averaged with representations from their neighbors. This graph representation is passed to an LSTM encoder function to obtain the final graph representations.
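As a rough illustration of the averaging strategy in the node2vec item above, the sketch below uses the community node2vec package; the paper does not name its implementation, and the parameter values here are illustrative rather than the paper's settings.

```python
# Sketch: graph-level embedding by averaging node2vec node vectors.
# Uses the community `node2vec` package (an assumption); all parameter
# values are illustrative.
import numpy as np
from node2vec import Node2Vec

def graph_embedding(g, dim=128):
    n2v = Node2Vec(g, dimensions=dim, walk_length=10, num_walks=20,
                   p=1.0, q=1.0, workers=1)
    model = n2v.fit(window=5, min_count=1)   # gensim Word2Vec under the hood
    # Average all node vectors to obtain one vector for the whole graph.
    vecs = np.array([model.wv[str(node)] for node in g.nodes()])
    return vecs.mean(axis=0)

emb = graph_embedding(g)   # `g`: a De-Bruijn graph, e.g. from Section III-B
print(emb.shape)           # (128,)
```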

IV-B Multi-task Deep Networks for Genome Classification

For classifying each genome sequence, we use a deep, multi-layer neural network. The proposed neural architecture has three sub-networks: an encoder, a classification network, and a decoder network. The encoder network learns a lower-dimensional feature representation from the $d$-dimensional input embedding. This representation is called the latent space and is typically trained to ignore noise and model the underlying pattern. A decoder network reconstructs the input embedding from the latent space. Finally, the classification network distinguishes between the classes using the latent space. This process is outlined in Figure 1. Combined, these networks are trained in a multi-task learning setting [6], where the two tasks are classification and input reconstruction, respectively.

IV-B1 The Need for Multi-task Learning

In traditional feed-forward neural networks used for classification, the training objective is to learn internal representations that allow for robust classification of the input. However, given that genome sequences from the host and pathogen can have highly overlapping k-mers, and hence similar network embeddings, the encoder can over-fit to the training distribution due to the highly specific objective function. To overcome this limitation, we propose the use of a decoder network and a classification network trained in tandem with the encoder network. The objective now becomes learning an internal representation that jointly models the underlying distribution for both classification and input reconstruction. The encoder network and the latent space are shared between the classification and reconstruction heads, which reduces the risk of over-fitting by providing implicit regularization and reducing the representation bias in the network. Here, representation bias refers to the tendency of neural networks to learn representations so highly specific to a certain task and its training data that they prevent the model from generalizing to unseen samples.

The multi-task loss function for the proposed framework is given by

$$\mathcal{L} = \lambda_1 \mathcal{L}_{c} + \lambda_2 \mathcal{L}_{r} \qquad (1)$$

where $\mathcal{L}_{c}$ refers to the weighted cross-entropy loss and $\mathcal{L}_{r}$ refers to the reconstruction loss from the decoder head; $\lambda_1$ and $\lambda_2$ are modulating factors that balance the loss function between classification performance and reconstruction penalty. The reconstruction loss is the L2 difference between the reconstructed embedding ($\hat{x}$) and the actual input embedding ($x$) and is given by $\mathcal{L}_{r} = \frac{1}{d}\sum_{i=1}^{d}(x_i - \hat{x}_i)^2$, where $d$ is the dimension of the input embedding.
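A minimal sketch of Equation (1), assuming a PyTorch implementation; the modulating-factor defaults and class weights below are placeholders.

```python
# Sketch of the multi-task loss in Equation (1), assuming PyTorch.
# lambda1, lambda2, and class_weights are placeholder values.
import torch.nn.functional as F

def multitask_loss(logits, labels, x_hat, x, class_weights,
                   lambda1=1.0, lambda2=1.0):
    l_cls = F.cross_entropy(logits, labels, weight=class_weights)  # weighted CE
    l_rec = F.mse_loss(x_hat, x)   # mean L2 difference over the d dimensions
    return lambda1 * l_cls + lambda2 * l_rec
```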

Note that this is different from only pre-training the encoder and decoder networks as an auto-encoder. While the auto-encoder objective is to reconstruct the original input through a compressed representation, there can exist an inherent representation bias that causes the network to produce low-quality reconstructions and hence a poor latent space representation of unseen examples. The low inter-class variation in genome sequences can cause the network to learn noisy representations and hence reduce the classification accuracy. We highlight the importance of the multi-task learning objective in Section V, with a neural network baseline that is trained only with the classification loss after pre-training as an autoencoder with the L2 reconstruction loss.

IV-B2 Implementation Details

Since the proposed architecture has a complex structure, we provide the implementation details here. The encoder network has five (5) fully connected (dense) layers. We intersperse each encoder layer with a dropout layer [36]. We reduce the dimensions of the input at each fully connected (dense) layer. The decoder network (for reconstruction) consists of two densely connected layers that increase the encoded features back to the original dimension. The classification network has two (2) densely connected layers that take the encoded representation as input and produce the genome classification. This is the only part of the network that is trained in a supervised manner; the encoder and decoder networks are trained in an unsupervised manner.
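A sketch of this three-headed architecture, assuming PyTorch. The layer widths, activations, and dropout probability are illustrative assumptions; the exact values are not reproduced here.

```python
# Sketch of the encoder/decoder/classifier architecture, assuming PyTorch.
# Layer widths, activations, and dropout probability are illustrative.
import torch.nn as nn

def make_encoder(d, p_drop=0.2):
    """Five dense layers, each followed by dropout, shrinking d stepwise."""
    dims = [d, d // 2, d // 4, d // 8, d // 16, d // 32]
    layers = []
    for d_in, d_out in zip(dims[:-1], dims[1:]):
        layers += [nn.Linear(d_in, d_out), nn.ReLU(), nn.Dropout(p_drop)]
    return nn.Sequential(*layers), dims[-1]

class GenomeClassifier(nn.Module):
    def __init__(self, d, n_classes=2):
        super().__init__()
        self.encoder, z = make_encoder(d)
        # Decoder: two dense layers back up to the input dimension.
        self.decoder = nn.Sequential(nn.Linear(z, d // 4), nn.ReLU(),
                                     nn.Linear(d // 4, d))
        # Classifier: two dense layers from the latent space to the classes.
        self.classifier = nn.Sequential(nn.Linear(z, z * 2), nn.ReLU(),
                                        nn.Linear(z * 2, n_classes))

    def forward(self, x):
        latent = self.encoder(x)
        return self.classifier(latent), self.decoder(latent)
```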

Due to the limited training data and the low inter-class variation, neural networks can over-fit to the training data and fail to generalize to variations induced by noise in the observations. To overcome this, we propose the following training protocol. First, we perform a cold start: for ten epochs, we train the encoder and decoder networks as a traditional auto-encoder with a very low learning rate. Then, we freeze the decoder network and train the encoder and classifier branch in a supervised manner for a fixed number of epochs. Finally, the entire network is trained end-to-end. The learning rate schedule and the varying objective functions help prevent over-fitting and learn robust features for classification.
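Putting the protocol together, a sketch assuming PyTorch. Apart from the ten cold-start epochs stated above, the epoch counts and learning rates are placeholders, since the original values are not reproduced here.

```python
# Sketch of the three-phase training protocol, assuming PyTorch.
# Except for the ten cold-start epochs, epoch counts and learning
# rates are placeholders.
import torch
import torch.nn.functional as F

def train_phases(model, loader, class_weights):
    def run(epochs, lr, params, use_cls, use_rec):
        opt = torch.optim.Adam(params, lr=lr)
        for _ in range(epochs):
            for x, y in loader:
                logits, x_hat = model(x)
                loss = 0.0
                if use_cls:
                    loss = loss + F.cross_entropy(logits, y, weight=class_weights)
                if use_rec:
                    loss = loss + F.mse_loss(x_hat, x)
                opt.zero_grad(); loss.backward(); opt.step()

    # Phase 1: cold start -- train encoder+decoder as an auto-encoder
    # for ten epochs with a very low learning rate.
    run(10, 1e-5, list(model.encoder.parameters())
                + list(model.decoder.parameters()), False, True)
    # Phase 2: freeze the decoder; train encoder+classifier supervised.
    for p in model.decoder.parameters():
        p.requires_grad = False
    run(20, 1e-3, [p for p in model.parameters() if p.requires_grad],
        True, False)
    # Phase 3: unfreeze; train end-to-end with the multi-task loss.
    for p in model.decoder.parameters():
        p.requires_grad = True
    run(20, 1e-4, model.parameters(), True, True)
```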

V Experiments and Results

Baseline (k-mer frequency + KMeans): 47.29

Embedding Type   LG            SVM           NN            DL
SPK              53.58 ± 0.05  53.1          58.59 ± 0.23  63.80 ± 0.19
WLK              55.87 ± 0.06  53.8          57.81 ± 0.53  61.41 ± 0.38
GSK              58.33 ± 0.07  52.9          56.25 ± 0.51  60.16 ± 0.95
RWK              59.34 ± 0.11  50.8          54.69 ± 0.83  56.41 ± 0.72
node2vec         56.53 ± 0.23  56.8 ± 0.11   63.28 ± 0.96  73.49 ± 0.69
graph2vec        58            52.16         60.19         66.42
TABLE II: Classification accuracy of multiple graph embedding techniques with Logistic Regression (LG), SVM, a neural network baseline (NN), and the proposed Deep Learning model (DL) on graphs constructed with 6-mers from 1000 samples (500 positive and 500 negative)

In this section, we evaluate multiple graph embedding techniques by comparing the proposed deep learning classifier with multiple classification algorithms for pathogen prediction on the datasets described in Section III-A. Along with model performance, we also report results obtained by tuning parameters in the De-Bruijn graph generation and graph embedding algorithms.

V-A Baseline Models

We conduct the following baseline experiments for extracting graph embeddings, against which we compare the models developed with network embeddings and the deep learning classifier.

V-A1 Naive and unsupervised approach

In this simple approach, we use the k-mers extracted from the metagenome sequences themselves, as described in Section III-B, with a simple unsupervised KMeans algorithm (with two clusters) to cluster host and pathogen metagenome sequences. For this approach, we first generate k-mer sub-sequences ($k = 6$ for DS500 and $k = 3$ for DS5000). For each metagenome sequence of length $L$, we generate a feature vector of length $4^k$, where each component of the feature vector represents the normalized frequency of the corresponding k-mer sub-sequence. With this feature vector, we use the KMeans algorithm to separate host and pathogen metagenome sequences.
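A minimal sketch of this baseline, assuming scikit-learn; the reads below are hypothetical placeholders.

```python
# Sketch of the naive baseline: normalized 4^k k-mer frequency vectors
# clustered by KMeans into two groups (host vs. pathogen).
from itertools import product
import numpy as np
from sklearn.cluster import KMeans

def kmer_frequency_vector(sequence, k):
    vocab = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    vec = np.zeros(len(vocab))            # one component per possible k-mer
    for i in range(len(sequence) - k + 1):
        vec[vocab[sequence[i:i + k]]] += 1
    return vec / max(vec.sum(), 1)        # normalized frequencies

reads = ["ATGGCGTGCA", "GGCATTACGT"]      # placeholder reads
X = np.stack([kmer_frequency_vector(r, k=3) for r in reads])
clusters = KMeans(n_clusters=2, n_init=10).fit_predict(X)
```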

V-A2 Graph kernels

We use the following four types of graph kernels to compute pairwise similarities between graphs.

  • Shortest Path kernel (SPK) [5]: Similarity is determined by comparing the lengths of shortest paths between nodes in the two graphs.

  • Weisfeiler-Lehman kernel (WLK) [33]: Graph similarity is calculated based on graph isomorphism.

  • Graphlets Sampling kernel (GSK) [34]: Based on the number of shared subgraph structures (graphlets) in the two graphs.

  • Random Walk kernel (RWK) [17]: Determines the similarity of two graphs by counting matching random walks on the two graphs.

Each graph kernel generates an $N \times N$ feature space matrix.
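A sketch of computing such an $N \times N$ kernel matrix with the GraKeL library [35]; the reads are placeholders, the graphs are kept undirected for simplicity, and the API usage reflects our understanding of GraKeL rather than the paper's exact code.

```python
# Sketch: N x N graph-kernel matrices with GraKeL [35]. Reads are
# placeholders; graphs are undirected here for simplicity.
import networkx as nx
from grakel.utils import graph_from_networkx
from grakel.kernels import ShortestPath, WeisfeilerLehman

def labeled_debruijn(read, k):
    g = nx.Graph()
    prev = None
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        g.add_node(kmer, label=kmer)      # GraKeL reads this node attribute
        if prev is not None:
            g.add_edge(prev, kmer)
        prev = kmer
    return g

reads = ["ATGGCGTGCA", "GGCATTACGT", "TTACGGATGC"]   # placeholder reads
graphs = list(graph_from_networkx(
    [labeled_debruijn(r, 3) for r in reads], node_labels_tag="label"))

for Kernel in (ShortestPath, WeisfeilerLehman):
    K = Kernel(normalize=True).fit_transform(graphs)  # (N, N) similarity matrix
    print(K.shape)
```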

V-A3 Unsupervised Graph Representations

For both node2vec and graph2vec, we use the same output embedding size, number of walks per node, and walk length across experiments. For node2vec, we additionally set its return parameter $p$ and in-out parameter $q$.

V-B Experiment Setup

Along with the performance of our deep learning classifier, we use the following vector-space models with the given parameters to compare classification performance. We use 10-fold cross validation for all datasets and embedding types, run 10 iterations of each cross validation test, and report the average accuracy and standard deviation.
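A sketch of this evaluation protocol, assuming scikit-learn; the embeddings, labels, and classifier below are random placeholders.

```python
# Sketch: 10 iterations of 10-fold cross validation, reporting mean
# accuracy and standard deviation. Data and classifier are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))         # placeholder graph embeddings
y = rng.integers(0, 2, size=200)       # placeholder host/pathogen labels

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(f"{scores.mean():.2%} +/- {scores.std():.2%}")
```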

V-B1 Logistic Regression

We use simple Logistic Regression with regularization parameter C, the lbfgs solver, and an l2 penalty.

V-B2 Support Vector Machine

We use the C-SVM classifier from LIBSVM [7]. We used grid search and cross validation to determine the value of C and the kernel type. For all datasets and embeddings, the selected kernel type is the radial basis function (RBF).
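A sketch of this selection procedure, assuming scikit-learn (whose SVC is built on LIBSVM [7]); the grid values and data are illustrative.

```python
# Sketch: choosing C and the kernel via grid search with cross
# validation. Grid values and data are illustrative placeholders.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 16))    # placeholder embeddings
y_train = rng.integers(0, 2, size=100)  # placeholder labels

grid = GridSearchCV(SVC(),
                    param_grid={"C": [0.1, 1, 10, 100],
                                "kernel": ["rbf", "linear"]},
                    cv=10)
grid.fit(X_train, y_train)
print(grid.best_params_)   # the paper reports selecting the RBF kernel
```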

V-B3 Neural networks

We evaluate the proposed model (described in Section IV) and a standard neural network baseline, denoted DL and NN, respectively. The NN baseline has the same network structure as DL but lacks the multi-task training objective. It is trained with the weighted cross-entropy loss function for classification, after pre-training as an auto-encoder.

V-C Hyperparameter Selection

To configure the learning rate for the DL and NN models, we performed a sensitivity analysis using a grid search over a log scale, which highlights the order of magnitude at which the best learning rate can be found. The process was repeated for the two datasets since they used different batch sizes. The dropout probability was set after manual tuning over a small range of values. The modulating factors $\lambda_1$ and $\lambda_2$ (from Equation 1) were likewise set after manual tuning over a range of candidate values. We find that using $\lambda_1$ and $\lambda_2$ to modulate the final loss allows for faster convergence during training.

Baseline (k-mer frequency + KMeans): 32.14

Embedding Type   LG            SVM           NN            DL
SPK              60.25 ± 0.14  61.27         57.23 ± 0.37  67.9 ± 0.29
WLK              62.4 ± 0.09   61.11         57.03 ± 0.28  61.6 ± 0.89
GSK              64.76 ± 0.12  64.83         60.93 ± 0.16  62.81 ± 0.76
RWK              -             -             -             -
node2vec         78.84 ± 0.14  83.7          81.78 ± 0.62  89.74 ± 0.60
graph2vec        57.45 ± 0.01  54.15         55.83         59.62
TABLE III: Classification accuracy of multiple graph embedding techniques with Logistic Regression (LG), SVM, a neural network baseline (NN), and the proposed Deep Learning model (DL) on graphs constructed with 3-mers from 10000 samples (5000 positive and 5000 negative)

V-D Quantitative Evaluation

In Tables II and III, we give the average accuracy (in percentage) and standard deviation for DS500 and DS5000, respectively. We report the standard deviation only when it is significant. We report the results of each graph embedding model (SPK, WLK, RWK, GSK, node2vec, and graph2vec) with each classifier (LG, SVM, NN, and DL), as discussed in the previous section, from 10 iterations of 10-fold cross validation. Throughout our experiments, we compare the results of our small data sample (DS500) and large data sample (DS5000).

We outperform the naive k-mer frequency-based feature representation by a large margin on both datasets. The proposed deep neural network with the multi-task training objective outperforms the other baselines, including a similar neural network without the multi-task objective. The competitive performance of the logistic regression and SVM models indicates that the learned embeddings (constructed from the De-Bruijn graphs) are a good feature representation of the genome sequences.

Among the various embedding approaches, node2vec outperforms all other approaches by a large margin on both datasets, providing gains of more than 26 points over the naive k-mer frequency features and about 7 points over the closely related graph2vec embedding on the DS500 dataset (Table II). Similar gains can be seen on the DS5000 dataset. Among the graph kernels, the shortest path kernel (SPK) embedding provides competitive performance, and the graphlets sampling kernel (GSK) is more resilient across the different classifiers.

V-E Ablative Studies

We also evaluate different variations of our approach to test the effectiveness of each component in the framework. Namely, we evaluate two variations: (i) the effect of the sub-sequence length, and (ii) the effect of multi-task learning.

V-E1 Length of Sub-sequence (k)

First, we vary the sub-sequence length $k$ used to construct the De-Bruijn graphs and evaluate the performance of our full model (DL) on both the DS500 and DS5000 datasets. We use two of the best performing embedding models (node2vec and the shortest path graph kernel) and summarize the results in Figure 3. We find that node2vec outperforms SPK at all values of $k$. Interestingly, increasing $k$ has a consistently detrimental effect on both embedding methods across the two datasets. This could arguably be attributed to the fact that the number of unique nodes increases with $k$, and hence the amount of variability is more pronounced. The node2vec embedding is the most affected by the change in sub-sequence length, especially on the DS500 dataset, where the accuracy drops substantially. The SPK embedding provides more consistent performance across sub-sequence lengths. This can be attributed to the presence of fewer tottering walks [5] in the De-Bruijn graphs constructed from the genome sequences.

V-E2 Effect of Training Objective

We also perform ablations on the proposed deep learning model. Specifically, we keep the network structure static and vary the training objective to exclude the multi-task learning setting; instead, we train the network in a single-task, classification-only setting, as is the case with traditional neural network applications. As can be seen from Tables II and III, the multi-task objective function provides consistent improvements on both datasets and across most embedding methods. On average, the multi-task objective improves accuracy on both the DS500 and DS5000 datasets, indicating its tendency to prevent over-fitting on smaller datasets, for which it was designed. Additionally, we find that training the network without a decoder network reduces the accuracy of the model on both datasets.

Fig. 3: Accuracy of the deep learning classifier for graph embeddings with the shortest path graph kernel and node2vec at varying $k$ for (a) the DS500 and (b) the DS5000 datasets.

VI Conclusion and Future Work

Detection of pathogens like Mannheimia haemolytica is of huge importance in animal diagnostics, particularly for BRDC. In this work, we showed that machine learning models equipped with network embeddings and a deep learning classifier can help identify pathogen metagenome sequences among host metagenome sequences. Our experiments showed that the adopted machine learning approach predicts pathogen metagenome sequences with a large margin in accuracy over baseline models.

This work was conducted entirely on small and large data samples from simulated genomic data to validate the role of machine learning in veterinary medicine. For simplicity, we generated the simulated genome sequences to contain only a single pathogen in the host and balanced the sequences of host and pathogen. Real-world data is quite the opposite: real animal genome sequences contain limited pathogen sequences, and the data is highly unbalanced. Given the promising performance in this work, there are multiple ways to extend it in the future. An end-to-end framework specifically for pathogen detection in animal genome sequences can be developed; such a framework could utilize ensemble learning over multiple models to obtain the best prediction accuracy. More importantly, more sophisticated models that learn from unbalanced data distributions to predict multiple pathogen sequences can be proposed. Pathogenicity is also relative to the host: for example, the pathogen used in this work, Mannheimia haemolytica, is not necessarily a pathogen for all other hosts. This relative mapping between host and pathogen can be considered in future prediction models.

Acknowledgment

This research was supported in part by the US Department of Agriculture (USDA) grants AP20VSD and B000C011.

References

  • [1] F. Almodaresi, H. Sarkar, A. Srivastava, and R. Patro (2018) A space and time-efficient index for the compacted colored de bruijn graph. Bioinformatics 34 (13), pp. i169–i177. Cited by: §II-B.
  • [2] H. Ashoor, X. Chen, W. Rosikiewicz, J. Wang, A. Cheng, P. Wang, Y. Ruan, and S. Li (2020) Graph embedding and unsupervised learning predict genomic sub-compartments from Hi-C chromatin interaction data. Nature Communications 11 (1), pp. 1–11. Cited by: §II-B.
  • [3] A. Bagavathi, P. Bashiri, S. Reid, M. Phillips, and S. Krishnan (2019) Examining untempered social media: analyzing cascades of polarized conversations. In Proceedings of the 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, pp. 625–632. Cited by: §I.
  • [4] D. L. Barabási and A. Barabási (2020) A genetic model of the connectome. Neuron 105 (3), pp. 435–445. Cited by: §I.
  • [5] K. M. Borgwardt and H. Kriegel (2005) Shortest-path kernels on graphs. In Fifth IEEE international conference on data mining (ICDM’05), pp. 8–pp. Cited by: §IV-A1, 1st item, §V-E1.
  • [6] R. Caruana (1997) Multitask learning. Machine learning 28 (1), pp. 41–75. Cited by: §I, §IV-B.
  • [7] C. Chang and C. Lin (2011) LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST) 2 (3), pp. 1–27. Cited by: §V-B2.
  • [8] C. Deneke, R. Rentzsch, and B. Y. Renard (2017) PaPrBaG: a machine learning approach for the detection of novel pathogens from NGS data. Scientific Reports 7 (1), pp. 1–13. Cited by: §II-A.
  • [9] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §I.
  • [10] E. M. Elnifro, A. M. Ashshi, R. J. Cooper, and P. E. Klapper (2000) Multiplex PCR: optimization and application in diagnostic virology. Clinical Microbiology Reviews 13 (4), pp. 559–570. Cited by: §I.
  • [11] A. Fiannaca, L. La Paglia, M. La Rosa, G. Renda, R. Rizzo, S. Gaglio, A. Urso, et al. (2018) Deep learning models for bacteria taxonomic classification of metagenomic data. BMC bioinformatics 19 (7), pp. 198. Cited by: §II-A.
  • [12] A. Fout, J. Byrd, B. Shariat, and A. Ben-Hur (2017) Protein interface prediction using graph convolutional networks. In Advances in Neural Information Processing Systems, pp. 6530–6539. Cited by: §I.
  • [13] A. Grover and J. Leskovec (2016) Node2vec: scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 855–864. Cited by: §I, 1st item, §IV-A.
  • [14] W. Hamilton, Z. Ying, and J. Leskovec (2017) Inductive representation learning on large graphs. In Advances in neural information processing systems, pp. 1024–1034. Cited by: §IV-A.
  • [15] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §II-A.
  • [16] S. Hwang, C. Y. Kim, S. Yang, E. Kim, T. Hart, E. M. Marcotte, and I. Lee (2019) HumanNet v2: human gene networks for disease research. Nucleic acids research 47 (D1), pp. D573–D580. Cited by: §II-B.
  • [17] U. Kang, H. Tong, and J. Sun (2012) Fast random walk graph kernel. In Proceedings of the 2012 SIAM international conference on data mining, pp. 828–838. Cited by: §IV-A1, 4th item.
  • [18] C. L. Klima, T. W. Alexander, S. Hendrick, and T. A. McAllister (2014) Characterization of mannheimia haemolytica isolated from feedlot cattle that were healthy or treated for bovine respiratory disease. Canadian Journal of Veterinary Research 78 (1), pp. 38–45. Cited by: §I.
  • [19] S. Koren, B. P. Walenz, K. Berlin, J. R. Miller, N. H. Bergman, and A. M. Phillippy (2017) Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome research 27 (5), pp. 722–736. Cited by: §II-A.
  • [20] C. U. Köser, M. J. Ellington, E. J. Cartwright, S. H. Gillespie, N. M. Brown, M. Farrington, M. T. Holden, G. Dougan, S. D. Bentley, J. Parkhill, et al. (2012) Routine use of microbial whole genome sequencing in diagnostic and public health microbiology. PLoS pathogens 8 (8), pp. e1002824. Cited by: §I.
  • [21] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105. Cited by: §I, §II-A.
  • [22] Y. Lin, J. Yuan, M. Kolmogorov, M. W. Shen, M. Chaisson, and P. A. Pevzner (2016) Assembly of long error-prone reads using de bruijn graphs. Proceedings of the National Academy of Sciences 113 (52), pp. E8396–E8405. Cited by: §II-B.
  • [23] G. Marçais, D. DeBlasio, and C. Kingsford (2018) Asymptotically optimal minimizers schemes. Bioinformatics 34 (13), pp. i13–i22. Cited by: §II-B.
  • [24] M. Maurin (2012) Real-time PCR as a diagnostic tool for bacterial diseases. Expert Review of Molecular Diagnostics 12 (7), pp. 731–754. Cited by: §I.
  • [25] X. Min, W. Zeng, N. Chen, T. Chen, and R. Jiang (2017) Chromatin accessibility prediction via convolutional long short-term memory networks with k-mer embedding. Bioinformatics 33 (14), pp. i92–i101. Cited by: §II-A.
  • [26] E. W. Myers (2005) The fragment assembly string graph. Bioinformatics 21 (suppl_2), pp. ii79–ii85. Cited by: §II-B.
  • [27] A. Narayanan, M. Chandramohan, R. Venkatesan, L. Chen, Y. Liu, and S. Jaiswal (2017) graph2vec: learning distributed representations of graphs. arXiv preprint arXiv:1707.05005. Cited by: §I, 2nd item, §IV-A.
  • [28] T. H. Nguyen, Y. Chevaleyre, E. Prifti, N. Sokolovska, and J. Zucker (2017) Deep learning for metagenomic data: using 2d embeddings and convolutional neural networks. arXiv preprint arXiv:1712.00244. Cited by: §II-A.
  • [29] P. A. Pevzner, H. Tang, and M. S. Waterman (2001) An Eulerian path approach to DNA fragment assembly. Proceedings of the National Academy of Sciences 98 (17), pp. 9748–9753. Cited by: §II-B.
  • [30] V. L. Ramnath, S. N. Aakur, and S. Katkoori (2019) Latent space modeling for cloning encrypted puf-based authentication. In IFIP International Internet of Things Conference, pp. 142–158. Cited by: §I.
  • [31] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini (2008) The graph neural network model. IEEE Transactions on Neural Networks 20 (1), pp. 61–80. Cited by: §I.
  • [32] T. J. Sharpton (2014) An introduction to the analysis of shotgun metagenomic data. Frontiers in Plant Science 5, pp. 209. Cited by: §I.
  • [33] N. Shervashidze, P. Schweitzer, E. J. Van Leeuwen, K. Mehlhorn, and K. M. Borgwardt (2011) Weisfeiler-lehman graph kernels.. Journal of Machine Learning Research 12 (9). Cited by: 2nd item, §IV-A1, 2nd item.
  • [34] N. Shervashidze, S. Vishwanathan, T. Petri, K. Mehlhorn, and K. Borgwardt (2009) Efficient graphlet kernels for large graph comparison. In Artificial Intelligence and Statistics, pp. 488–495. Cited by: §IV-A1, 3rd item.
  • [35] G. Siglidis, G. Nikolentzos, S. Limnios, C. Giatsidis, K. Skianis, and M. Vazirgiannis (2020) GraKeL: a graph kernel library in python.. Journal of Machine Learning Research 21 (54), pp. 1–5. Cited by: §I.
  • [36] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958. Cited by: §IV-B2.
  • [37] M. C. F. Thomsen, J. Ahrenfeldt, J. L. B. Cisneros, V. Jurtz, M. V. Larsen, H. Hasman, F. M. Aarestrup, and O. Lund (2016) A bacterial analysis platform: an integrated system for analysing bacterial whole genome sequencing data for clinical diagnostics and surveillance. PloS one 11 (6), pp. e0157718. Cited by: §I.
  • [38] I. Turner, K. V. Garimella, Z. Iqbal, and G. McVean (2018) Integrating long-range connectivity information into de bruijn graphs. Bioinformatics 34 (15), pp. 2556–2565. Cited by: §II-B.
  • [39] S. V. N. Vishwanathan, N. N. Schraudolph, R. Kondor, and K. M. Borgwardt (2010) Graph kernels. The Journal of Machine Learning Research 11, pp. 1201–1242. Cited by: §IV-A1.
  • [40] M. Zhang, Z. Cui, M. Neumann, and Y. Chen (2018) An end-to-end deep learning architecture for graph classification. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §IV-A.
  • [41] W. Zhang, J. Chien, J. Yong, and R. Kuang (2017) Network-based machine learning and graph theory algorithms for precision oncology. NPJ precision oncology 1 (1), pp. 1–15. Cited by: §I.