I Introduction
Pathogen identification and disease diagnostics is an ever-evolving field in veterinary medicine. Early and accurate pathogen detection has always been the goal of infectious disease diagnosis. Traditional methods such as bacterial and viral cultures, even though considered gold-standard tests, require days to weeks to identify pathogens. Testing protocols like the Polymerase Chain Reaction (PCR), based on the detection of unique DNA/RNA elements, are faster and considered very accurate for most infectious diseases in animals [24] and humans [10]. Although well established, PCR poses several challenges. Conventional PCR protocols are target-specific, with limited multiplexing capability. Targeting multiple pathogens therefore requires designing and developing pathogen-specific probes and primers that can function under the selected thermocycling parameters. Unless specifically targeted, these methods provide very little information about co-infections, innate susceptibilities based on the genetic makeup of the host, or alterations in commensal microbiomes. This necessitates the development and maintenance of multiple target-specific standard operating protocols (SOPs), which can be significantly time-consuming both for the National Animal Health Laboratory Network (NAHLN) laboratories developing these protocols and for the member laboratories participating in testing services.
Genetic sequencing-based techniques have proven their importance in human [20] and veterinary diagnostics [37]. The decreasing hardware and testing costs seen over recent years make these techniques affordable for diagnostic applications. Specifically, shotgun metagenome sequencing aims to sequence the total genetic material from all sources in a clinical sample (i.e., from the host, pathogens, commensals, environmental components, etc.) without introducing bias [32]. Advances in data analytics and machine learning have had a significant impact on genome sequencing across multiple domains. Machine learning models that rapidly analyze large volumes of metagenome sequence data can immensely advance the field of animal disease diagnostics and help detect known and emerging pathogens in a single test.
Machine learning tasks on large-scale graphs/networks are continuously proving their importance in domains such as social science [3], biochemistry [41], and biology [4]. One such crucial task is extracting node and graph features, commonly known as network embeddings, from the complex and unstructured organization of graph topologies. Network embeddings can be obtained in multiple ways, ranging from graph kernels to feature extraction. Graphs or graph vertices mapped to a low-dimensional feature vector space are utilized for machine learning applications such as classification and clustering. Deep neural networks have shown a tremendous capacity to capture complex patterns in high-dimensional data, particularly in images
[21], text [9], and even hardware security [30]. There have been emerging applications that learn robust representations from structured data such as graphs [31, 12]. We aim to build on these advances in a unified framework for genome sequence classification in animal diagnostics, particularly for detecting a single bacterial pathogen (Mannheimia haemolytica) associated with the Bovine Respiratory Disease Complex (BRDC) in bovine (cattle) metagenomes. Identifying this infection is particularly important because it migrates to the lower respiratory tract, can compromise the immunity of the infected animal (as pneumonia), and can spread rapidly among crowded groups [18]. Hence, diagnosing BRDC as early as possible plays a major role in mitigating losses. In this work, we represent genome sequences as De Bruijn graphs; experiment with multiple network representation learning techniques, including several graph kernels [35], node2vec [13], and graph2vec [27], to obtain representational features from the constructed De Bruijn graphs; and use classic machine learning and deep learning methods to identify pathogens in animal genome sequences. In particular, we use the Bovine Respiratory Disease Complex (BRDC) to validate the use of machine-learning-based animal diagnostics. In this initial phase of our research, we model the single-pathogen detection problem as a classification task using existing graph-based machine learning approaches. Our overall approach is shown in Figure 1. We construct De Bruijn graphs (Section III-B) to create a standard, structured representation of genome sequences. We then build vector representations of the graphs through various existing embedding models (Section IV-A). The graph embedding is then used to distinguish between pathogen and host genome sequences using a deep neural network (Section IV), trained in a multi-task learning [6] setting.
Our contributions are threefold: (i) we present the first large-scale, annotated genome sequence dataset for the diagnosis of a bacterial pathogen from bovine metagenome sequences, (ii) we show that De Bruijn graphs can be extended to the diagnosis task using network embeddings and deep neural networks, and (iii) to the best of our knowledge, our work obtains the state of the art in utilizing graph-based representations to classify pathogens from very large metagenome sequences.
II Related Work
II-A Genome Sequencing with Machine Learning
Diagnosis of pathogen genome sequences within larger animal metagenomes has traditionally been tackled through bioinformatics approaches such as k-mer frequency-based features [19]. The k-mer representation is a compositional feature representation in bioinformatics that captures the frequency of k-length subsequences within a larger genome sequence. Such representations have been shown to be effective in several metagenome diagnostic tasks such as chromatin accessibility prediction [25] and bacteria detection [8, 11], but have not had much success on longer sequence reads. Deep learning and machine learning models have provided automatic, higher-order feature learning beyond the predefined motif lengths of k-mers
. Convolutional neural networks (CNNs), primarily used in computer vision [21], have shown great success in genome sequence prediction and classification by capturing powerful, hierarchical feature representations [28]. Sequence-based models such as Recurrent Neural Networks and Long Short-Term Memory (LSTM) networks [15] have also been successfully used to capture long-term dependencies in genome sequences [25].

II-B Genome Sequencing with Networks
Graphs, or networks, have been widely used structures for studying genomics across a variety of problems. Examples include genomic subcompartment prediction [2], disease gene prediction [16], and genome assembly [22], to name a few. The most common approach to represent genome sequences as networks is through k-mers, where a long genome sequence is broken into shorter, k-length sketches [23]. Over the years, the bioinformatics research community has introduced a variety of k-mer-based graph structures to study genome sequences. These include De Bruijn graphs [29] and string graphs [26], which merge repeating genome patterns into a single node in the network; linked De Bruijn graphs [38], which include genome sequence metadata to store connectivity information; and De Bruijn graph variants such as pufferfish [1], introduced for efficient query processing.
III Preliminaries
TABLE I: Statistics of the De Bruijn graphs constructed from the two datasets (values missing in the source are shown as —).

Dataset                           DS500 (k=6)   DS5000 (k=3)
# of positive/negative samples    500           5000
Total samples/graphs              1000          10000
k                                 6             3
Avg. no. of nodes                 —             —
Avg. no. of edges                 —             —
# of unique node labels           4089          64
Avg. clustering coeff.            —             —
III-A Dataset Description
In all our experiments, we use metagenomes simulated from published reference genomes of the bovine host and its pathogen. Reference genomes were downloaded from the NCBI nucleotide database. Base-by-base simulation of Illumina (Illumina Inc., San Diego, CA, USA) reads was performed with the ART simulator, which simulates user-defined quality score distributions and error rates for the first and second Illumina sequencing reads. The sequence output, generally referred to as reads, was simulated with a minimum Q-score (quality score) of 28 and a maximum Q-score of 35. ART follows empirical error rate calculations based on the Q-score used for simulation. Following the simulation, a decreasing ratio of pathogen (Mannheimia haemolytica) genomic reads was added to 5,000,000 reads of the bovine genome to simulate large metagenome datasets from bovine lung samples.
To keep our experiments simple, we prepare two randomly sampled datasets with an equal distribution of host and pathogen reads (the class variable) from this large collection of metagenome sequences: a small dataset with 500 samples of each class and a large dataset with 5,000 samples of each class. The dataset samples used in our experiments are available to the public upon request.
III-B k-mers and De Bruijn Graphs
A metagenome sequence of length L can be represented as S = s1 s2 ... sL, where each si is a nucleotide drawn from one of four characters: A, C, G, or T. These complex biological sequences must be converted into numeric features for many machine learning tasks. We use k-mers to break each metagenome sequence of length L into small subsequences of length k. The series of these small subsequences in each metagenome sequence is often treated as words in a sentence and is used to extract numeric features from the semantic representations in the sequence. Our aim in this work is to efficiently detect pathogen sequences with minimal algorithm run time from a large corpus of genome data. As we increase the value of k, the complexity of extracting subsequences and computing feature representations increases, as described in the following sections. Thus we experiment with only small k values on our two data samples: for the small dataset (DS500) we use k=6, and for the large dataset (DS5000) we use k=3. We also report machine learning model performance for multiple values of k in Section V.
We use a classic method to assemble the series of k-mer subsequences into a graph/network structure called a De Bruijn graph. We convert each series of metagenome subsequences into a De Bruijn graph G = (V, E), where V is the set of nodes/vertices and E is the set of edges. Each vertex represents a k-mer, and we label each vertex by its k-mer. Each edge (vi, vj) represents an ordered pair of k-mers vi and vj that appear consecutively in the sequence, i.e., the last k−1 characters of vi overlap the first k−1 characters of vj. A small toy example converting a very short metagenome sequence into a De Bruijn graph is given in Figure 2. Unlike in this toy example, the number of nodes in De Bruijn graphs grows exponentially in real-world data as the k value increases. In Table I we give basic statistics of the De Bruijn graphs constructed from our two data samples (DS500 and DS5000). As reported in Table I, the number of nodes in a graph and the total number of unique nodes (node labels) increase exponentially with k, and consequently the amount of clustering varies significantly, as shown by the clustering coefficient.
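As a minimal illustration of this construction, the sketch below builds a De Bruijn graph as an adjacency dictionary from a single sequence. The toy sequence and the plain dict-of-sets representation are illustrative choices, not the exact data structures used in our pipeline.

```python
from collections import defaultdict

def debruijn_graph(sequence, k):
    """Build a De Bruijn graph from one sequence.

    Nodes are k-mers; a directed edge joins each pair of consecutive
    k-mers, which by construction overlap in k-1 characters.
    Repeated k-mers collapse into a single node.
    """
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    edges = defaultdict(set)
    for left, right in zip(kmers, kmers[1:]):
        edges[left].add(right)
        edges.setdefault(right, set())  # ensure sink k-mers also appear as nodes
    return dict(edges)

g = debruijn_graph("ACGTACG", 3)
# the repeated k-mer "ACG" collapses, so the graph has 4 nodes, not 5
```

Note how the repeated occurrence of "ACG" merges into one node, which is exactly why node counts in Table I stay far below the raw k-mer counts.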
IV Methodology
IV-A Graph Embedding
Network/graph embedding can be of two types: mapping the vertices of a graph to a lower-dimensional feature space [13, 14], or mapping the given graph itself into a vector space [27, 40]. In this work, we consider the latter. Given a set of graphs G = {G1, ..., Gn}, where each graph Gi = (Vi, Ei) is a set of vertices and edges, graph embedding is a learning function that translates a graph into a low-dimensional feature vector of size d. The learning function uses the graph topology, learns the organization of nodes and subgraphs, and embeds the learned information via a statistical or deep learning approach. The learning objective of these feature representations is to cluster topologically similar graphs together. In this work, we use a set of labeled De Bruijn graphs to learn a graph embedding for each De Bruijn graph. We exploit multiple graph embedding techniques to capture the best feature representation of our De Bruijn graphs for the subsequent binary classification task. In particular, we use multiple graph kernel approaches and unsupervised network embedding approaches.
IV-A1 Graph Kernels
Graph kernels [39] are early forms of obtaining graph features based on graph substructures such as shortest paths [5], random walks [17], graphlets [34], and isomorphism tests [33]. The main objective of graph kernels is to measure the similarity between all pairs of graphs in the corpus based on the substructures mentioned above, which results in an n × n feature matrix given that there are n graphs in the corpus.
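To make the idea concrete, here is a toy shortest-path-style kernel, assuming graphs are given as adjacency dictionaries: it compares histograms of BFS shortest-path lengths with a dot product. This is a simplified sketch of the kernel family in [5], not the exact formulation used in our experiments.

```python
from collections import Counter, deque

def sp_length_histogram(adj):
    """Histogram of shortest-path lengths over all ordered node pairs,
    computed with a BFS from every node (unweighted, undirected graph)."""
    hist = Counter()
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        hist.update(d for d in dist.values() if d > 0)
    return hist

def shortest_path_kernel(adj_a, adj_b):
    """Kernel value = dot product of the two path-length histograms."""
    ha, hb = sp_length_histogram(adj_a), sp_length_histogram(adj_b)
    return sum(count * hb[length] for length, count in ha.items())
```

Evaluating this kernel over all graph pairs yields the n × n similarity matrix described above; in practice, libraries such as GraKeL [35] provide optimized implementations.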
IV-A2 Unsupervised Representation Learning
Unlike graph kernels, unsupervised network representation learning methods learn the feature space directly from the given graphs and their structural organization. We use two existing methodologies to obtain network embeddings.

node2vec [13] is a local, node-level embedding model that learns low-dimensional representations of each node in a graph by optimizing a neighborhood-preserving objective function. It uses repeated random walks to gather neighborhood details and learn contextualized node representations that preserve structural equivalence and homophily. For a De Bruijn graph, we use node2vec to construct local, node-level representations. We obtain a global, graph-level representation of the De Bruijn graph by averaging the representations of all its nodes.
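The biased walks underlying node2vec can be sketched as below; the adjacency-dict representation and the fixed seed are illustrative assumptions. A full implementation would feed many such walks into a word2vec-style model, then average the learned node vectors into a graph vector as described above.

```python
import random

def node2vec_walk(adj, start, length, p=1.0, q=1.0, seed=0):
    """One biased second-order random walk in the style of node2vec.

    Stepping back to the previous node has weight 1/p; moving to a
    neighbour of the previous node has weight 1; moving further away
    has weight 1/q. Setting p = q = 1 recovers a uniform random walk.
    """
    rng = random.Random(seed)
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        neighbors = sorted(adj[cur])
        if not neighbors:
            break  # dead end: stop the walk early
        if len(walk) == 1:
            walk.append(rng.choice(neighbors))
            continue
        prev = walk[-2]
        weights = [
            1.0 / p if nxt == prev else 1.0 if nxt in adj[prev] else 1.0 / q
            for nxt in neighbors
        ]
        walk.append(rng.choices(neighbors, weights=weights)[0])
    return walk
```

Tuning p and q trades off breadth-first-like exploration (capturing structural equivalence) against depth-first-like exploration (capturing homophily).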

graph2vec [27] is a transductive approach to constructing graph embeddings from node labels given by the Weisfeiler-Lehman (WL) graph kernel [33] and random walks. graph2vec assigns the node labels produced by the WL graph kernel as initial representations. Similar to node2vec, each node's representation is averaged with the representations of its neighbors. This graph representation is then passed to an LSTM encoder function to obtain the final graph representation.
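A single WL relabeling step, which produces the initial node labels mentioned above, can be sketched as follows (adjacency-dict representation assumed; real WL kernels iterate this step several times):

```python
def wl_refine(adj, labels):
    """One Weisfeiler-Lehman refinement step: each node's new label is a
    compressed id of (its own label, the sorted multiset of its
    neighbours' labels)."""
    signatures = {
        node: (labels[node], tuple(sorted(labels[n] for n in adj[node])))
        for node in adj
    }
    # compress each distinct signature into a small integer label
    compressed = {sig: i for i, sig in enumerate(sorted(set(signatures.values())))}
    return {node: compressed[signatures[node]] for node in adj}
```

After a few iterations, nodes with structurally different neighborhoods receive different labels, which is what makes WL labels useful initial features for graph2vec.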
IV-B Multi-task Deep Networks for Genome Classification
To classify each genome sequence, we use a deep, multi-layer neural network. The proposed neural architecture has three subnetworks: an encoder, a classification network, and a decoder network. The encoder network learns a lower-dimensional feature representation from the d-dimensional input embedding. This representation is called the latent space and is typically trained to ignore noise and model the underlying pattern. The decoder network is used to reconstruct the input embedding from the latent space. Finally, the classification network aims to distinguish between the classes using the latent space. This process is represented in Figure 1, where the entire framework is outlined. Combined, these networks are trained in a multi-task learning setting [6], where the two tasks are classification and input reconstruction, respectively.
IV-B1 The Need for Multi-task Learning
In traditional feedforward neural networks used for classification, the training objective is to learn internal representations that allow robust classification of the input. However, given that genome sequences from the host and pathogen can have highly overlapping k-mers, and hence similar network embeddings, the encoder can overfit to the training distribution due to the highly specific objective function. To overcome this limitation, we propose the use of a decoder network and a classification network trained in tandem with the encoder network. The objective now becomes learning an internal representation that jointly models the underlying distribution for both classification and input reconstruction. The encoder network and the latent space are shared between the classification and reconstruction heads, which reduces the risk of overfitting by providing implicit regularization and reducing the representation bias in the network. Here, representation bias refers to the tendency of neural networks to learn representations so specific to a certain task and its training data that they prevent the model from generalizing to unseen samples. The multi-task loss function for the proposed framework is given by
L = α · L_c + β · L_r,   (1)

where L_c refers to the weighted cross-entropy loss and L_r refers to the reconstruction loss from the decoder head; α and β are modulating factors that balance the loss function between classification performance and reconstruction penalty. The reconstruction loss is an L2 difference between the reconstructed embedding x̂ and the actual input embedding x, given by L_r = (1/d) Σᵢ (x̂ᵢ − xᵢ)², where d is the dimension of the input embedding.
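The combined objective can be sketched in plain Python as follows; the class weight and the default modulating-factor values below are placeholders, since the tuned values are not reproduced here.

```python
import math

def reconstruction_loss(x_hat, x):
    """L2 reconstruction term: mean squared difference between the
    reconstructed embedding x_hat and the input embedding x."""
    return sum((a - b) ** 2 for a, b in zip(x_hat, x)) / len(x)

def weighted_cross_entropy(prob, label, weight=1.0):
    """Binary cross entropy for one sample (class weight is a placeholder)."""
    return -weight * (label * math.log(prob) + (1 - label) * math.log(1 - prob))

def multitask_loss(prob, label, x_hat, x, alpha=1.0, beta=1.0):
    """Weighted sum of the classification and reconstruction terms (Eq. 1)."""
    return (alpha * weighted_cross_entropy(prob, label)
            + beta * reconstruction_loss(x_hat, x))
```

In the full framework these scalar terms are computed per batch by the classification and decoder heads, and gradients flow through the shared encoder from both.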
Note that this is different from merely pretraining the encoder and decoder networks as an autoencoder. While the autoencoder objective is to reconstruct the original input through a compressed representation, there can exist an inherent representation bias that causes the network to produce low-quality reconstructions, and hence a poor latent-space representation of unseen examples. The low inter-class variation in genome sequences can cause the network to learn noisy representations and hence reduce classification accuracy. We highlight the importance of the multi-task learning objective in Section V, with a neural network baseline that is trained only with the classification loss after pretraining as an autoencoder with the L2 reconstruction loss.
IV-B2 Implementation Details
Since the proposed architecture has a complex structure, we provide the implementation details here. The encoder network has five (5) fully connected (dense) layers, and we intersperse each encoder layer with a dropout layer [36]. We reduce the dimensionality of the input at each fully connected (dense) layer. The decoder network (for reconstruction) consists of two densely connected layers that expand the encoded features back to the original dimension. The classification network has two (2) densely connected layers that take the encoded representation as input and produce the genome classification. This is the only part of the network trained in a supervised manner; the encoder and decoder networks are trained in an unsupervised manner. Due to the limited training data and the low inter-class variation, neural networks can overfit to the training data and fail to generalize to variations induced by observation noise. To overcome this, we propose the following training protocol. First, we perform a cold start: for ten epochs, we train the encoder and decoder networks as a traditional autoencoder with a very low learning rate. Then, we freeze the decoder network and train the encoder and classifier branch in a supervised manner. Finally, the entire network is trained end-to-end. The learning rate schedule and the varying objective functions help prevent overfitting and learn robust features for classification.

V Experiments and Results
TABLE II: Average accuracy (%) and standard deviation on DS500. Naive k-mer baseline (K-Means): 47.29.

Embedding Type   LG            SVM           NN            DL
SPK              53.58 ± 0.05  53.1          58.59 ± 0.23  63.80 ± 0.19
WLK              55.87 ± 0.06  53.8          57.81 ± 0.53  61.41 ± 0.38
GSK              58.33 ± 0.07  52.9          56.25 ± 0.51  60.16 ± 0.95
RWK              59.34 ± 0.11  50.8          54.69 ± 0.83  56.41 ± 0.72
node2vec         56.53 ± 0.23  56.8 ± 0.11   63.28 ± 0.96  73.49 ± 0.69
graph2vec        58            52.16         60.19         66.42
In this section we evaluate multiple graph embedding techniques by comparing the proposed deep learning classifier with multiple classification algorithms for pathogen prediction on the datasets described in Section III-A. Along with model performance, we also report results obtained by tuning the parameters of De Bruijn graph generation and the graph embedding algorithms.
V-A Baseline Models
We conduct the following baseline experiments for extracting graph embeddings, against which we compare the models built on network embeddings and the deep learning classifier.
V-A1 Naive Unsupervised Approach
In this simple approach, we use the k-mers extracted from the metagenome sequences themselves, as described in Section III-B, with a simple unsupervised K-Means algorithm (with two clusters) to group host and pathogen metagenome sequences. We first generate the k-mer subsequences (k=6 for DS500 and k=3 for DS5000). For each metagenome sequence, we generate a feature vector of length 4^k, where each component represents the normalized frequency of the corresponding k-mer subsequence. With this feature vector, we use the K-Means algorithm to separate host and pathogen metagenome sequences.
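The naive feature construction can be sketched as follows; the 4**k vector length follows from enumerating every k-mer over the alphabet {A, C, G, T}. Clustering these vectors with K-Means would complete the baseline.

```python
from itertools import product

def kmer_frequency_vector(sequence, k):
    """Normalized frequency of every possible k-mer over {A, C, G, T}.

    Returns a vector with 4**k components, one per possible k-mer,
    matching the naive baseline features described above.
    """
    # fixed ordering: AA..A, AA..C, ..., TT..T
    index = {"".join(chars): i for i, chars in enumerate(product("ACGT", repeat=k))}
    counts = [0] * (4 ** k)
    total = len(sequence) - k + 1
    for i in range(total):
        counts[index[sequence[i:i + k]]] += 1
    return [c / total for c in counts]
```

Note that the vector grows exponentially in k (4096 components at k=6), which is one reason we restrict our experiments to small k values.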
V-A2 Graph Kernels
We use the following four types of graph kernels to compute pairwise graph similarities.

Shortest Path kernel (SPK) [5]: similarity is determined by comparing the lengths of shortest paths between nodes in the two graphs.

Weisfeiler-Lehman kernel (WLK) [33]: graph similarity is calculated based on graph isomorphism tests.

Graphlet Sampling kernel (GSK) [34]: based on the number of shared subgraph structures (graphlets) in the two graphs.

Random Walk kernel (RWK) [17]: determines the similarity of two graphs by the number of matching random walks on the two graphs.

Each graph kernel generates an n × n feature matrix.
V-A3 Unsupervised Graph Representations
For both node2vec and graph2vec, we use the same output embedding size, number of walks per node, and walk length, along with each model's own parameters (for node2vec, the return parameter p and the in-out parameter q).
V-B Experiment Setup
Along with the performance of our deep learning classifier, we use the following vector space models to compare classification performance. We use 10-fold cross-validation for all datasets and embedding types, run 10 iterations of each cross-validation test, and report the average accuracy and standard deviation.
V-B1 Logistic Regression
We use simple Logistic Regression with regularization parameter C, the lbfgs solver, and an l2 penalty.

V-B2 Support Vector Machine
We use the C-SVM classifier from LIBSVM [7]. We used grid search and cross-validation to determine the value of C and the kernel type. For all datasets and embeddings, we use a radial basis function kernel.
V-B3 Neural Networks
We evaluate the proposed model (described in Section IV) and a standard neural network baseline, denoted DL and NN, respectively. The NN baseline has the same network structure as DL but does not use the multi-task training objective: it is trained with the weighted cross-entropy classification loss after pretraining as an autoencoder.
V-C Hyperparameter Selection
To configure the learning rate for the DL and NN models, we performed a sensitivity analysis using a grid search over learning rates on a log scale, to identify the order of magnitude at which the best learning rate could be found. The process was repeated for the two datasets since they use different batch sizes (smaller for DS500, larger for DS5000). The dropout probability and the modulating factors α and β (from Equation 1) were set after manual tuning over a small range of values. We find that using α and β to modulate the final loss allows faster convergence during training.
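A log-scale grid of the kind described can be generated as below; the exponent bounds are illustrative placeholders rather than the exact range used in our experiments.

```python
def learning_rate_grid(low_exp, high_exp):
    """Candidate learning rates on a log scale, one per order of magnitude.

    The bounds are hypothetical: e.g. (-5, -2) spans 1e-5 through 1e-2.
    """
    return [10.0 ** e for e in range(low_exp, high_exp + 1)]

candidates = learning_rate_grid(-5, -2)
```

Each candidate is then used to train the model for a few epochs, and the order of magnitude with the best validation accuracy is refined further.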
TABLE III: Average accuracy (%) and standard deviation on DS5000. Naive k-mer baseline (K-Means): 32.14. (RWK did not complete on this dataset.)

Embedding Type   LG            SVM           NN            DL
SPK              60.25 ± 0.14  61.27         57.23 ± 0.37  67.9 ± 0.29
WLK              62.4 ± 0.09   61.11         57.03 ± 0.28  61.6 ± 0.89
GSK              64.76 ± 0.12  64.83         60.93 ± 0.16  62.81 ± 0.76
RWK              —             —             —             —
node2vec         78.84 ± 0.14  83.7          81.78 ± 0.62  89.74 ± 0.60
graph2vec        57.45 ± 0.01  54.15         55.83         59.62
V-D Quantitative Evaluation
In Tables II and III we give the average accuracy (in percent) and standard deviation for DS500 and DS5000, respectively. We report the standard deviation only when it is significant. We report results for each graph embedding model (SPK, WLK, RWK, GSK, node2vec, and graph2vec) with each classifier (LG, SVM, NN, and DL), as discussed in the previous section, over 10 iterations of 10-fold cross-validation. Throughout our experiments, we compare the results of our small data sample (DS500) and large data sample (DS5000).
We outperform the naive k-mer frequency-based feature representation by a large margin on both datasets. The proposed deep neural network with the multi-task training objective outperforms the other baselines, including a similar neural network without the multi-task objective. The competitive performance of the logistic regression and SVM models indicates that the learned embeddings (constructed from the De Bruijn graphs) are a good feature representation of the genome sequences.
Among the various embedding approaches, node2vec outperforms all others by a large margin on both datasets, providing substantial gains over the naive k-mer frequency features and over the closely related graph2vec embedding on the DS500 dataset. Similar gains can be seen on the DS5000 dataset. Among the graph kernels, the shortest path kernel (SPK) embedding provides competitive performance, and the graphlet sampling kernel (GSK) is more resilient across the different classifiers.
V-E Ablative Studies
We also evaluate different variations of our approach to test the effectiveness of each component of the framework. Namely, we evaluate two factors: (i) the effect of the subsequence length, and (ii) the effect of multi-task learning.
V-E1 Length of Subsequence (k)
First, we vary the subsequence length k used to construct the De Bruijn graphs and evaluate the performance of our full model (DL) on both the DS500 and DS5000 datasets. We use two of the best-performing embedding models (node2vec and the shortest path kernel) and summarize the results in Figure 3. We find that node2vec outperforms SPK at all values of k. Interestingly, increasing k has a consistently detrimental effect on both embedding methods across the two datasets. This can arguably be attributed to the fact that the number of unique nodes increases with k, making the variability more pronounced. The node2vec embedding is the most affected by the change in subsequence length, especially on the DS500 dataset, where the accuracy drops substantially. The SPK embedding provides more consistent performance across subsequence lengths, which can be attributed to fewer tottering walks [5] in the De Bruijn graphs constructed from the genome sequences.
V-E2 Effect of Training Objective
We also perform ablations on the proposed deep learning model. Specifically, we keep the network structure static and remove the multi-task learning objective, instead training the network in a single-task classification setting, as in traditional neural network applications. As can be seen from Tables II and III, the multi-task objective provides consistent improvements on both datasets and across most embedding methods, indicating its tendency to prevent overfitting on smaller datasets, for which it was designed. Additionally, we find that training the network without a decoder network reduces the accuracy of the model on both the DS500 and DS5000 datasets.
VI Conclusion and Future Work
Detection of pathogens such as Mannheimia haemolytica is of great importance in animal diagnostics, particularly for BRDC. In this work we showed that machine learning models equipped with network embeddings and a deep learning classifier can help distinguish pathogen metagenome sequences from host metagenome sequences. Our experiments demonstrated that the adopted machine learning approach predicts pathogen metagenome sequences with a large accuracy margin over the baseline models.
This work was conducted entirely on small and large samples of simulated genomic data to validate the role of machine learning in veterinary medicine. For simplicity, we generated the simulated genome sequences to contain only a single pathogen alongside the host, with balanced numbers of host and pathogen sequences. Real-world data are quite the opposite: real animal genome sequences contain limited pathogen sequences, and the data are heavily unbalanced. Given the promising performance obtained here, there are multiple directions for future work. An end-to-end framework specifically for pathogen detection in animal genome sequences can be developed; such a framework could use ensemble learning over multiple models to obtain the best prediction accuracy. More importantly, models that learn from unbalanced data distributions to predict multiple pathogen sequences can be proposed. Finally, pathogenicity is relative to the host: the pathogen used in this work, Mannheimia haemolytica, is not necessarily a pathogen for other hosts. This relative mapping between host and pathogen can be incorporated into future prediction models.
Acknowledgment
This research was supported in part by the US Department of Agriculture (USDA) grants AP20VSD and B000C011.
References
[1] (2018) A space- and time-efficient index for the compacted colored de Bruijn graph. Bioinformatics 34 (13), pp. i169–i177. Cited by: §II-B.

[2] (2020) Graph embedding and unsupervised learning predict genomic subcompartments from Hi-C chromatin interaction data. Nature Communications 11 (1), pp. 1–11. Cited by: §II-B.

[3] (2019) Examining untempered social media: analyzing cascades of polarized conversations. In Proceedings of the 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, pp. 625–632. Cited by: §I.
 [4] (2020) A genetic model of the connectome. Neuron 105 (3), pp. 435–445. Cited by: §I.
 [5] (2005) Shortestpath kernels on graphs. In Fifth IEEE international conference on data mining (ICDM’05), pp. 8–pp. Cited by: §IVA1, 1st item, §VE1.
 [6] (1997) Multitask learning. Machine learning 28 (1), pp. 41–75. Cited by: §I, §IVB.

[7] (2011) LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST) 2 (3), pp. 1–27. Cited by: §V-B2.

[8] (2017) PaPrBaG: a machine learning approach for the detection of novel pathogens from NGS data. Scientific Reports 7 (1), pp. 1–13. Cited by: §II-A.
 [9] (2018) Bert: pretraining of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §I.
 [10] (2000) Multiplex pcr: optimization and application in diagnostic virology. Clinical microbiology reviews 13 (4), pp. 559–570. Cited by: §I.
 [11] (2018) Deep learning models for bacteria taxonomic classification of metagenomic data. BMC bioinformatics 19 (7), pp. 198. Cited by: §IIA.
 [12] (2017) Protein interface prediction using graph convolutional networks. In Advances in Neural Information Processing Systems, pp. 6530–6539. Cited by: §I.
 [13] (2016) Node2vec: scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 855–864. Cited by: §I, 1st item, §IVA.
 [14] (2017) Inductive representation learning on large graphs. In Advances in neural information processing systems, pp. 1024–1034. Cited by: §IVA.
 [15] (1997) Long shortterm memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §IIA.
 [16] (2019) HumanNet v2: human gene networks for disease research. Nucleic acids research 47 (D1), pp. D573–D580. Cited by: §IIB.
 [17] (2012) Fast random walk graph kernel. In Proceedings of the 2012 SIAM international conference on data mining, pp. 828–838. Cited by: §IVA1, 4th item.
 [18] (2014) Characterization of mannheimia haemolytica isolated from feedlot cattle that were healthy or treated for bovine respiratory disease. Canadian Journal of Veterinary Research 78 (1), pp. 38–45. Cited by: §I.
 [19] (2017) Canu: scalable and accurate longread assembly via adaptive kmer weighting and repeat separation. Genome research 27 (5), pp. 722–736. Cited by: §IIA.
 [20] (2012) Routine use of microbial whole genome sequencing in diagnostic and public health microbiology. PLoS pathogens 8 (8), pp. e1002824. Cited by: §I.
 [21] (2012) Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105. Cited by: §I, §IIA.
 [22] (2016) Assembly of long errorprone reads using de bruijn graphs. Proceedings of the National Academy of Sciences 113 (52), pp. E8396–E8405. Cited by: §IIB.
 [23] (2018) Asymptotically optimal minimizers schemes. Bioinformatics 34 (13), pp. i13–i22. Cited by: §IIB.
 [24] (2012) Realtime pcr as a diagnostic tool for bacterial diseases. Expert review of molecular diagnostics 12 (7), pp. 731–754. Cited by: §I.
 [25] (2017) Chromatin accessibility prediction via convolutional long shortterm memory networks with kmer embedding. Bioinformatics 33 (14), pp. i92–i101. Cited by: §IIA.
 [26] (2005) The fragment assembly string graph. Bioinformatics 21 (suppl_2), pp. ii79–ii85. Cited by: §IIB.

[27] (2017) graph2vec: learning distributed representations of graphs. arXiv preprint arXiv:1707.05005. Cited by: §I, §IV-A.

[28] (2017) Deep learning for metagenomic data: using 2D embeddings and convolutional neural networks. arXiv preprint arXiv:1712.00244. Cited by: §II-A.
 [29] (2001) An eulerian path approach to dna fragment assembly. Proceedings of the national academy of sciences 98 (17), pp. 9748–9753. Cited by: §IIB.
 [30] (2019) Latent space modeling for cloning encrypted pufbased authentication. In IFIP International Internet of Things Conference, pp. 142–158. Cited by: §I.
 [31] (2008) The graph neural network model. IEEE Transactions on Neural Networks 20 (1), pp. 61–80. Cited by: §I.
 [32] (2014) An introduction to the analysis of shotgun metagenomic data. Frontiers in Plant Science 5, pp. 209. Cited by: §I.
 [33] (2011) Weisfeilerlehman graph kernels.. Journal of Machine Learning Research 12 (9). Cited by: 2nd item, §IVA1, 2nd item.
 [34] (2009) Efficient graphlet kernels for large graph comparison. In Artificial Intelligence and Statistics, pp. 488–495. Cited by: §IVA1, 3rd item.
 [35] (2020) GraKeL: a graph kernel library in python.. Journal of Machine Learning Research 21 (54), pp. 1–5. Cited by: §I.
 [36] (2014) Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958. Cited by: §IVB2.
 [37] (2016) A bacterial analysis platform: an integrated system for analysing bacterial whole genome sequencing data for clinical diagnostics and surveillance. PloS one 11 (6), pp. e0157718. Cited by: §I.
 [38] (2018) Integrating longrange connectivity information into de bruijn graphs. Bioinformatics 34 (15), pp. 2556–2565. Cited by: §IIB.
 [39] (2010) Graph kernels. The Journal of Machine Learning Research 11, pp. 1201–1242. Cited by: §IVA1.
 [40] (2018) An endtoend deep learning architecture for graph classification. In ThirtySecond AAAI Conference on Artificial Intelligence, Cited by: §IVA.
 [41] (2017) Networkbased machine learning and graph theory algorithms for precision oncology. NPJ precision oncology 1 (1), pp. 1–15. Cited by: §I.