Deep Multiple Instance Learning for Taxonomic Classification of Metagenomic Read Sets

09/28/2019 ∙ by Andreas Georgiou, et al. ∙ ETH Zurich

Metagenomic studies increasingly utilize sequencing technologies to analyze DNA fragments found in environmental samples. Such studies can provide useful insights into the interactions between hosts and microbes, infectious disease proliferation, and novel species discovery. One important step in this analysis is the taxonomic classification of those DNA fragments. Of particular interest is the determination of the distribution of the taxa of microbes in metagenomic samples. Recent attempts using deep learning focus on architectures that classify single DNA reads independently from each other. In this work, we attempt to solve the task of directly predicting the distribution over the taxa of whole metagenomic read sets. We formulate this task as a Multiple Instance Learning (MIL) problem. We extend architectures used in single-read taxonomic classification with two different types of permutation-invariant MIL pooling layers: a) deepsets and b) attention-based pooling. We illustrate that our architecture can exploit the co-occurrence of species in metagenomic read sets and outperforms the single-read architectures in predicting the distribution over the taxa at higher taxonomic ranks.


1 Introduction

Over the last decades, advancements in sequencing technology have led to a rapid decrease in the cost of genome sequencing [35], while the amount of sequencing data being generated has vastly increased. This is attributable to the fact that genome sequencing is a tool of utmost importance for a variety of fields, such as biology and medicine, where it is used to identify changes in genes or aid in the discovery of potential drugs [23, 28]. Metagenomics is a subfield of biology concerned with the study of genetic material found in samples taken directly from the environment [8, 15]. DNA fragments found in those samples can be sequenced using various sequencing technologies, such as Illumina, PacBio, and Oxford Nanopore [29]. This process results in substrings sampled from random positions in the genomes of the organisms, called DNA reads. The reads obtained from sequencing are noisy, meaning that some of the letters (called base pairs) are flipped to a different letter or, in some cases, additional base pairs are inserted or deleted. The error rate and the distribution of the noise depend on the technology used to sequence the DNA fragments [29]. Newer long-read technologies can sequence complete genomes of viruses and small bacteria, but with a higher error rate [18].

As an application of metagenomic sequencing, samples can be taken from the human intestine in order to characterize the microbial flora of the human gut [23, 28]. Significant efforts have been carried out by projects such as the Human Microbiome Project (HMP) [23] and the Metagenomics of the Human Intestinal Tract (MetaHIT) project [28] in order to understand how the human microbiome can have an effect on human health. An important step in this process is to classify DNA fragments into various groups at different taxonomic ranks. The NCBI Taxonomy maintains a tree ontology of taxonomic labels [36]. Organisms are assigned taxonomic labels and thus are placed on the tree. Each level of the tree represents a different taxonomic rank, with finer ranks such as species and genus being close to the leaf nodes and coarser ranks such as phylum and class closer to the root.

One approach that has shown great promise for biological classification tasks is deep learning. In recent years, we have seen various attempts of using deep learning to solve tasks such as variant calling [27] or the discovery of DNA-binding motifs [39]. These methods even outperform more classical approaches, despite the relative lack of biological prior knowledge incorporated into those models.

We consider the problem of metagenomic classification, where each individual read is assigned to a label or multiple labels corresponding to its taxon at each taxonomic rank. One could simply identify the taxon at the finest level of the taxonomy and then extract the taxa at all levels of the tree above it by following the path to the root. The problem with this approach is that for certain reads, we might not be able to accurately identify the species of the host organism, but nevertheless be interested in coarser taxonomic ranks. This can apply in cases where little relevant reference data is available for a sequencing dataset (such as deep sea metagenomics data [33] or New York City metagenomics, where only a fraction of the samples matched a known species [1]), so a more accurate prediction at higher taxonomic ranks may be more informative for downstream analysis [30]. Furthermore, in many cases we are only interested in the distribution of organisms in an environmental sample, also known as the microbiota, rather than in the classification of individual fragments.
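
To make the path-to-root lookup concrete, the following is a minimal sketch (not the paper's code) of recovering all ancestor taxa of a predicted species by walking parent pointers up a taxonomy tree; the dictionaries and IDs are illustrative toy data in the style of NCBI taxon IDs.

```python
def lineage(taxon_id, parent, rank):
    """Return {rank: taxon_id} for a taxon and all of its ancestors."""
    out = {}
    while taxon_id is not None:
        out[rank[taxon_id]] = taxon_id
        taxon_id = parent.get(taxon_id)  # becomes None once past the root
    return out

# Toy lineage: species -> genus -> family -> order -> phylum
parent = {562: 561, 561: 543, 543: 91347, 91347: 1224, 1224: None}
rank = {562: "species", 561: "genus", 543: "family",
        91347: "order", 1224: "phylum"}
print(lineage(562, parent, rank))
# {'species': 562, 'genus': 561, 'family': 543, 'order': 91347, 'phylum': 1224}
```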

We formulate this task as an instance of Multiple Instance Learning (MIL). MIL is a specific framework of supervised learning approaches. In contrast to the traditional supervised learning task, where the goal is to predict a value or class for each sample, in MIL, given a set of samples, the goal is to assign a value to the whole set. A set of items is called a bag, whereas each individual item in the bag is called an instance. In other words, a bag of instances is considered to be one data point [10]. More formally, a bag is a function $B: \mathcal{X} \to \mathbb{N}$, where $\mathcal{X}$ is the space of instances. Given an instance $x \in \mathcal{X}$, $B(x)$ counts the number of occurrences of $x$ in the bag $B$. Let $\mathcal{B}$ be the class of such bag functions. Then the goal of a MIL model is to learn a bag-level concept $c: \mathcal{B} \to \mathcal{Y}$, where $\mathcal{Y}$ is the space of our target variable.

In the context of metagenomic classification, we consider the instances to be DNA reads. Our goal is to directly predict the distribution over a given set of taxa in the read set (the bag). So for each taxon, our output is a real number in $[0, 1]$ denoting the portion of the reads in the read set that originated from that particular taxon. The motivation for this is that in a realistic set of reads, closely related organisms tend to appear together. It might thus be possible to exploit the co-occurrence of organisms to gain better accuracy [6].
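
As a concrete illustration, a bag-level target can be computed as the empirical distribution of per-read taxon labels. This is a minimal sketch under our own naming (bag_target and the toy labels are hypothetical, not from the paper's repository):

```python
from collections import Counter

def bag_target(read_labels, all_taxa):
    """Empirical distribution over taxa for one bag of reads."""
    counts = Counter(read_labels)
    n = len(read_labels)
    return [counts[t] / n for t in all_taxa]

reads = ["ecoli", "ecoli", "bsubtilis", "ecoli"]             # toy per-read labels
print(bag_target(reads, ["ecoli", "bsubtilis", "saureus"]))  # [0.75, 0.25, 0.0]
```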

Our main contributions are:

  • A new method to generate synthetic read sets with realistic co-occurrence patterns from collections of reference genomes.

  • A novel machine learning model for predicting the distribution over taxa in a read set, combining state-of-the-art deep DNA classification models with read-set-level aggregation in a multiple instance learning setting.

  • A thorough empirical assessment of our proposed model, showing superior performance in predicting the distributions of higher level taxa from read sets.

In the rest of this paper, we give an overview of previous related work in Section 2, describe our data generation method and machine learning models in Section 3 and analyse the results of our experiments in Section 4. An overview of our proposed architectures is depicted in Figure 1.

(a) GeNet + MIL pooling
(b) EmbedPool + MIL pooling
Figure 1: The two proposed architectures for solving the MIL task. The models can process multiple reads (only two reads shown for compactness) independently from each other. During the MIL pooling phase, the outputs for each read are combined to create a representation for the whole read set. Subsequently, the model can use this to directly predict the distribution over the taxa.

2 Related Work

To solve the problem of metagenomic classification, more traditional methods rely on read alignment to classify each DNA fragment. Given a DNA read, one first needs to match its $k$-mers against a large database of reference genomes. This detects candidate segments of the genomes and can be executed quickly by first creating an index of the reference genomes during a preprocessing phase [2, 3, 24]. Following this step, one uses approximate string matching techniques to align the read to the candidate segments determined by the $k$-mer matching step. A well-known and widely used tool that uses alignment is BLAST [2], a general heuristic tool for aligning genomic sequences. Other alignment- and mapping-based tools specifically designed for metagenomics include Centrifuge [19], Kraken [37], MetaPhlAn [32], and MEGAN [16]. These methods trade off sensitivity for scalability. For example, BLAST is highly sensitive but does not scale to databases of unassembled sequencing data, while more approximate methods like Kraken are well suited for such large databases. Moreover, recent deep learning approaches have outperformed these methods by significant margins, especially in high error-rate settings [30, 21].
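
As a toy illustration of the $k$-mer indexing idea (not the implementation of any of the tools above), one can map every $k$-mer of each reference genome to the genomes containing it, then look up a read's $k$-mers to find candidate references:

```python
from collections import defaultdict

def build_index(references, k=5):
    """Map each k-mer to the set of reference genomes containing it."""
    index = defaultdict(set)
    for name, genome in references.items():
        for i in range(len(genome) - k + 1):
            index[genome[i:i + k]].add(name)
    return index

refs = {"genomeA": "ACGTACGTGGCA", "genomeB": "TTGCACGTACGA"}
index = build_index(refs, k=5)
read = "CGTACGT"
candidates = set().union(*(index.get(read[i:i + 5], set())
                           for i in range(len(read) - 4)))
print(candidates)  # both genomes contain some of the read's 5-mers
```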

Most of the previous attempts using machine learning focused on 16S rRNA sequences due to their high sequence conservation across a wide range of species. An example is the RDP (Ribosomal Database Project) classifier, which uses a Naive Bayes classifier to classify 16S rRNA sequences [34]. The disadvantage of this method is the loss of positional information due to the encoding of the sequence as a 'bag' of 8-letter words. Moreover, the generalizability of this model to sequencing data drawn from other genomic regions is unclear. Similarly, [20] used probabilistic topic modeling to classify 16S rRNA sequences at the taxonomic ranks from phylum to family. Another interesting approach is taken by [4], which uses Markov models to classify DNA reads and can even be combined with alignment methods to increase performance. In addition, [5] use a CNN architecture to classify 16S sequences, while other approaches have also proposed recurrent neural networks on sequences [12].

More recent attempts at solving the general metagenomic classification problem focus on using deep learning to tackle it as a supervised classification task. Two examples of such attempts are GeNet [30], which attempts to leverage the hierarchical nature of taxonomic classification, and DeepMicrobes [21], which first learns embeddings of $k$-mers and subsequently uses those to classify each read. We use GeNet and a simplified version of DeepMicrobes as baselines and explain them in more detail in Section 3.

3 Models and Methods

We implemented two deep neural networks for predicting the taxa of individual reads, which we use as baselines: GeNet [30] and a simplified version of DeepMicrobes [21], described in Sections 3.2.1 and 3.2.2 respectively. We refer to these models collectively as single-read models and extend them in order to solve the MIL problem described above.

The full source code is provided online at https://github.com/MetagenomicMIL/MetaSetMIL.

3.1 Dataset generation

For training, validation and evaluation, we use synthetic reads generated from bacterial genomes from the NCBI RefSeq database [36], from which we use a subset of genomes spanning 1,862 species, similar to the dataset used in [30]. We use NCBI's Entrez tool [31] to download the genomes and the taxonomic data. The number of taxa at each taxonomic rank is summarized in Table S1 in Appendix A.

For training the single-read models, we create mini-batches in which the reads are sampled by selecting genomes uniformly at random. Training of the MIL models differs: a batch consists of a small number of bags of reads, with each bag containing reads sampled using a more realistic distribution over the genomes. The procedure is similar to the one used by the CAMISIM simulator [11] and is described in more detail in Section 3.1.1. An example rank-abundance curve for each taxonomic rank generated by this procedure is shown in Figure 2.

Figure 2: Rank-abundance curve for each taxonomic rank. All taxa are sorted by their abundance, which is shown on the $y$-axis.

From the selected genomes, we sample reads to create mini-batches in an iterative procedure similar to the one described in [30]. For the generation of reads, we use the software InSilicoSeq [14]. We create datasets of two types in order to carry out our experiments: 151 bp reads (the default length of InSilicoSeq) either with no errors or with Illumina NovaSeq-type noise. We refer to those two types of datasets as error-free and novaseq, respectively. In our experiments, we train all models on both dataset types. For validation and evaluation we only use datasets of novaseq-type reads in order to determine whether the models are effective at removing noise from the reads and whether it is beneficial to train with noisy reads.

Every bag is supposed to simulate a different microbial community and hence the generation procedure is repeated for each bag. The more realistic bags allow the MIL models to capture the interactions between reads coming from related species and potential overlap between reads originating from the same taxa. The validation and evaluation datasets for both the single-read models and the MIL models use this more realistic approach. Hyperparameter search was also performed for all models (details on the exact parameters can be found in Appendix B).

3.1.1 Sampling a realistic set of reads

In order to sample bags with a more realistic community of bacteria, we use a method similar to [11]. Given the set of all $T$ taxa at a higher rank (e.g., genus or family), we sample abundances $t_1, \dots, t_T$ from a lognormal distribution with parameters $\mu$ and $\sigma$:

$t_i \sim \mathrm{Lognormal}(\mu, \sigma^2), \quad i = 1, \dots, T$ (1)

Then, for a taxon $i$ with $g_i$ genomes associated with it, we choose to include in our microbial community only $k_i$ random genomes, where $k_i$ is sampled from a geometric distribution with parameter $p$ (truncated at $g_i$):

$P(k_i = k) \propto (1 - p)^{k-1}\, p, \quad 1 \le k \le g_i$ (2)

To calculate the abundance of genome $j$ belonging to taxon $i$, $k_i$ random numbers $u_{i1}, \dots, u_{ik_i}$ are sampled from a lognormal distribution as in equation (1). The abundance for the genome is then calculated as its share of the taxon abundance:

$a_{ij} = t_i \cdot \frac{u_{ij}}{\sum_{j'=1}^{k_i} u_{ij'}}$ (3)

All abundances are finally normalized to produce a probability vector over all the genomes in the dataset. When sampling a read set, a genome is selected by sampling from this distribution. Reads are then simulated from the sampled genome using the software package InSilicoSeq.
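
The following is a hedged sketch of this sampling procedure; the parameter values mu, sigma, and p are illustrative placeholders (the exact values follow the CAMISIM-style setup, not this sketch), and the function name is our own:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_community(taxa_to_genomes, mu=1.0, sigma=2.0, p=0.3):
    """Sample normalized genome abundances for one simulated community."""
    abundances = {}
    for taxon, genomes in taxa_to_genomes.items():
        t = rng.lognormal(mu, sigma)                 # eq. (1): taxon abundance
        k = min(rng.geometric(p), len(genomes))      # eq. (2): genomes to keep
        chosen = rng.choice(genomes, size=k, replace=False)
        u = rng.lognormal(mu, sigma, size=k)         # eq. (3): split t among
        for g, w in zip(chosen, u / u.sum()):        # the chosen genomes
            abundances[g] = t * w
    total = sum(abundances.values())                 # final normalization
    return {g: a / total for g, a in abundances.items()}

taxa = {"genusA": ["gA1", "gA2", "gA3"], "genusB": ["gB1", "gB2"]}
print(sample_community(taxa))  # a probability vector over the chosen genomes
```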

3.2 Baseline machine learning models

3.2.1 GeNet

GeNet leverages the hierarchical nature of the taxonomy of species to simultaneously classify DNA reads at all taxonomic ranks [30]. The procedure is similar to the positional embedding described in [13]. Given an input read $x \in \mathcal{V}^L$ of length $L$, an embedding $E_v \in \mathbb{R}^{L \times |\mathcal{V}|}$ of its letters is computed. The vocabulary $\mathcal{V}$, of size $|\mathcal{V}| = 5$, corresponds to the symbols for the four possible nucleotides A, C, T, G, and N (for unknown base pairs in the read). Embeddings of the absolute positions of the letters are also computed to create $E_p \in \mathbb{R}^{L \times |\mathcal{V}|}$. The one-hot representation of the sequence, $E_o$, is added to the other two embeddings to create the matrix $E = E_v + E_p + E_o$. Subsequently, the resulting matrix is passed to a ResNet-like neural network which produces a final low-dimensional representation of the read. The main novelty of the architecture is the final layer used for classification, which comprises multiple softmax layers, one for each taxonomic rank. These layers are connected to each other so that information from higher ranks can be propagated towards the lower ranks. More formally, the output $o_\ell$ of softmax layer $\ell$ can be written as follows:

$o_\ell = \mathrm{softmax}\big(W_\ell\, \mathrm{ReLU}(h) + U_\ell\, o_{\ell - 1}\big)$ (4)

where $W_\ell$ and $U_\ell$ are trainable parameters, $h$ is the output of the ResNet network and $o_{\ell-1}$ is the previous softmax output. $\mathrm{ReLU}$ is the rectified linear unit function. To train the model, an averaged cross-entropy loss over the softmax layers is used.
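
A hedged PyTorch sketch of this chained output head follows (the rank sizes match our dataset's phylum/class/order counts from Table S1, but the hidden dimension and exact wiring are stand-ins consistent with equation (4), not GeNet's published code):

```python
import torch
import torch.nn as nn

class HierarchicalHead(nn.Module):
    """Chained softmax layers: each rank conditions on the previous rank."""
    def __init__(self, hidden_dim, taxa_per_rank):
        super().__init__()
        self.W = nn.ModuleList([nn.Linear(hidden_dim, n) for n in taxa_per_rank])
        # U_l maps the previous rank's distribution to logits for the next rank
        self.U = nn.ModuleList([nn.Linear(m, n, bias=False)
                                for m, n in zip(taxa_per_rank, taxa_per_rank[1:])])

    def forward(self, h):                 # h: ResNet output, (batch, hidden_dim)
        outputs = [torch.softmax(self.W[0](torch.relu(h)), dim=-1)]
        for W, U in zip(list(self.W)[1:], self.U):
            logits = W(torch.relu(h)) + U(outputs[-1])   # eq. (4)
            outputs.append(torch.softmax(logits, dim=-1))
        return outputs                    # one distribution per taxonomic rank

head = HierarchicalHead(128, [37, 77, 167])   # phylum, class, order (Table S1)
outs = head(torch.randn(4, 128))
print([tuple(o.shape) for o in outs])         # [(4, 37), (4, 77), (4, 167)]
```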

3.2.2 EmbedPool

[21] introduce multiple architectures for performing single-read classification, among which the best-performing is DeepMicrobes. It involves embedding $k$-mers into a latent representation, followed by a bidirectional LSTM, a self-attention layer, and a multi-layer perceptron (MLP). Unlike GeNet, this model can only be trained to classify a single taxonomic rank. Due to limited computational resources (the model requires a significant amount of GPU memory because of the very large embedding matrix), we implemented EmbedPool, a simpler version of DeepMicrobes (also described in the original paper), to use as a baseline. In order to classify at multiple taxonomic ranks, one could run multiple instances of the model, each on a different GPU. However, each model would be independent of the others, and they would not take advantage of the hierarchical structure of the taxonomic tree. EmbedPool consists of an embedding layer for $k$-mers, where we choose $k$ small enough for the embedding matrix to fit into GPU memory. Both max- and mean-pooling are performed on the resulting matrix and the results are concatenated to yield a low-dimensional representation of the read. Since the embedding dimension is set to $d$, the concatenation results in a vector of size $2d$. An MLP with one hidden layer subsequently classifies the read. ReLU is used as the activation function. As the authors explain, most of the performance is attributable to the $k$-mer embedding, and therefore the reduction in performance relative to DeepMicrobes is not expected to be significant. The model is trained end-to-end using the cross-entropy loss.
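
A minimal PyTorch sketch of EmbedPool follows; $k$, the embedding dimension, and the hidden size are illustrative (the hidden sizes we searched over are listed in Appendix B), and the read is assumed to be pre-tokenized into integer $k$-mer ids:

```python
import torch
import torch.nn as nn

class EmbedPool(nn.Module):
    """Embed k-mers, max- and mean-pool over positions, classify with an MLP."""
    def __init__(self, k=8, embed_dim=100, hidden=1000, n_classes=1862):
        super().__init__()
        self.embed = nn.Embedding(4 ** k, embed_dim)   # one row per k-mer id
        self.mlp = nn.Sequential(nn.Linear(2 * embed_dim, hidden),
                                 nn.ReLU(),
                                 nn.Linear(hidden, n_classes))

    def forward(self, kmer_ids):              # (batch, n_kmers) integer ids
        e = self.embed(kmer_ids)              # (batch, n_kmers, embed_dim)
        pooled = torch.cat([e.max(dim=1).values, e.mean(dim=1)], dim=-1)
        return self.mlp(pooled)               # logits over the 1862 species

model = EmbedPool()
logits = model(torch.randint(0, 4 ** 8, (2, 144)))  # a 151 bp read has 144 8-mers
print(logits.shape)                                 # torch.Size([2, 1862])
```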

3.3 Proposed multiple instance learning models

3.3.1 GeNet + MIL pooling

A mini-batch of bags of reads is used as input. The first part of GeNet, consisting of the embedding and the ResNet-like neural network, is used to process each read individually. A pooling layer then groups all reads in each bag to create bag-level embeddings. This is also referred to as MIL pooling [10, 6]. The output is passed to the final layers of GeNet in order to output a probability distribution over the taxa at each taxonomic rank. As a loss function, we use the Jensen-Shannon (JS) divergence [22] between the predicted distribution and the actual distribution of the bag.
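
For concreteness, a small sketch of the JS-divergence loss between predicted and target distributions (our own minimal implementation, using natural logarithms so that the maximum value is ln 2; the clamping constant is a numerical-stability detail we chose):

```python
import torch

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between batches of distributions."""
    p, q = p.clamp(min=eps), q.clamp(min=eps)
    m = 0.5 * (p + q)
    kl = lambda a, b: (a * (a / b).log()).sum(dim=-1)   # KL(a || b)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

pred = torch.tensor([[0.7, 0.2, 0.1]])
target = torch.tensor([[0.6, 0.3, 0.1]])
print(js_divergence(pred, target))   # small positive value; 0 iff p == q
```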

Given that a bag is a set, we require that a MIL pooling layer is permutation invariant; that is, permuting the reads of the bag should still produce the same result. To this end, we utilize DeepSets [38], which can be formally described as follows:

$f(X) = \rho\Big(\sum_{x \in X} \phi(x)\Big)$ (5)

In other words, each element of a set is first processed by a function $\phi$. The outputs are all summed together and the result is subsequently transformed by a function $\rho$. [38] proved that all valid functions operating on subsets of countable sets, or on fixed-sized subsets of uncountable sets, can be written in this form. In our case, the inputs are the embedded reads of length $L$ (cf. Section 3.2.1). In addition, we only input bags of fixed size and hence the assumptions of Theorem 2 in [38] are satisfied. $\rho$ is modelled with a small MLP with one hidden layer, while the ResNet part of the network models the function $\phi$.
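
A sketch of this pooling step in PyTorch (dimensions are illustrative; in our GeNet variant $\phi$ is the ResNet encoder, which is replaced here by precomputed read features):

```python
import torch
import torch.nn as nn

class DeepSetsPooling(nn.Module):
    """rho(sum of phi(x)): sum read features over the bag, then transform."""
    def __init__(self, in_dim=128, hidden=256, out_dim=128):
        super().__init__()
        self.rho = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, out_dim))

    def forward(self, phi_x):                  # phi_x: (batch, bag_size, in_dim)
        return self.rho(phi_x.sum(dim=1))      # eq. (5)

pool = DeepSetsPooling()
bag = torch.randn(2, 2048, 128)                # 2 bags of 2048 read features
z = pool(bag)
print(z.shape)                                 # torch.Size([2, 128])
# Permutation invariance: read order does not matter (up to float error).
perm = bag[:, torch.randperm(2048)]
print(torch.allclose(pool(perm), z, atol=1e-4))  # True
```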

As an alternative to DeepSets, we also consider an attention-based pooling layer as seen in [17], motivated by the fact that it allows the model to attend to specific reads originating from each species. In attention-based pooling, the elements of the input set are combined in weighted averages to create a set $Z$, such that $Z$ remains invariant when we permute the elements of the input set. This can be written as follows:

$z = \sum_{k=1}^{K} a_k h_k$ (6)

$a_k = \frac{\exp\{w^\top \tanh(V h_k)\}}{\sum_{j=1}^{K} \exp\{w^\top \tanh(V h_j)\}}$ (7)

where $h_k$ is an element of the input set, and $w$ and $V$ are trainable parameters. The weights $a_k$ are therefore calculated with an MLP with one hidden layer, with a $\tanh$ non-linearity and a softmax activation at the end. [17] also attempt to increase the flexibility of the MIL pooling by introducing a gating mechanism, as shown below:

$a_k = \frac{\exp\{w^\top (\tanh(V h_k) \odot \mathrm{sigm}(U h_k))\}}{\sum_{j=1}^{K} \exp\{w^\top (\tanh(V h_j) \odot \mathrm{sigm}(U h_j))\}}$ (8)

where $U$ is an additional learnable matrix, $\mathrm{sigm}$ is the sigmoid activation function and $\odot$ is the element-wise product. As shown in Appendix B, whether to use the gating mechanism is an additional hyperparameter for our models. Following the attention mechanism, the output is flattened to create a single vector for each bag, which is subsequently processed by GeNet's final layers to output the predicted distributions. The overall architecture can be seen in Figure 1(a).
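
A sketch of the gated attention pooling in PyTorch (sizes are illustrative; n_rows corresponds to the 'attention rows' hyperparameter in Appendix B):

```python
import torch
import torch.nn as nn

class GatedAttentionPooling(nn.Module):
    """Attention over reads with a sigmoid gate, as in eqs. (6)-(8)."""
    def __init__(self, in_dim=128, hidden=256, n_rows=10):
        super().__init__()
        self.V = nn.Linear(in_dim, hidden, bias=False)   # tanh branch
        self.U = nn.Linear(in_dim, hidden, bias=False)   # sigmoid gate, eq. (8)
        self.w = nn.Linear(hidden, n_rows, bias=False)   # one score per row

    def forward(self, h):                     # h: (batch, bag_size, in_dim)
        scores = self.w(torch.tanh(self.V(h)) * torch.sigmoid(self.U(h)))
        a = torch.softmax(scores, dim=1)      # weights over reads, per row
        return torch.einsum("bkr,bkd->brd", a, h)   # (batch, n_rows, in_dim)

pool = GatedAttentionPooling()
z = pool(torch.randn(2, 2048, 128))           # 2 bags of 2048 read features
print(z.shape)                                # torch.Size([2, 10, 128]); flattened
                                              # to one vector per bag downstream
```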

3.3.2 EmbedPool + MIL pooling

Similarly to subsection 3.3.1, we use EmbedPool to process the reads individually. A MIL pooling layer is added after the mean- and max-pooling layers, and its output is fed to the rest of the model to predict the distribution. The JS-divergence is used as the loss function. For MIL pooling, we use DeepSets and attention-based pooling as before. An overview of the model can be seen in Figure 1(b).

4 Results and Discussion

In this section, we analyze the results of the two baselines on the single-read prediction task. Then we evaluate their performance on the proposed MIL task and compare them to our MIL models. Table 1 reports the performance of the models trained on novaseq and error-free reads; in both cases, the models are evaluated on novaseq reads in order to test their robustness to noise.

4.1 Single-read predictions

In [30], GeNet was trained on PacBio reads and on Illumina reads. Since in most cases genome sequencing technologies like Illumina produce shorter reads in the range of 100 bp to 300 bp [29], we chose to train all our models on reads of length 151 bp. In the single-read prediction task, GeNet does not perform well on our evaluation dataset at either the phylum or the species level. We attribute this to the fact that it might be unable to extract useful features shared across the whole genome from such short reads, especially because one-hot encoding is used rather than $k$-mer encoding. On the other hand, even though EmbedPool performed well during training, its accuracy drops substantially when the distribution of the reads in the mini-batch is changed (as is the case with our more realistic evaluation dataset). This signifies that EmbedPool is not able to classify all species equally well. GeNet, however, seems to be more robust to the change of the mini-batch distribution, since its accuracy does not drop when moving from the training dataset to the more realistic evaluation dataset. In addition, training with noisy reads does not seem to improve results for EmbedPool when evaluating on noisy reads. However, training with error-free reads achieves better results for GeNet even when evaluating on novaseq reads. A table with the accuracy achieved by both baselines in the single-read prediction task can be found in Appendix A.

4.2 Read-set-based predictions

(Table 1 layout: rows GeNet [30], EmbedPool [21], GeNet + Deepset (ours), GeNet + Attention (ours), Embedpool + Deepset (ours), Embedpool + Attention (ours); columns Phylum, Family, and Species under novaseq and error-free training. The EmbedPool-based rows are N/A everywhere except the Species columns.)

Table 1: Performance ($1 - \mathrm{JS}/\mathrm{JS}_{\max}$) of all models trained on each dataset (higher is better). Our MIL models achieve superior performance at higher taxonomic ranks, up to the family level. EmbedPool was only trained at the species level, since training time exceeded our cluster limits. See subsection 4.2 for more details.

At each taxonomic rank $\ell$, the upper bound $\mathrm{JS}_{\max,\ell}$ for the JS-divergence differs because of the different numbers of taxa belonging to that rank. Therefore, we normalize our results and use $1 - \mathrm{JS}/\mathrm{JS}_{\max,\ell}$ as the metric for comparison, where a value of 1 means the model achieved perfect performance. Table 1 shows a comparison of our MIL models and the achieved scores. A table of the raw JS-divergence values can be found in Appendix A. For the standard GeNet and EmbedPool, the microbiota distribution was calculated by classifying each read independently, while for the rest, the distribution was predicted directly by the models. An example of the output of the MIL models is shown in Figure 3. All models were evaluated on a total of 100 bags of 2048 novaseq-type reads each. Both GeNet + Deepset and GeNet + Attention perform better than standard GeNet at higher taxonomic ranks. As explained in Section 1, we believe that the improvement in accuracy is owed to the fact that the models can exploit the co-occurrence of species in realistic settings or detect overlaps of reads in a bag. A drawback of our MIL models is that, since their performance relies on the special structure of the bags, it is unlikely that they would perform well when presented with bags with an unrealistic distribution of species (e.g., a bag with a uniformly random distribution over all species). The models therefore make a trade-off between flexibility and performance. Moreover, our proposed MIL models perform poorly at the finer taxonomic ranks, possibly because in the MIL setting the models only observe a summary of the bag rather than a label for each instance, making it harder for them to learn adequate features. However, the better performance at higher ranks can prove beneficial for real-world metagenomic datasets where sufficient reference data is not available to train deep learning models accurately [1, 33]. A comparison of GeNet + Deepset, our best performing model, and standard GeNet can be seen in Figure S1 in Appendix A.
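
For concreteness, a minimal sketch of this normalized score (the rank-dependent bound $\mathrm{JS}_{\max}$ is left as an input, since its value depends on the rank):

```python
def normalized_score(js_value, js_max):
    """1 - JS/JS_max: 1.0 is a perfect match, 0.0 the worst possible."""
    return 1.0 - js_value / js_max

print(normalized_score(0.05, 0.6931))  # ~0.928 for a near-matching prediction
```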

Figure 3: Distribution of taxa at the class rank. The target distribution is denoted in orange and the output of the model is denoted in blue.

5 Conclusions

In this work, we tackle the problem of directly predicting the distribution of the microbiota in metagenomic samples. In contrast to previous methods that are based on classifying single reads, we formulate the problem as a Multiple Instance Learning task and use permutation-invariant pooling layers in order to learn low-dimensional embeddings for whole sets of reads. We show that our proposed method can perform better than the baseline models at the higher taxonomic ranks. The MIL models presented could be used as an initial step to filter or preselect the potential genomes that more traditional alignment methods would need to take as input in order to increase their performance.

Further work could include exploring alternative base architectures or more sophisticated pooling methods that can better capture the interactions between reads. For example, one could use Janossy pooling [25], another permutation-invariant method, which can capture $k$-th-order interactions between the elements of a set. The models could also be combined with a probabilistic component, such as a Gaussian process over DNA sequences [9], to allow for uncertainty estimates on the predictions. Finally, as explained above, a possible issue is that observing only a summary of the read set can make it more difficult for the model to learn adequate features for the individual reads. A solution could be to first learn better instance-level embeddings to use as input, in order to aid the model in learning suitable bag-level embeddings.

References

  • [1] E. Afshinnekoo, C. Meydan, S. Chowdhury, D. Jaroudi, C. Boyer, N. Bernstein, J. M. Maritz, D. Reeves, J. Gandara, S. Chhangawala, et al. (2015) Geospatial resolution of human and bacterial diversity with city-scale metagenomics. Cell systems 1 (1), pp. 72–87. Cited by: §1, §4.2.
  • [2] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman (1990) Basic local alignment search tool. Journal of molecular biology 215 (3), pp. 403–410. Cited by: §2.
  • [3] A. Bowe, T. Onodera, K. Sadakane, and T. Shibuya (2012) Succinct de Bruijn graphs. In International Workshop on Algorithms in Bioinformatics, pp. 225–235. Cited by: §2.
  • [4] A. Brady and S. L. Salzberg (2009) Phymm and PhymmBL: metagenomic phylogenetic classification with interpolated Markov models. Nature methods 6 (9), pp. 673. Cited by: §2.
  • [5] A. Busia, G. E. Dahl, C. Fannjiang, D. H. Alexander, E. Dorfman, R. Poplin, C. Y. McLean, P. Chang, and M. DePristo (2019) A deep learning approach to pattern recognition for short DNA sequences. bioRxiv, pp. 353474. Cited by: §2.
  • [6] M. Carbonneau, V. Cheplygina, E. Granger, and G. Gagnon (2018) Multiple instance learning: a survey of problem characteristics and applications. Pattern Recognition 77, pp. 329–353. Cited by: §1, §3.3.1.
  • [7] C. Y. Chiu and S. A. Miller (2019) Clinical metagenomics. Nat Rev Genet 20, pp. 341–355. Cited by: Deep Multiple Instance Learning for Taxonomic Classification of Metagenomic read sets.
  • [8] M. I. Consortium et al. (2016) The metagenomics and metadesign of the subways and urban biomes (metasub) international consortium inaugural meeting report. Springer. Cited by: §1.
  • [9] V. Fortuin, G. Dresdner, H. Strathmann, and G. Rätsch (2018) Scalable gaussian processes on discrete domains. arXiv preprint arXiv:1810.10368. Cited by: §5.
  • [10] J. Foulds and E. Frank (2010) A review of multi-instance learning assumptions. The Knowledge Engineering Review 25 (1), pp. 1–25. Cited by: §1, §3.3.1.
  • [11] A. Fritz, P. Hofmann, S. Majda, E. Dahms, J. Dröge, J. Fiedler, T. R. Lesker, P. Belmann, M. Z. DeMaere, A. E. Darling, et al. (2019) CAMISIM: simulating metagenomes and microbial communities. Microbiome 7 (1), pp. 17. Cited by: §3.1.1, §3.1.
  • [12] S. Ganscha, V. Fortuin, M. Horn, E. Arvaniti, and M. Claassen (2018) Supervised learning on synthetic data for reverse engineering gene regulatory networks from experimental time-series. bioRxiv, pp. 356477. Cited by: §2.
  • [13] J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin (2017) Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1243–1252. Cited by: §3.2.1.
  • [14] H. Gourlé, O. Karlsson-Lindsjö, J. Hayer, and E. Bongcam-Rudloff (2018) Simulating illumina metagenomic data with insilicoseq. Bioinformatics 35 (3), pp. 521–522. Cited by: §3.1.
  • [15] A. C. Howe, J. K. Jansson, S. A. Malfatti, S. G. Tringe, J. M. Tiedje, and C. T. Brown (2014) Tackling soil diversity with the assembly of large, complex metagenomes. Proceedings of the National Academy of Sciences 111 (13), pp. 4904–4909. Cited by: §1.
  • [16] D. H. Huson, A. F. Auch, J. Qi, and S. C. Schuster (2007) MEGAN analysis of metagenomic data. Genome research 17 (3), pp. 377–386. Cited by: §2.
  • [17] M. Ilse, J. M. Tomczak, and M. Welling (2018) Attention-based deep multiple instance learning. arXiv preprint arXiv:1802.04712. Cited by: §3.3.1, §3.3.1.
  • [18] M. Jain, H. E. Olsen, B. Paten, and M. Akeson (2016) The oxford nanopore minion: delivery of nanopore sequencing to the genomics community. Genome biology 17 (1), pp. 239. Cited by: §1.
  • [19] D. Kim, L. Song, F. P. Breitwieser, and S. L. Salzberg (2016) Centrifuge: rapid and sensitive classification of metagenomic sequences. Genome research 26 (12), pp. 1721–1729. Cited by: §2.
  • [20] M. La Rosa, A. Fiannaca, R. Rizzo, and A. Urso (2015) Probabilistic topic modeling for the analysis and classification of genomic sequences. BMC bioinformatics 16 (6), pp. S2. Cited by: §2.
  • [21] Q. Liang, P. W. Bible, Y. Liu, B. Zou, and L. Wei (2019) DeepMicrobes: taxonomic classification for metagenomics with deep learning. bioRxiv, pp. 694851. Cited by: Table S3, §2, §2, §3.2.2, §3, Table 1.
  • [22] J. Lin (1991) Divergence measures based on the shannon entropy. IEEE Transactions on Information theory 37 (1), pp. 145–151. Cited by: §3.3.1.
  • [23] B. A. Methé, K. E. Nelson, M. Pop, H. H. Creasy, M. G. Giglio, C. Huttenhower, D. Gevers, J. F. Petrosino, S. Abubucker, J. H. Badger, et al. (2012) A framework for human microbiome research. nature 486 (7402), pp. 215. Cited by: Deep Multiple Instance Learning for Taxonomic Classification of Metagenomic read sets, §1, §1.
  • [24] M. D. Muggli, A. Bowe, N. R. Noyes, P. S. Morley, K. E. Belk, R. Raymond, T. Gagie, S. J. Puglisi, and C. Boucher (2017) Succinct colored de Bruijn graphs. Bioinformatics 33 (20), pp. 3181–3187. Cited by: §2.
  • [25] R. L. Murphy, B. Srinivasan, V. Rao, and B. Ribeiro (2018) Janossy pooling: learning deep permutation-invariant functions for variable-size inputs. arXiv preprint arXiv:1811.01900. Cited by: §5.
  • [26] S. Nayfach, Z. J. Shi, R. Seshadri, K. S. Pollard, and N. C. Kyrpides (2019) New insights from uncultivated genomes of the global human gut microbiome. Nature 568 (7753), pp. 505. Cited by: Deep Multiple Instance Learning for Taxonomic Classification of Metagenomic read sets.
  • [27] R. Poplin, P. Chang, D. Alexander, S. Schwartz, T. Colthurst, A. Ku, D. Newburger, J. Dijamco, N. Nguyen, P. T. Afshar, et al. (2018) A universal SNP and small-indel variant caller using deep neural networks. Nature biotechnology 36 (10), pp. 983. Cited by: §1.
  • [28] J. Qin, R. Li, J. Raes, M. Arumugam, K. S. Burgdorf, C. Manichanh, T. Nielsen, N. Pons, F. Levenez, T. Yamada, et al. (2010) A human gut microbial gene catalogue established by metagenomic sequencing. nature 464 (7285), pp. 59. Cited by: Deep Multiple Instance Learning for Taxonomic Classification of Metagenomic read sets, §1, §1.
  • [29] M. A. Quail, M. Smith, P. Coupland, T. D. Otto, S. R. Harris, T. R. Connor, A. Bertoni, H. P. Swerdlow, and Y. Gu (2012) A tale of three next generation sequencing platforms: comparison of ion torrent, pacific biosciences and illumina miseq sequencers. BMC genomics 13 (1), pp. 341. Cited by: §1, §4.1.
  • [30] M. Rojas-Carulla, I. Tolstikhin, G. Luque, N. Youngblut, R. Ley, and B. Schölkopf (2019) GeNet: deep representations for metagenomics. arXiv preprint arXiv:1901.11015. Cited by: Table S1, Table S3, §1, §2, §2, §3.1, §3.1, §3.2.1, §3, §4.1, Table 1.
  • [31] G. D. Schuler, J. A. Epstein, H. Ohkawa, and J. A. Kans (1996) [10] entrez: molecular biology database and retrieval system. In Methods in enzymology, Vol. 266, pp. 141–162. Cited by: §3.1.
  • [32] N. Segata, L. Waldron, A. Ballarini, V. Narasimhan, O. Jousson, and C. Huttenhower (2012) Metagenomic microbial community profiling using unique clade-specific marker genes. Nature methods 9 (8), pp. 811. Cited by: §2.
  • [33] B. J. Tully, E. D. Graham, and J. F. Heidelberg (2018) The reconstruction of 2,631 draft metagenome-assembled genomes from the global oceans. Scientific data 5, pp. 170203. Cited by: §1, §4.2.
  • [34] Q. Wang, G. M. Garrity, J. M. Tiedje, and J. R. Cole (2007) Naive bayesian classifier for rapid assignment of rrna sequences into the new bacterial taxonomy. Appl. Environ. Microbiol. 73 (16), pp. 5261–5267. Cited by: §2.
  • [35] K. A. Wetterstrand (2013) DNA sequencing costs: data from the NHGRI genome sequencing program (GSP). Cited by: §1.
  • [36] D. L. Wheeler, T. Barrett, D. A. Benson, S. H. Bryant, K. Canese, V. Chetvernin, D. M. Church, M. DiCuccio, R. Edgar, S. Federhen, et al. (2006) Database resources of the national center for biotechnology information. Nucleic acids research 35 (suppl_1), pp. D5–D12. Cited by: Appendix A, §1, §3.1.
  • [37] D. E. Wood and S. L. Salzberg (2014) Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome biology 15 (3), pp. R46. Cited by: §2.
  • [38] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. R. Salakhutdinov, and A. J. Smola (2017) Deep sets. In Advances in neural information processing systems, pp. 3391–3401. Cited by: §3.3.1.
  • [39] J. Zou, M. Huss, A. Abid, P. Mohammadi, A. Torkamani, and A. Telenti (2018) A primer on deep learning in genomics. Nature genetics, pp. 1. Cited by: §1.

Appendix A Supplementary material

To train and test our models, we have downloaded genomes from the NCBI RefSeq database [36]. The full list of accession numbers for the genomes used in our dataset can be found in our GitHub repository (https://github.com/MetagenomicMIL/MetaSetMIL).

Rank # of taxa
Phylum 37
Class 77
Order 167
Family 349
Genus 824
Species 1862
Table S1: Number of taxa per rank in our dataset. The selected accession numbers are a subset of the dataset used by [30]. See subsection 3.1.

The accuracy of the two baseline models at solving the single-read prediction task was evaluated and the results are shown in Table S2.

(Table S2 layout: rows GeNet and EmbedPool; columns Phylum and Species under novaseq and error-free training. The EmbedPool row is N/A at the Phylum columns.)
Table S2: Accuracy of the two baseline models trained on each dataset (higher is better).

Subsequently, all models were evaluated on solving the MIL task. The JS-divergence achieved by all models is shown in Table S3 while a comparison of our best performing model, GeNet + Deepset, and GeNet is depicted in Figure S1.

(Table S3 layout: same rows and columns as Table 1, reporting the raw JS-divergence values; the EmbedPool-based rows are N/A everywhere except the Species columns.)

Table S3: JS-divergence for all models trained on each dataset (lower is better). Our MIL models achieve superior performance at higher taxonomic ranks, up to the family level. See subsection 4.2 for more details.
Figure S1: Performance comparison of GeNet vs. GeNet + Deepset. GeNet + Deepset achieves superior performance at taxonomic ranks up to the family level.

Appendix B Hyperparameter grid for the trained models.

To train our models, we performed random search over the following hyperparameter grid:

General parameters for single read models
Batch Size 64, 128, 256, 512, 1024, 2048
General parameters for MIL models
Bag Size 64, 128, 512, 1024, 2048
Batch Size 1, 2, 4, 8
GeNet
Output size of ResNet 128, 256, 512, 1024
Use GeNet initialization scheme True, False
BatchNorm running statistics True, False
Optimizer Adam, SGD
Learning rate 0.001, 0.0005, 1.0 (for SGD)
Nesterov momentum (SGD only) 0.0, 0.9, 0.99
EmbedPool
Size of MLP hidden layer 1000, 3000
Optimizer Adam, RMSprop, SGD
Nesterov momentum (SGD only) 0.0, 0.5, 0.9, 0.99
Learning rate 0.001, 0.0005
Deepset pooling layer
Deepset hidden layer size 128, 256, 1024
Deepset output size 128, 1024
Dropout before network 0.0, 0.2, 0.5, 0.8
Deepset activation ReLU, Tanh, ELU
Attention pooling layer
Hidden layer size 128, 256, 512, 1024
Gated attention False, True
Attention rows 1, 10, 30, 60
Table S4: Hyperparameter grid