Applied data analysis has seen explosive growth in popular interest in recent years. In parallel with this growth, there has been increased interest in “democratizing” the tools and techniques for use by communities that might not otherwise have access to them. In many ways the machine learning problems faced by such communities are identical to those of larger institutions (e.g. return-on-investment optimization is technically similar regardless of the tax status of the company). Yet a significant shortage of skilled workers has created salary requirements that prevent nonprofits from competing. This reflects a broader trend of the nonprofit sector struggling to find IT talent. Exacerbating this problem are the struggles associated with integrating a machine learning expert or team into an existing business structure or software stack. To counteract this trend, organizations have for several years made great strides both in training new individuals to perform data analysis and in encouraging advanced practitioners (e.g. PhD graduates in Computer Science or Statistics) to pursue this work. The Data-for-Good community (such as the Data Science for Social Good fellowship and the IBM Social Good Fellowship) has achieved remarkable success in promoting work directly with not-for-profit organizations, non-governmental organizations, and local governments.
Here we encourage machine learning practitioners to consider the analysis of public biological datasets, with a focus on problems that would have the greatest value to medical facilities in underserved communities. Using our work on genotype-to-phenotype classification of antimicrobial resistance (AMR) as a case study, we note how applied machine learning (AML) techniques can yield real advances in the field. Biology and medicine are rich with a multitude of problems, many of which are too niche to attract big business investment.
We hope to attract the attention of the Data-for-Good community to this class of problems, which we believe will be an area of growth, both in and out of the Data-for-Good community, over the next decade. Here we present work in progress, which was originally intended to demonstrate to biologists the power AML can bring to many of the problems they are interested in. We believe it also serves as a good case study for the Data-for-Good movement: our use of off-the-shelf classifiers applied to public datasets to answer unique real-world problems closely resembles the type of problem the Data-for-Good community attempts to solve.
The very foundation of computational biology rests on the application of computer science to biological problems. Similarly, the biostatistics community has a long history of investigating the unique statistical problems that biology presents. Well-funded Silicon Valley startups, such as 23andMe, are in the business of collecting large amounts of data and applying statistical techniques typical of applied machine learning or “big data” analysis. Meanwhile, Deep Genomics targets particularly tricky biological problems with both copious data and advanced deep learning techniques. Historically, advances in computational biology have been achieved with well-targeted small datasets or datasets that are hard to generate. In contrast, current growth in the field is typified by large-scale aggregation of data using various protocols, standards and databases. We believe this aggregation of large datasets will create unique opportunities for a variety of AML solutions and, most importantly, increased involvement of practitioners.
2.1 Data-for-Good Aspects of AMR
AMR is a well-recognized problem with profound implications for global health CDC (US); Organization et al. (2014); House (2015). In many biological disciplines, grant application guidelines or funding agencies typically require that the data generated and used for analysis in the course of a study be made publicly available. This creates an opportunity for large-scale aggregation platforms, such as the Pathosystems Resource Integration Center (PATRIC) Gillespie et al. (2011), to serve a unique role. These platforms provide greater access to data but are currently geared toward professional biologists. Some of their pros and cons are:
Pros:
Ample availability of access
Presents applied and theoretical problems
Cons:
Easy to misinterpret
Unclear how to operationalize analytical results
We note that the principal disadvantages listed above mainly concern putting the data into a format that is easily accessible to an AML professional rather than a professional biologist. These problems are not unique to computational biology and have been surmounted in other domains through either standardization of the data processing or attachment of metadata.
3 AMR Case Study
Identification of regions and genes associated with antimicrobial resistance is an extensively studied topic. While we have analysed a variety of datasets (Table 2), here we focus principally on three species of bacteria: Streptococcus pneumoniae (β-lactam resistant), Acinetobacter baumannii (carbapenem resistant) and Mycobacterium tuberculosis (isoniazid resistant) Chewapreecha et al. (2014).
, logistic regression and deep learning among others) were investigated; however, these results are largely beyond the scope of this paper. All analysis was performed using Python 2.7 on a 32-core machine with 1 TB of RAM.
3.2 Data Preprocessing
Assembled DNA contigs (partially assembled medium-length strands of DNA that represent each isolate) were converted to k-mers (fixed-length short strands typically of length of ). Each isolate is represented by its unique k-mer counts as features. Table 1 shows how the choice of k affects the k-mer count and final dataset size. We have two variables we can tune: the first is the size of k, the second is the number of isolates to consider. Setting k = 1 is equivalent to finding the cellular GC-content (i.e. the percentage of guanine (G) or cytosine (C) nitrogenous bases on the DNA molecule), which is used for coarse-grained analysis in computational biology. We note that the overall size of the dataset matrix expands rapidly at roughly around (Figure 1). In addition, as demonstrated in Figure 2, a larger k yields a better ROC metric. Conversely, on another dataset (Figure 3) we note that both RF and Lasso perform fairly similarly regardless of the size of k. We believe further investigation of the generalizability across species and phenotypes is necessary and have noted when our analysis included k-mers of different lengths.
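The conversion from contigs to k-mer count features can be sketched as follows. This is a minimal illustration written for Python 3 using only the standard library, not the pipeline used in this study: the function names, the toy contigs, and the choice to skip windows containing ambiguous bases are all assumptions made for the example. It also prints the number of possible k-mers, 4^k, to show why the feature space grows quickly with k.

```python
from collections import Counter
from itertools import product


def count_kmers(contigs, k):
    """Count every overlapping k-mer across the contigs of one isolate."""
    counts = Counter()
    for contig in contigs:
        seq = contig.upper()
        # Slide a window of width k along the contig, one base at a time.
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Assumption: windows with ambiguous bases (e.g. N) are skipped.
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    return counts


def gc_content(contigs):
    """GC fraction, computed from the k = 1 counts."""
    counts = count_kmers(contigs, 1)
    total = sum(counts.values())
    return float(counts["G"] + counts["C"]) / total if total else 0.0


def feature_vector(counts, k):
    """Dense count vector over all 4**k possible k-mers, in lexicographic order."""
    return [counts["".join(p)] for p in product("ACGT", repeat=k)]


# Toy isolate made of two short, hypothetical contigs.
isolate = ["ATGCGC", "GGNAT"]
print(count_kmers(isolate, 2))
print(gc_content(isolate))

# The number of possible k-mers (columns of the dataset matrix) is 4**k.
for k in (1, 2, 4, 8, 12):
    print(k, 4 ** k)
```

In practice a sparse representation would likely be preferable to the dense vector shown here, since most of the 4^k possible k-mers never occur in a given isolate once k is moderately large.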
3.3 Overview of Results
RF performs remarkably well in general. We predict AMR phenotypes using an train/test split, obtaining accuracies as high as . Figure 5 shows that the accuracies of the algorithm decrease as a function of the number of isolates trained on, while Figure 6 shows that ultimately the accuracy plateaus on larger sample sizes. Despite lower accuracies, Figure 4 shows that the top-ranked gene regions remain stable. Smaller datasets may show accuracies that are concerning, yet they may still be accurate enough to identify key locations of mutations that confer phenotype. Further tuning of the algorithm, preprocessing of the data or feature selection would increase accuracy but would distract from our central thesis, which is chiefly that AML techniques can rapidly be applied to open, real-world biological problems.
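The kind of experiment described above, training a random forest on a subset of isolates and inspecting both held-out accuracy and the top-ranked features, can be sketched with scikit-learn as follows. Everything specific here is an assumption for illustration: the data are synthetic Poisson counts standing in for a k-mer matrix, the 80/20 split and the forest size are arbitrary choices, and `fit_and_score` is a hypothetical helper rather than code from this study. The sketch is written for Python 3.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)

# Synthetic stand-in for a k-mer count matrix: 200 isolates x 50 features.
n_isolates, n_features = 200, 50
X = rng.poisson(lam=3.0, size=(n_isolates, n_features))
y = rng.randint(0, 2, size=n_isolates)  # 1 = resistant, 0 = susceptible

# Make features 0 and 1 informative by raising their counts
# in the resistant isolates.
X[y == 1, 0] += 5
X[y == 1, 1] += 5

# Assumed 80/20 split; the held-out set is fixed across training sizes.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)


def fit_and_score(n_train):
    """Train on the first n_train isolates; return accuracy and top-2 features."""
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(X_train[:n_train], y_train[:n_train])
    acc = clf.score(X_test, y_test)
    top = np.argsort(clf.feature_importances_)[::-1][:2]
    return acc, top


# Compare accuracy and top-ranked features as the training set grows.
for n_train in (20, 80, 160):
    acc, top = fit_and_score(n_train)
    print(n_train, round(acc, 3), sorted(top.tolist()))
```

Comparing the top-ranked feature indices across training sizes is one simple way to check whether the important regions remain stable even when accuracy is lower.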
We have transformed the problem of genotype-to-phenotype classification into a supervised machine learning task in order to exploit the numerous algorithms and techniques available for classification. We believe the above demonstrates that AML techniques perform well enough to be used to investigate the biological functions involved in AMR, and we present some opportunities for such analysis below. These correspond to the following differing objectives:
Objective I, Maximize classification accuracy: Maximizing classification accuracy is incredibly important for long-term, deep investigations of genotypic changes in relation to phenotype. This approach offers great value when considering stable mutations and would allow for the identification of a genotypic fingerprint (i.e. a unique genetic string indicative of a phenotype) of a single type of AMR. This insight could be used for the construction of a biological assay (a device that provides rapid diagnosis in the field). A large collection of genetically similar isolates undergoing a similar mutation conferring resistance would be required, but a nuanced understanding of what defines resistance would be achieved. This approach is perhaps most similar to the “leaderboard” competitions for AML. Example use: design of an extremely accurate test for widespread, genetically similar outbreaks.
Objective II, Maximize generalization accuracy: Generalization of classifiers for AMR phenotypes can be evaluated across different bacteria for the same antibiotic or across antibiotics for the same bacterium. It is known that antibiotics affect common genes and gene functions in bacteria, yet little is known about the generalization of the AMR genotype across different bacteria. Example use: designing a general-purpose screen for AMR genotypes in environmental samples or designing an algorithmic tool for prioritizing first-line antibiotics in a new outbreak.
Objective III, Aggregate feature selection: Feature selection provides crucial insight into biological processes that confer AMR. The most productive way to cluster features remains an open problem and one that could yield important insight for biologists. In our representation of the problem, features are highly redundant, and collapsing these redundant features into meaningful clusters, or blocks, provides a powerful indication of the region of mutation that confers AMR. Example use: a small, fast-moving outbreak requires a coarse analysis to identify regions of mutation in order to choose an appropriate antibiotic.
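One simple way to collapse redundant features into blocks, as described under Objective III, is to merge overlapping k-mers by position. The sketch below is only one possible approach and rests on a strong assumption: that each top-ranked k-mer has already been mapped to a start position on a reference genome and carries an importance score. The function `merge_kmer_hits`, its `max_gap` parameter, and the example positions and scores are hypothetical.

```python
def merge_kmer_hits(hits, k, max_gap=0):
    """Collapse overlapping k-mer hits into blocks.

    hits: iterable of (start_position, importance) for k-mers of length k.
    max_gap: largest number of uncovered bases allowed between merged hits.
    Returns (block_start, block_end, summed_importance) tuples with the
    end position exclusive, sorted by summed importance, highest first.
    """
    blocks = []
    for start, score in sorted(hits):
        end = start + k
        if blocks and start <= blocks[-1][1] + max_gap:
            # This hit overlaps (or nearly touches) the current block: extend it.
            prev_start, prev_end, prev_score = blocks[-1]
            blocks[-1] = (prev_start, max(prev_end, end), prev_score + score)
        else:
            # No overlap with the current block: start a new one.
            blocks.append((start, end, score))
    return sorted(blocks, key=lambda b: b[2], reverse=True)


# Hypothetical top-ranked 10-mers: (start position, feature importance).
hits = [(100, 0.10), (101, 0.08), (103, 0.05),
        (500, 0.12), (900, 0.02), (902, 0.03)]
for block in merge_kmer_hits(hits, k=10):
    print(block)
```

Summing importance within a block means that a region covered by several moderately ranked k-mers can outrank a single highly ranked k-mer elsewhere, which may or may not be the desired behaviour; other aggregation rules (maximum, mean) are equally plausible.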
We are far from the first to identify computational biology as a productive area of research for AML researchers. There is a concerted effort in the biostatistical community to perform outreach towards biologists. While the separation in data analysis skills and domain knowledge between a typical biologist and a typical statistician is very wide, we believe the AML community, and in particular the Data-for-Good community, has both the technical and dispositional skills to bridge such a divide. Furthermore, we believe that the influx of aggregated datasets will provide both opportunity and need for such professionals.
- Breiman (2001) Breiman, Leo. Random Forests. Machine Learning, 45(1):5–32, October 2001. ISSN 0885-6125. doi: 10.1023/A:1010933404324. URL http://dx.doi.org/10.1023/A%3A1010933404324.
- CDC (US) CDC (US). Antibiotic resistance threats in the United States, 2013. Centers for Disease Control and Prevention, US Department of Health and Human Services, 2013.
- Chewapreecha et al. (2014) Chewapreecha, Claire, Harris, Simon R, Croucher, Nicholas J, Turner, Claudia, Marttinen, Pekka, Cheng, Lu, Pessia, Alberto, Aanensen, David M, Mather, Alison E, Page, Andrew J, Salter, Susannah J, Harris, David, Nosten, Francois, Goldblatt, David, Corander, Jukka, Parkhill, Julian, Turner, Paul, and Bentley, Stephen D. Dense genomic sampling identifies highways of pneumococcal recombination. Nat Genet, 46(3):305–309, March 2014. ISSN 1061-4036. URL http://dx.doi.org/10.1038/ng.2895.
- Freund & Schapire (1997) Freund, Yoav and Schapire, Robert E. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences, 55(1):119–139, 1997.
- Gillespie et al. (2011) Gillespie, Joseph J, Wattam, Alice R, Cammer, Stephen A, Gabbard, Joseph L, Shukla, Maulik P, Dalay, Oral, Driscoll, Timothy, Hix, Deborah, Mane, Shrinivasrao P, Mao, Chunhong, et al. Patric: the comprehensive bacterial bioinformatics resource with a focus on human pathogenic species. Infection and immunity, 79(11):4286–4298, 2011.
- House (2015) White House. National action plan for combating antibiotic-resistant bacteria. Accessed August, 8, 2015.
- Organization et al. (2014) World Health Organization et al. Antimicrobial resistance: global report on surveillance. World Health Organization, 2014.
- Pedregosa et al. (2011) Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.