Random Fragments Classification of Microbial Marker Clades with Multi-class SVM and N-Best Algorithm

04/19/2019
by   Jingwei Liu, et al.
Modeling microbial clades from microarray genome sequences is a challenging problem in biology, especially for discovering and categorizing gene isolates of new species. Because marker family genome sequences play an important role in describing specific microbial clades within species, a support vector machine (SVM) based framework for microbial species classification with the N-best algorithm is constructed to classify centroid marker genome fragments randomly generated from marker genome sequences on MetaRef. A time series feature extraction method is proposed that segments the centroid gene sequences and maps them into spaces of different dimensions. Two ways of splitting the data are investigated: randomly splitting the fragments along each genome sequence (DI), or separating the genome sequences into two parts (DII). Two fragment recognition tasks, dimension-by-dimension and sequence-by-sequence, are investigated. The selection of k-mer size, the overlap of segmentation and the effect of the random split percent are also discussed. Experiments on 12390 marker genome sequences belonging to marker families of 17 species from MetaRef show that, for both DI and DII and in both dimension-by-dimension and sequence-by-sequence recognition, the overall recognition accuracy exceeds 28% for the top-1 candidate and 91% for the top-10 candidates on both training and testing sets.

1 Introduction

The enormous volume of genomic sequences produced by new-generation sequencing machines, from second-generation (2G) to third-generation (3G) and fourth-generation (4G) platforms [1], places high demands on automatic classification and identification techniques for genomic analysis. Designing more accurate and effective models for clade or species classification and for disease diagnosis is challenging given the vast number of genomic fragments. Over the past three decades, many machine learning and statistical learning methods have been developed for genomic analysis, such as BLAST [2-4], hidden Markov models (HMM) [5-15], support vector machines (SVM) [16-34], and a combination of linear discriminant analysis (LDA) and artificial neural networks (ANN) [35].

HMM is a powerful model in microbial clade classification and genome-associated disease analysis [5,14,15]. However, HMM has a limitation in genomic analysis: the number of states in the model grows exponentially with the k-mer size, which leads to sparseness of the gene segments available for training each state. This phenomenon is especially pronounced for rare microbes and diseases. SVM is another popular technique in pattern recognition and machine learning that copes with relatively sparse data [36-39]. Much of the SVM literature addresses the classification and prediction of microbial genomes [16,23,28,31]. The standard SVM [36-39] is adopted in our fragment modeling.
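As a rough illustration of this growth, the sketch below assumes one state per distinct k-mer over the four-letter DNA alphabet, which gives 4^k states. This counting rule and the helper name `kmer_state_count` are assumptions made here for illustration; the exact state count used in the HMM literature cited above may differ.

```python
# Assumption: one HMM state per distinct k-mer over {A, C, T, G}, i.e. 4**k states.
ALPHABET_SIZE = 4


def kmer_state_count(k: int) -> int:
    """Number of distinct k-mers over a four-letter alphabet."""
    return ALPHABET_SIZE ** k


if __name__ == "__main__":
    # Show how quickly the count grows for a few small k-mer sizes.
    for k in (2, 5, 10, 20):
        print(k, kmer_state_count(k))
```

Under this assumption, the count at k = 10 already exceeds the 12390 sequences available in the experiments many times over, which is the sparseness problem described above.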

Feature selection is the first and key step in genomic information analysis. The k-mer based sequence binning method and the K-means clustering method are two widely used techniques. k-mer binning is adopted in genomic analysis [40-43], in SVM-based genomic analysis [31], and in an LDA and ANN based genomic analysis platform [35]. K-means clustering is used in both HMM-based [7] and SVM-based genomic analysis [28]. Although these two preprocessing methods capture statistical properties of genome sequences, they may mask or neglect subtle local information along a genome sequence. The genomic information that distinguishes some species or diseases may reside at particular local positions of the DNA sequence under a suitable measurement. For example, the differences among the DNA sequences of various cancers from the same person may appear only at certain local positions. For microbial classification, the ideal classification or clustering model places fragments from the same species closer to each other than to fragments from other species. Hence, the actual fragments along gene sequences are considered in this paper. One of our motivations is to examine how accurately each fragment is assigned to the species under consideration, regardless of the data volume and computation time involved. This kind of classification task is called dimension-by-dimension classification. The k-mer size and the overlap size are two important issues in this feature extraction method and are investigated in the experiments. From another point of view, if all the fragments extracted from the same sequence are taken together as one connected subsequence, the task of determining the species of this pseudo subsequence is called sequence-by-sequence classification. In previous literature on metagenomic sequence analysis, sequences that are too short are discarded; our framework has no such limitation.

Besides accuracy, precision and recall are popular criteria in the statistical analysis of biological information [44]. The area under the curve (AUC) of the receiver operating characteristic (ROC) is a further metric for classification [45]. As pointed out in [46], the practical use of the ROC curve in parameter optimization is to determine the optimal operating point (OOP). Because of the over-fitting problem in machine learning, the precision and recall criterion and ROC curve learning are also prone to over-optimism [47]. Using the multi-class SVM employed in the experiments [39], we report the maximum results over all SVM parameters on both training and testing data sets, both to show the classification performance on fragmental genome sequences in the dimension-by-dimension and sequence-by-sequence tasks and to avoid over-optimization on the training data sets. Ranking is a popular technique in bioinformatics [48,49] and in speech recognition, where it is called the N-best algorithm [50-52]. The SVM combined with the N-best algorithm is put forward to report the experimental results and to give an intuitive picture of the confusion among microbial species.

Additionally, experiments on fragments from the genome sequences of marker families of all microbial species on MetaRef {http://metaref.org} show the performance of the SVM with N-best model and the effect of k-mer size and overlap in microbial genomic information analysis.

2 Materials and Methods

2.1 Sample preparation

The genomic sequences are manually extracted from MetaRef and the centroids_v.1.0.fna file [53]. The marker genomic sequences of all microbes fall into 17 species {Mycoplasma gallisepticum, Alkaliphilus metalliredigens, Streptococcus gallolyticus, Enterococcus gallinarum, Geobacter metallireducens, Treponema pallidum, Phaeobacter gallaeciensis, Bifidobacterium gallicum, Cupriavidus metallidurans, Isosphaera pallida, Burkholderia mallei, Burkholderia pseudomallei, Eubacterium hallii, Leuconostoc fallax, Mycoplasma alligatoris, Prevotella pallens, Vibrio coralliilyticus}, and 12390 pure DNA {A, C, T, G} sequences of marker families without ambiguous DNA [54] are used in the experiments. The frequency distribution of the 12390 gene sequence lengths is shown in Fig.1. All of these DNA sequences are used, with lengths from a minimum of 51 to a maximum of 14298 and an average of 748.9265; no short DNA sequence is discarded. In the pattern recognition task, the 17 species are treated as 17 classes, so the number of classes M is set to 17. The bases {A, C, T, G} are mapped to {1,2,3,4} respectively. According to the definitions of marker family gene and centroid [53], the selected 12390 genes represent all the clades under the 17 species.
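The base-to-integer mapping and the exclusion of ambiguous DNA can be sketched as follows. This is a minimal illustration rather than the authors' preprocessing code; the helper names `is_pure_dna` and `encode` are hypothetical, and case-insensitive matching is an assumption.

```python
from typing import List

# Mapping stated in the text: A -> 1, C -> 2, T -> 3, G -> 4.
BASE_TO_INT = {"A": 1, "C": 2, "T": 3, "G": 4}


def is_pure_dna(seq: str) -> bool:
    """True if the sequence is non-empty and contains only A, C, T, G."""
    return len(seq) > 0 and all(base in BASE_TO_INT for base in seq.upper())


def encode(seq: str) -> List[int]:
    """Encode a pure DNA string as integers in {1, 2, 3, 4}."""
    if not is_pure_dna(seq):
        raise ValueError("sequence contains ambiguous or non-DNA symbols")
    return [BASE_TO_INT[base] for base in seq.upper()]
```

A sequence containing an ambiguity code such as `N` is rejected by `is_pure_dna`, mirroring the restriction to pure {A, C, T, G} sequences.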

Figure 1: The frequency of 12390 gene sequences length (bp).

Figure 2: Framework of SVM with N-best candidates recognition (a) on DI (b) on DII.

To investigate the k-mer size of gene sequences, k-mer sizes of {10, 20, 30, 40, 50} are considered, and overlaps of {0, 25%, 50%, 75%} of the k-mer length are examined. Two fragment split strategies are discussed. Taking all genome sequences of each species into account, for a fixed k-mer size, overlap size and split percent, the first strategy (DI) segments each genome sequence into the k-dimension (bp) sample space, randomly selects fragments according to the given split percent as training data, and uses the remaining part (1 - split percent) as testing data (Fig.2(a)). The second strategy (DII) randomly splits all genome sequences into training and testing parts according to the given split percent; the sequences in the training and testing sets are then mapped separately into the k-dimension sample space according to the given k-mer size and overlap size (Fig.2(b)). Because a DI data set is obtained along each sequence, fragments from the same sequence still carry the inherent biological information of that sequence, even though the training and testing sets are separated according to the split percent. In contrast, the training and testing sets of DII share no such biological information, except that the fragments belong to the same species.
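The two split strategies can be sketched as below. This is an illustrative reading of the description, not the authors' implementation: the function names, the rule of dropping a trailing piece shorter than k, the rounding of the overlap to `int(k * overlap)` positions, and the interpretation of DII as assigning whole sequences to either side are all assumptions.

```python
import random
from typing import Dict, List, Sequence, Tuple

Fragment = Tuple[str, List[int]]  # (sequence id, k-length fragment)


def segment(seq: Sequence[int], k: int, overlap: float) -> List[List[int]]:
    """Cut seq into k-length windows sharing int(k * overlap) positions.

    Assumption: a trailing piece shorter than k is dropped.
    """
    step = k - int(k * overlap)
    if step <= 0:
        raise ValueError("overlap must be smaller than 100% of k")
    return [list(seq[i:i + k]) for i in range(0, len(seq) - k + 1, step)]


def split_di(seqs: Dict[str, Sequence[int]], k: int, overlap: float,
             split: float, rng: random.Random
             ) -> Tuple[List[Fragment], List[Fragment]]:
    """DI: segment each sequence, then randomly pick its training fragments."""
    train: List[Fragment] = []
    test: List[Fragment] = []
    for seq_id in sorted(seqs):
        frags = segment(seqs[seq_id], k, overlap)
        rng.shuffle(frags)
        n_train = round(split * len(frags))
        train += [(seq_id, f) for f in frags[:n_train]]
        test += [(seq_id, f) for f in frags[n_train:]]
    # At split = 100% the training set also serves as the testing set.
    return (train, list(train)) if split >= 1.0 else (train, test)


def split_dii(seqs: Dict[str, Sequence[int]], k: int, overlap: float,
              split: float, rng: random.Random
              ) -> Tuple[List[Fragment], List[Fragment]]:
    """DII: randomly assign whole sequences to a side, then segment each side."""
    ids = sorted(seqs)
    rng.shuffle(ids)
    n_train = round(split * len(ids))
    train = [(i, f) for i in ids[:n_train] for f in segment(seqs[i], k, overlap)]
    test = [(i, f) for i in ids[n_train:] for f in segment(seqs[i], k, overlap)]
    return (train, list(train)) if split >= 1.0 else (train, test)
```

In this sketch a DI split places fragments of the same sequence on both sides, whereas a DII split keeps every sequence entirely on one side, which is the distinction drawn in the paragraph above.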

After the segmentation process above, every fragment in these data sets is treated as a dimension-by-dimension segment of a genomic sequence. The pattern recognition task on each k-length fragment is called dimension-by-dimension classification.

Furthermore, especially when the overlap size is 0, the training fragments and the testing fragments that come from the same sequence can be treated as two pseudo subsequences of that genomic sequence. Because they are randomly selected, the original order along the sequence is broken in DI. The pattern recognition tasks on the randomly ordered pseudo subsequences in DI and on the normally ordered subsequences in DII are called sequence-by-sequence classification. When the overlap percent is larger than 0, all the subsequences in DI and DII are pseudo subsequences; the pattern recognition task can still be performed on them and is also called sequence-by-sequence classification.

In brief, the genomic sequences are mapped into the k-dimension sample space, that is, into k-bp genomic fragments. To construct the SVM model, the split percents used in the experiments are {60%, 80%, 100%}. When the split percent is 100%, the training set and the testing set are the same. For both DI and DII, the sizes of the main training and testing sets used in the experiments are listed in Supplementary Table 1.

2.2 Multi-class SVM

Support Vector Machine (SVM) is a popular statistical learning and machine learning method based on structural risk minimization and VC dimension theory, and it overcomes the curse of dimensionality of neural networks [36]. It performs efficiently and accurately in many classification tasks and is therefore widely used in various information processing fields. The traditional SVM is defined for binary classification; the multi-class SVM is built on binary SVMs with the one-versus-one max-wins voting strategy or the one-versus-all winner-takes-all strategy [37,38]. Suppose the sample data are $(x_i, y_i)$, $i = 1, \dots, l$, with $x_i \in \mathbb{R}^k$ and $y_i \in \{1, -1\}$. Given a nonlinear mapping function $\phi$, the sample $x_i$ is projected to the high dimensional space as $\phi(x_i)$ to obtain a good separation of different samples. The standard C-SVM [40] solves

$$\min_{w,b,\xi} \; \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i \quad \text{subject to} \quad y_i \left( w^T \phi(x_i) + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, l \qquad (1)$$

where $(w, b)$ are the parameters of the optimal linear hyperplane $w^T \phi(x) + b = 0$, C is the penalty parameter of the error term, and $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$ is the kernel function. The key to solving (1) lies in the dual of the above optimization problem and in the selection of the kernel function. The radial basis function (RBF) kernel is adopted in the experiments, where $K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)$, $\gamma > 0$.

Standard LibSVM 3.18 {http://www.csie.ntu.edu.tw/~cjlin/libsvm/} is used to model the fragment vector space, and the parameter range of the C-SVM with RBF kernel is $(C, \gamma) \in$ {0.0625, 0.125, 0.25, 0.5, 1, 2, 3, 4} $\times$ {0.0625, 0.125, 0.25, 0.5, 1, 2, 3, 4}. In total, 64 parameter choices are used in the experiments.
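The RBF kernel and the 64-point parameter grid can be written out in a few lines. The sketch below is only an illustration in plain Python and does not call LibSVM; `evaluate` is a hypothetical callback standing in for "train one SVM with these parameters and return its accuracy", and pairing the two value sets as (C, gamma) is an assumption.

```python
import math
from itertools import product
from typing import Callable, List, Sequence, Tuple

# The eight values listed in the text, used for both C and gamma (8 x 8 = 64).
GRID_VALUES = [0.0625, 0.125, 0.25, 0.5, 1, 2, 3, 4]


def rbf_kernel(x: Sequence[float], z: Sequence[float], gamma: float) -> float:
    """K(x, z) = exp(-gamma * ||x - z||^2) for two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def parameter_grid() -> List[Tuple[float, float]]:
    """All (C, gamma) pairs of the grid."""
    return list(product(GRID_VALUES, GRID_VALUES))


def best_over_grid(evaluate: Callable[[float, float], float]
                   ) -> Tuple[float, float, float]:
    """Return (best accuracy, C, gamma) over the grid.

    `evaluate` is a hypothetical callback that trains and scores one model.
    """
    return max((evaluate(c, g), c, g) for c, g in parameter_grid())
```

Reporting the maximum over the grid corresponds to the "maximum results over all parameters" described in the Introduction; selecting parameters on held-out data would be a different protocol.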

2.3 Classification Accuracy

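For genomic fragment classification, two classification criteria are used in the experiments. The first is for dimension-by-dimension classification; it measures the rate at which k-bp fragments are correctly classified,

$$\text{dimension-by-dimension accuracy} = \frac{\text{number of correctly classified } k\text{-bp fragments}}{\text{total number of } k\text{-bp fragments}} \qquad (2)$$

The other criterion measures whether the pseudo subsequence composed of randomly selected fragments from the same sequence is assigned to the right species,

$$\text{sequence-by-sequence accuracy} = \frac{\text{number of correctly classified pseudo subsequences}}{\text{total number of pseudo subsequences}} \qquad (3)$$

The two criteria examine different aspects of gene fragment sequences; both address fragments, that is, partial pieces of genomic sequences.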
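A minimal sketch of the two accuracy rates is given below. The text does not state how fragment-level decisions are combined into a decision for a pseudo subsequence; majority voting over the fragments of a sequence is assumed here purely for illustration, and the function names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Sequence


def fragment_accuracy(true_labels: Sequence[Hashable],
                      predicted_labels: Sequence[Hashable]) -> float:
    """Fraction of fragments whose predicted species equals the true species."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def sequence_accuracy(seq_ids: Sequence[Hashable],
                      true_labels: Sequence[Hashable],
                      predicted_labels: Sequence[Hashable]) -> float:
    """Fraction of pseudo subsequences assigned to the right species.

    Assumption: a subsequence takes the species predicted most often among
    its fragments; ties go to the label encountered first.
    """
    votes: Dict[Hashable, Counter] = defaultdict(Counter)
    truth: Dict[Hashable, Hashable] = {}
    for sid, t, p in zip(seq_ids, true_labels, predicted_labels):
        votes[sid][p] += 1
        truth[sid] = t
    correct = sum(1 for sid, c in votes.items()
                  if c.most_common(1)[0][0] == truth[sid])
    return correct / len(votes)
```

Under this voting assumption a subsequence can be recognized correctly even when some of its fragments are misclassified, which is one way the sequence-level rate can exceed the fragment-level rate.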

2.4 Multi–candidate Accuracy

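Ranking is a popular technique in genomic analysis and information sciences. To show the relationship among the species, the top n ranking classes are considered in the recognition stage according to the top-n probability outputs of the multi-class SVM voting. The classification accuracy criteria with the N-best algorithm are revised accordingly, so that a fragment or pseudo subsequence counts as correctly recognized if its true species is among its top-n candidates,

$$\text{top-}n \text{ dimension-by-dimension accuracy} = \frac{\text{number of } k\text{-bp fragments whose true species is among the top-}n \text{ candidates}}{\text{total number of } k\text{-bp fragments}} \qquad (4)$$

$$\text{top-}n \text{ sequence-by-sequence accuracy} = \frac{\text{number of pseudo subsequences whose true species is among the top-}n \text{ candidates}}{\text{total number of pseudo subsequences}} \qquad (5)$$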
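The N-best criterion can be sketched as a top-n accuracy over per-class scores. The code below assumes each sample comes with one score per class (for example the probability outputs mentioned above) and that classes are indexed from 0; the function names are hypothetical.

```python
from typing import List, Sequence


def top_n_classes(scores: Sequence[float], n: int) -> List[int]:
    """Indices of the n highest-scoring classes, best first."""
    ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
    return ranked[:n]


def top_n_accuracy(true_labels: Sequence[int],
                   score_rows: Sequence[Sequence[float]],
                   n: int) -> float:
    """Fraction of samples whose true class is among the top-n candidates."""
    hits = sum(1 for t, scores in zip(true_labels, score_rows)
               if t in top_n_classes(scores, n))
    return hits / len(true_labels)
```

Because the top-n candidate set only grows with n, this rate cannot decrease as n increases, and it reaches 1 when n equals the number of classes (17 in the experiments).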

3 Results

3.1 Accuracy on DI with different k-mer size and Overlap

Because there are many combinations of k-mer size, overlap and split percent, we first investigate the effect of these parameters in the special case where all data are used for training, that is, split percent=100% on DI. The case of overlap percent=0 is examined first. The dimension-by-dimension recognition results on the training set show that the accuracy rates rise as the number of N-best candidates and the k-mer size increase (Fig.3(a), Supplementary Fig.1(a)).

Secondly, we examine the performance with k=50 under different overlap percents: the accuracy rates increase with the number of N-best candidates (Fig.3(b)) and decrease with the overlap percent (Supplementary Fig.1(b)).

  
Figure 3: Dimension-by-dimension recognition rates on DI along multi-candidates, panels (a)-(f).

Thirdly, we apply the multi-class SVM with the N-best algorithm to DI with split percent=60% and 80%. In the dimension-by-dimension case, the accuracy rates on the training sets increase with the number of candidates (Fig.3(c)(d)) and with the k-mer size (Supplementary Fig.1(c)(d)). On the testing sets, the accuracy rates increase with the number of candidates (Fig.3(e)(f)) and decrease slightly with the k-mer size (Supplementary Fig.1(e)(f)), especially for low numbers of candidates.

Fourthly, the multi-class SVM with the N-best algorithm is applied to the corresponding sequence-by-sequence cases, and the same conclusions hold as in the dimension-by-dimension cases (Fig.4(a)(b), Supplementary Fig.2(a)(b)). At the same time, the recognition accuracy in pseudo sequence-by-sequence form is higher than in dimension-by-dimension fragment form.

Fifthly, in the sequence-by-sequence cases on DI with split percent=60% and 80%, the accuracy rates on the training sets increase with the number of candidates (Fig.4(c)(d)) but not consistently with the k-mer size (Supplementary Fig.2(c)(d)). As the k-mer size goes from 10 to 30, the accuracy rates increase for the top-1 to top-6 candidate cases but decrease for the top-10 case (Supplementary Fig.2(c)(d)). As the k-mer size goes from 30 to 50, the accuracy rates decrease for top-2 to top-10 in one case (Supplementary Fig.2(c)) and for top-3 to top-10 in the other (Supplementary Fig.2(d)). These results show how the accuracy rates vary with the k-mer size and the number of top-n candidates. For each fixed k-mer size, the principle still holds that the larger the number of top-n candidates, the higher the accuracy rate. On the testing data, the accuracy rates for each fixed k-mer size increase with the number of top-n candidates (Fig.4(e)(f)), whereas the accuracy rates for each fixed number of top-n candidates decrease with the k-mer size (Supplementary Fig.2(e)(f)). An appropriate choice of k-mer size for predicting genomic fragments and sequences would therefore be 10.

  
Figure 4: Sequence-by-sequence recognition rates on DI along multi-candidates, panels (a)-(f).

Sixthly, for the overlap problem on DI with split percent=60% and 80%, we discuss the training and testing cases separately for the data sets with k-mer size 50. In the dimension-by-dimension case, the behavior on the training sets and the testing sets differs. On the training sets, the accuracy rates for each overlap percent increase with the number of top-n candidates (Fig.5(a)(b)), and the accuracy rates for each fixed number of top-n candidates decrease with the overlap percent (Supplementary Fig.3(a)(b)). On the testing sets, the accuracy rates for each overlap percent increase with the number of top-n candidates (Fig.5(c)(d)), but for a fixed number of top-n candidates they do not vary consistently with the overlap percent (Supplementary Fig.3(c)(d)): there is a slight improvement for low top-n, which does not hold for large top-n.

Seventhly, in the sequence-by-sequence cases for the data sets with k-mer size 50 and split percent=60% and 80% on DI, for a fixed overlap percent the accuracy rates increase with the number of top-n candidates on both the training (Fig.6(a)(b)) and the testing data sets (Fig.6(c)(d)). For a given number of top-n candidates, the accuracy rates behave differently along the overlap percent. On the training sets, the accuracy rates decrease with the overlap percent for low top-n and increase for high top-n (Supplementary Fig.4(a)(b)). On the testing data, however, the accuracy rates increase with the overlap percent (Supplementary Fig.4(c)(d)). These results show that optimizing the overlap is somewhat complicated across training and testing sets. The optimal overlap would be a balance between the two.

  
Figure 5: Dimension-by-dimension recognition rates on DI, panels (a)-(d).

Figure 6: Sequence-by-sequence recognition rates on DI, panels (a)-(d).

The experimental results on DI above show that a high k-mer size is preferred on the training sets, whereas a relatively low k-mer size gives relatively good performance on the testing sets. As for the overlap percent, sequence-by-sequence recognition on the testing data sets prefers a higher overlap. All the experimental results of this section are listed in Supplementary Table 2 and Supplementary Table 3.

3.2 Accuracy on DII with different k-mer size

This section shows the performance of the multi-class SVM with the N-best algorithm on DII, which differs from DI in its biological information: experiments on DI can be regarded as a kind of gene prediction, and experiments on DII as a kind of gene categorization. The results on data sets with split percents 60% and 80% are investigated. All the results of the dimension-by-dimension cases are listed in Supplementary Table 4, and those of the sequence-by-sequence cases in Supplementary Table 5.

Firstly, in the dimension-by-dimension case, the accuracy rates on the training data sets increase with the number of top-n candidates and with the k-mer size (Fig.7(a)(b)(c)(d)). On the testing data sets, the accuracy rates increase with the number of top-n candidates and decrease slightly with the k-mer size in some top-n cases, while in the other top-n cases the fluctuation of the accuracy rate along the k-mer size is not distinct (Supplementary Fig.5(a)(b)(c)(d)).

  
Figure 7: Dimension-by-dimension recognition rates on DII, panels (a)-(d).

Secondly, in the sequence-by-sequence cases, the accuracy rates on the training data sets increase with both the number of top-n candidates and the k-mer size (Fig.8(a)(b)(c)(d)), while on the testing data sets they increase with the number of top-n candidates and decrease slightly with the k-mer size (Supplementary Fig.6(a)(b)(c)(d)).

Again, an appropriate k-mer size for DII type data sets would be 10. This conclusion is also helpful for modeling metagenomic sequences with HMM, although it is difficult for the HMM framework to handle high dimensional k-mer sizes.

  
Figure 8: Sequence-by-sequence recognition rates on DII, panels (a)-(d).

Moreover, across all the experimental results for split percents 60% and 80%, the trends along the overlap and k-mer size parameters are almost the same, which also supports our models and their performance. For both DI and DII, the fact that a k-mer size of 10 with top-10 candidates achieves high recognition rates demonstrates that, within the multi-class SVM with N-best algorithm framework, the centroid marker family genomic sequences of all the microbes can be discriminated well if a confusion of 10 among the 17 species is tolerated. The confusion is mainly caused by the similarities among species, as measured by the multi-class SVM model, from the viewpoint of evolution [55,56].

4 Discussion

We develop a multi-class SVM based N-best algorithm for microbial species classification, realize fragment classification of genomic sequences in the true sense without preprocessing, and discuss the effects of k-mer size, overlap and split percent. The genome sequences of marker families from all microbes on MetaRef are treated as time series of different lengths and used in the experiments. In general, the results over the given grid of SVM parameters show that the larger the k-mer size from 10 to 50, the higher the accuracy rates on the training sets. The conclusions on the testing sets are the reverse. The overlap of fragments also affects the data sets differently. From the viewpoint of prediction, a low overlap is recommended. Furthermore, the N-best algorithm shows that top-10 candidates give good results on both the dimension-by-dimension and the sequence-by-sequence recognition tasks. The experiments are designed using only the {A,C,T,G} information along genomic sequences; if detailed 3-D genomic structure information or biological information were available, the prediction accuracy would likely be higher. Although the multi-class SVM with the N-best algorithm and the feature selection are constructed for the challenging fragment classification task, they could be applied to cancer diagnosis from genomic fragments in the future. How to reduce the time consumption and memory storage of SVM and how to propose new SVM kernels are our future research interests. Finally, a substantial improvement of the top-1 classification accuracy is also a research goal of ours in microbial genomic analysis.

5 Acknowledgements

This paper is partially supported by the China Scholarship Council (No. 201303070216) and the 863 Project of China (2008AA02Z306). It was completed during a visit to the University of Southern California from March 2014 to February 2015.

5.0.1 Conflict of interest statement.

None declared.

References

  • [1] Ku CS,Roukos DH.(2013) From next-generation sequencing to nanopore sequencing technology: paving the way to personalized genomic medicine. Expert Rev. Med. Devices, 10(1),1–6.
  • [2] Lipman DJ,Pearson WR.(1985) Rapid and sensitive protein similarity searches. Science, 227 (4693), 1435–1441.
  • [3] Pearson WR,Lipman DJ.(1988) Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. USA, 85(8), 2444–2448.
  • [4] Altschul SF,Gish W,Miller W,Myers EW,Lipman DJ.(1990) Basic local alignment search tool. Journal of Molecular Biology, 215, 403–410.
  • [5] Krogh A,Mian IS,Haussler D.(1994) A hidden Markov model that finds genes in E. coli DNA. Nucleic Acids Research, 22, 4768–4778.
  • [6] Burge C,Karlin S.(1997) Prediction of complete gene structures in human genomic DNA. Journal of Molecular Biology, 268, 78–94.
  • [7] Salzberg SL,Delcher AL,Kasif S,White O.(1998) Microbial gene identification using interpolated Markov models. Nucleic Acids Research, 26(2), 544-548.
  • [8] Lukashin AV,Borodovsky M.(1998) GeneMark.hmm: new solutions for gene finding. Nucleic Acids Research, 26(4), 1107-1115.
  • [9] Pedersen JS,Hein J.(2003) Gene finding with a hidden Markov model of genome structure and evolution. Bioinformatics, 19(2), 219-227.
  • [10] Cawley SL,Pachter L.(2003) HMM sampling and applications to gene finding and alternative splicing. Bioinformatics, 19 Suppl. 2: ii36–ii41.
  • [11] DePristo MA,Banks E,Poplin RE,Garimella KV,Maguire JR,Hartl C,Philippakis AA,del Angel G,Rivas MA, Hanna M, et al.(2011) A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. , 43(5), 491–498.
  • [12] Abubucker S,Segata N,Goll J,Schubert AM,Izard J,Cantarel BL,Rodriguez-Mueller B,Zucker J, Thiagarajan M,Henrissat B, et al. (2012) Metabolic reconstruction for metagenomic data and its application to the human microbiome. PLoS Comput Biol, 8(6): e1002358.
  • [13] Segata N,Izard J,Waldron L,Gevers D,Miropolsky L,Garrett WS, Huttenhower C.(2011)Metagenomic biomarker discovery and explanation. Genome Biology, 12(6):R60.
  • [14] Segata N,Waldron L,Ballarini A,Narasimhan V,Jousson O,Huttenhower C.(2012) Metagenomic microbial community profiling using unique clade-specific marker genes. Nature Methods, 9(8), 811–814.
  • [15] Skewes-Cox P,Sharpton TJ,Pollard KS,DeRisi JL.(2014) Profile hidden Markov models for the detection of viruses within metagenomic sequence data. PLoS ONE, 9(8): e105067.
  • [16] Brown MP,Grundy WN,Lin D,Cristianini N,Sugnet CW,Furey TS,Ares M,Haussler D.(2000) Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc. Natl Acad. Sci. USA, 97, 262–267.
  • [17] Ramaswamy S,Tamayo P,Rifkin R,Mukherjee S,Yeang CH,Angelo M,Ladd C,Reich M,Latulippe E,Mesirov JP, et al.(2001) Multiclass cancer diagnosis using tumor gene expression signatures. Proc. Natl Acad. Sci. USA, 98, 15149–15154.
  • [18] Guyon I,Weston J,Barnhill S,Vapnik V.(2002) Gene selection for cancer classification using support vector machines. Machine Learning, 46, 389–422.
  • [19] Bao L,Sun ZR.(2002) Identifying genes related to drug anticancer mechanisms using support vector machine. FEBS Letters, 521, 109–114.
  • [20] Cho SB,Won HH.(2003) Machine learning in DNA microarray analysis for cancer classification. Proceedings of the First Asia-Pacific bioinformatics conference on Bioinformatics, 189-198.
  • [21] Su Y,Murali TM,Pavlovic V,Schaffer M,Kasif S.(2003) RankGene: identification of diagnostic genes based on expression data. Bioinformatics, 19(12), 1578–1579.
  • [22] Li F,Yang YM.(2005) Using recursive classification to discover predictive features. ACM Symposium on Applied Computing. March 13-17, 2005, Santa Fe, New Mexico, USA. 104–1058.
  • [23] Krause L,McHardy AC,Nattkemper TW,Puhler A,Stoye J,Meyer F.(2007) GISMO–gene identification using a support vector machine for ORF classification. Nucleic Acids Research, 35(2), 540–549.
  • [24] Zhou X,Tuck DP.(2007) MSVM-RFE: extensions of SVM-RFE for multiclass gene selection on DNA microarray data. Bioinformatics, 23(9), 1106–1114.
  • [25] Tsai MH,Chang JD,Chiu SH,Lai CH.(2007) Identification of marker genes discriminating the pathological stages in ovarian carcinoma by using support vector machine and systems biology. in Randall M, Abbass HA, Wiles J (eds): ACAL 2007, LNAI 4828, 381–389.
  • [26] Wu S,Zhang Y.(2008) A comprehensive assessment of sequence-based and template-based methods for protein contact prediction. Bioinformatics, 24, 924–931.
  • [27] Sinha S,Vasulu TS,De RK.(2009) Performance and evaluation of microRNA gene identification tools. Journal of Proteomics & Bioinformatics, 2, 336–343.
  • [28] Yousef M,Ketany M,Manevitz L,Showe LC,Showe MK.(2009) Classification and biomarker identification using gene network modules and support vector machines. BMC Bioinformatics, 10: 337.
  • [29] Liang Y,Zhang F,Wang J,Joshi T,Wang Y,Xu D. (2011) Prediction of drought-resistant genes in arabidopsis thaliana Using SVM-RFE. PLoS ONE, 6(7): e21750.
  • [30] Chen ZY,Li JP,Wei LW,Xu WX,Shi Y.(2011) Multiple-kernel SVM based multiple-task oriented data mining system for gene expression data analysis. Expert Systems with Applications, 38, 12151-12159.
  • [31] Liu YC,Guo JT,Hu GQ,Zhu HQ.(2013) Gene prediction in metagenomic fragments based on the SVM algorithm. BMC Bioinformatics, 14(Suppl 5):S12.
  • [32] Lu TP,Hsu YY,Lai LC,Tsai MH,Chuang EY.(2014) Identification of gene expression biomarkers for predicting radiation exposure. Scientific Reports, 4 : 6293.
  • [33] Maji S,Garg D.(2014)Hybrid approach using SVM and MM2 in splice site junction identification. Current Bioinformatics, 9, 76–85 .
  • [34] Zhang C,Yeung P,Beviglia L,Cancilla B,Tang T,Yen WC,Gurney A,Lewicki J,Hoey T,Kapoun AM.(2014) Predictive biomarker identification for response to vantictumab (OMP-18R5; anti-Frizzled) by mining gene expression data of human breast cancer xenografts. Cancer Res, 74(19 Suppl): Abstract nr 2830.
  • [35] Hoff KJ,Tech M,Lingner T,Daniel R, Morgenstern B, Meinicke P. (2008) Gene prediction in metagenomic fragments: a large scale machine learning approach. BMC Bioinformatics, 9:217.
  • [36] Cortes C,Vapnik V.(1995) Support-vector networks. Machine Learning, 20(3), 273–297.
  • [37] Hsu CW,Lin CJ.(2002) A Comparison of Methods for Multiclass Support Vector Machines. IEEE Transactions on Neural Networks,13, 415–425.
  • [38] Duan KB, Keerthi SS. (2005) Which is the best multiclass SVM method? An empirical study. Multiple Classifier Systems, LNCS 3541. 278–285.
  • [39] Chang CC,Lin CJ.(2011)LIBSVM : a library for support vector machines. ACM Transactions on Intelligent Systems and Technology,2, 27:1–27:27.
  • [40] McHardy AC,Martín HG,Tsirigos A,Hugenholtz P,Rigoutsos I.(2007) Accurate phylogenetic classification of variable-length DNA fragments. Nature methods, 4(1), 63–72.
  • [41] Chan CKK,Hsu AL,Halgamuge SK,Tang SL.(2008) Binning sequences using very sparse labels within a metagenome. BMC Bioinformatics, 9:215.
  • [42] Chikhi R,Medvedev P.(2014)Informed and automated k-mer size selection for genome assembly. Bioinformatics, 30(1), 31–37.
  • [43] Wu YW,Tang YH,Tringe SG,Simmons BA,Singer SW.(2014) MaxBin: an automated binning method to recover individual genomes from metagenomes using an expectation-maximization algorithm. Microbiome, 2:26.
  • [44] Powers DMW.(2011) Evaluation: from precision, recall and F-Factor to ROC, informedness, markedness & correlation. Journal of Machine Learning Technologies, 2(1), 37–63.
  • [45] Fawcett T.(2006) An introduction to ROC analysis. Pattern Recognition Letters, 27(8), 861–874.
  • [46] Liu JW, Qian MP.(2011) Protein function prediction using kernel logistic regression with ROC curves. Computing and Intelligent Systems, 491-502. Springer Berlin Heidelberg.
  • [47] Jelizarow M,Guillemot V,Tenenhaus A,Strimmer K,Boulesteix AL. (2010) Over-optimism in bioinformatics: an illustration.Bioinformatics, 26(16), 1990–1998.
  • [48] Broberg P.(2003) Statistical methods for ranking differentially expressed genes. Genome Biology, 4:R41.
  • [49] Boulesteix AL,Slawski M.(2009)Stability and aggregation of ranked gene lists. Briefings in Bioinformatics, 10(5), 556-568.
  • [50] Schwartz R,Chow YL.(1990) The N-best algorithms: an efficient and exact procedure for finding the N most likely sentence hypotheses. ICASSP-90, Apr.3-6,1990, Albuquerque, NM. 1,81-84.
  • [51] Pusateri E,Thong JMV.(2001) N-best list generation using word and phoneme recognition fusion. in Dalsgaard P, Lindberg B, Benner H, Tan ZH (eds), INTERSPEECH, ISCA. 1817–1820 .
  • [52] Williams JD,Balakrishnan S.(2009) Estimating probability of correctness for ASR N-best lists. in Healey PGT,Pieraccini R,Byron DK,Young S,Purver M (eds) SIGDIAL Conference, The Association for Computational Linguistics, 132–135.
  • [53] Huang K,Brady A,Mahurkar A,White O,Gevers D,Huttenhower C,Segata N.(2014) MetaRef: a pan-genomic database for comparative and community microbial genomics. Nucleic Acids Research, 42, Database issue D617–D624.
  • [54] Liu JW.(2014) Statistical analysis of microbial genome sequence on MetaRef [E-letter]. Nucleic Acids Research(Dec.4,2014).
  • [55] Futuyma DJ.(2013) Evolution (3rd ed.). Sinauer Associates, Inc, Sunderland, Massachusetts.
  • [56] Lande R,Arnold SJ.(1983)The measurement of selection on correlated characters. Evolution, 37, 1210–1226.