1 Introduction
Since the 1990s, the whole genomes of a large number of species have been sequenced by their corresponding genome sequencing projects. In 1995, the first freeliving organism Haemophilus influenzae was sequenced by the Institute for Genomic Research pmid7542800 . In 1996, the first eukaryotic genome (Saccharomyces cerevisiase) was completely sequenced yeastGenome . In 2000, the first plant genome, Arabidopsis thaliana, was also sequenced by Arabidopsis Genome Initiative plantGenome . In 2004, the Human Genome Project (HGP) announced its completion pmid15496913 . Following the HGP, the Encyclopedia of DNA Elements (ENCODE) project was started, revealing massive functional putative elements on the human genome in 2011 pmid22955616 . The drastically decreasing cost of sequencing enables the 1000 Genomes Project to be carried out, resulting in an integrated map of genetic variation from 1,092 human genomes published in 2012 pmid23128226 . The massive genomic data generated by those projects impose an unforeseen challenge for largescale data analysis at the scale of gigabytes or even terabytes.
Computational methods are essential in analyzing the massive genomic data. They are collectively known as bioinformatics or computational biology; for instance, motif discovery compSurvey helps us distinguish real signal subsequence patterns from background sequences. Multiple sequence alignment blast can be used to study the similarity between multiple sequences. Protein structure prediction proteinstructurepredictoin ; wong2010protein can be applied to predict the 3D tertiary structure from an amino acid sequence. Gene network inference geneNetwork are the statistical methods to infer gene networks from correlated data (e.g. microarray data). Promoter prediction promoterPrediction help us annotate the promoter regions on a genome. Phylogenetic tree inference phylogeneticInference can be used to study the hierarchical evolution relationship between different species. Drug scheduling drugScheduling ; wong2012evolutionary can help solve the clinical scheduling problems in an effective manner. Although the precision of those computational methods is usually lower than the existing wetlab technology, they can still serve as useful preprocessing tools to significantly narrow search spaces. Thus prioritized candidates can be selected for further validation by wetlab experiments, saving time and funding. In particular, unsupervised learning methods are essential in analyzing the massive genomic data where the ground truth is limited for model training. Therefore, we describe and review several unsupervised learning methods for genome informatics in this chapter.
2 Unsupervised Learning for DNA
In human and other eukaryotes, gene expression is primarily regulated by the DNA binding of various modulatory transcription factors (TF) onto cisregulatory DNA elements near genes. Binding of different combinations of TFs may result in a gene being expressed in different tissues or at different developmental stages. To fully understand a gene’s function, it is essential to identify the TFs that regulate the gene and the corresponding TF binding sites (TFBS). Traditionally, these regulatory sites were determined by laborintensive experiments such as DNA footprinting or gelshift assays. Various computational approaches have been developed to predict TF binding sites in silico. Detailed comparisons can be found in the survey by Tompa et al. Tompa . TFBS are relatively short (1020 bp) and highly degenerate sequence motifs, which makes their effective identification a computationally challenging task. A number of highthroughput experimental technologies were developed recently to determine proteinDNA binding such as protein binding microarray (PBM) pmid16998473 , chromatin immunoprecipitation (ChIP) followed by microarray or sequencing (ChIPChip or ChIPSeq) chipchip ; chipseq , microfluidic affinity analysis pmid20802496 , and protein microarray assays pmid19879846 ; pmid16785442 .
On the other hand, it is expensive and laborious to experimentally identify TFTFBS sequence pairs, for example, using DNA footprinting Footprint , gel electrophoresis Gel_shift , and SELEX pmid2200121 . The technology of Chromatin immunoprecipitation (ChIP) chipchip ; chipseq measures the binding of a particular TF to the nucleotide sequences of coregulated genes on a genomewide scale in vivo, but at low resolution. Further processing are needed to extract precise TFBSs pmid12101404 . Protein Binding Microarray (PBM) was developed to measure the binding preference of a protein to a complete set of kmers in vitro pmid16998473 . The PBM data resolution is unprecedentedly high, comparing with the other traditional techniques. The DNA kmer binding specificities of proteins can even be determined in a single day. It has also been shown to be largely consistent with those generated by in vivo genomewide location analysis (ChIPchip and ChIPseq) pmid16998473 .
To store and organize the precious data, databases have been created. TRANSFAC is one of the largest databases for regulatory elements including TFs, TFBSs, weight matrices of the TFBSs, and regulated genes TRANSFAC06 . JASPAR is a comprehensive collection of TF DNAbinding preferences JASPAR2010 . Other annotation databases are also available (e.g. Pfam Pfam2004 , UniProbe pmid21037262 , ScerTF ScerTF , FlyTF FlyTF , YeTFaSCo pmid22102575 , and TFcat TFcat ). Notably, with the opensource and openaccess atmosphere widespreads on the Internet in recent years, a database called ORegAnno appeared in 2008 ORegAnno . It is an openaccess communitydriven database and literature curation system for regulatory annotation. The ENCODE consortium has also released a considerable amount of ChIPSeq data for different DNAbinding proteins pmid22955616 .
In contrast, unfortunately, it is still difficult and timeconsuming to extract the highresolution 3D proteinDNA (e.g. TFTFBS ) complex structures with XRay Crystallography xray or Nuclear Magnetic Resonance (NMR) spectroscopic analysis NMR . As a result, there is strong motivation to have unsupervised learningl methods based on existing abundant sequence data, to provide testable candidates with high confidence to guide and accelerate wetlab experiments. Thus unsupervised learning methods are proposed to provide insights into the DNA binding specificities of transcription factors from the existing abundant sequence data.
2.1 DNA Motif Discovery and Search
Transcription Factor Binding Sites (TFBSs) are represented in DNA motif models to capture its sequence degeneracy Wong:2011:GLP:2093590.2093604 . They are described in the following sections.
Representation (DNA Motif Model)
There are several motif models proposed. For example, consensus string representation, a set of motif instance strings, count matrix, position frequency matrix (PFM), and position weight matrix (PWM). Among them, the most popular motif models are the matrix ones. They are the count matrix, PFM, and PWM. In particular, the most common motif model is the zeroorder PWM which has been shown to be related to the average proteinDNA binding energy in the experimental and statistical mechanics study pmid3612791 . Nonetheless, it assumes independence between different motif positions. A recent attempt has been made to generalize PWM but the indel operations between different nucleotide positions are still challenging alllevelcomplexity
. Although the column dependence and indel operations could be modeled by Hidden Markov Model (HMM) simultaneously, the number of training parameters is increased quadratically. There is a dilemma between accuracy and model complexity.
Count Matrix
The count matrix representation is the de facto standard adopted in databases. In the count matrix representation, a DNA motif of width is represented as a by matrix . The th column of corresponds to the th position of the motif, whereas the th row of correspond to the th biological character. In the context of DNA sequence, we have 4 characters A,C,G,T. is the occurring frequency of the biological character at the th position. For example, the count matrix of the SOX9 protein (JASPAR ID:MA0077.1 and UniProt ID:P48436) is tabulated in the following matrix form. The motif width is 9 so we have a matrix here. The corresponding sequence logo is also depicted in Figure 1.
Notably, adenine has not been found at the 4th position, resulting in a zero value. It leads to an interesting scenario. Is adenine really not found at that position or the sample size (the number of binding sites we have found so far for SOX9) is too small that the sites having adenine at that position have not been verified experimentally yet ? To circumvent the problem, people usually add pseudo counts to the matrix which is justified from the use of prior probability in statistics
citeulike:163532 ; pmid19106141. Such techniques are also found in natural language computing and machine learning.
Position Frequency Matrix (PFM)
In practice, a count matrix is usually converted to PFM, and thus a zeroorder PWM for scanning a long sequence. The dimension and layout of count matrix is exactly the same as those of the corresponding PFM and zeroorder PWM. Their main difference is the element type. For count matrix, each element is simply a count. For PFM, each element is a Maximum Likelihood Estimate (MLE) parameter. For zeroorder PWM, each element is a weight.
To derive a PFM from a count matrix, , maximum likelihood estimation (MLE) is used mosesSurvey . Mathematically, we aim at maximizing the likelihood function . In addition, we impose a parameter normalization constraint for each position. It is added to the likelihood function with a Lagrange multiplier , resulting in a new log likelihood function:
By taking its partial derivatives to zero, it has been shown that . The MLE parameter definition is quite intuitive. It is simply the occurring fraction of a nucleotide at the same position. For example, given the previous SOX9 count matrix , we can convert it to a PFM as follows:
where a pseudocount = 1 is added to each element of .
We can observe that the most invariant positions of the SOX9 motif are the 4th and 6th position. At the 4th position, cytosines have been found most of the times while guanines and thymines have just been found few times.
Position Weight Matrix (PWM)
To scan a long sequence for motif matches using a PFM , we need to derive a PWM first so that the background distribution can be taken into account. As aforementioned, each element of PWM is a weight. Each weight can be viewed as a preference score. In practice, it is usually defined as the log likelihood ratio between the motif model and background model. Mathematically, where is the occurring fraction of the th nucleotide in all the background sequences such that, given a subsequence of the same width as the PWM (width ), we can compute a score by summation only:
(1)  
(2)  
(3)  
(4)  
(5) 
where is the numeric index for the nucleotide at th position of the subsequence . For example, given the previous SOX9 PFM , we can convert it to a PWM as follows:
where we have assumed that the background distribution is uniform (i.e. for ) for illustration purposes.
Learning (Motif Discovery)
In general, motif discovery aims at building motif models (e.g. PFM) from related sequences. Nonetheless, there is a variety of motif discovery methods in different biological settings. From a computing perspective, they can be classified into several paradigms by its input data types:

A Set of Sequences

A Set of Sequences with Quantitative Measurements

A Set of Orthologous Sequences
Motif Discovery for A Set of Sequences
The most classical one is de novo motif discovery which just takes a set of sequences as the inputs. The set of sequences is extracted such that a common transcription factor is believed to bind to them, assuming that motif models (e.g. consensus substrings) can be found from those sequences. For example, the promoter and enhancer sequences of the genes coregulated by a common transcription factor or the sequence regions around the next generation parallel sequencing peaks called for a common transcription factor. Theoretically, Zia and Moses have proved a theoretical upper bound on the pvalue which at least one motif with a specific information content occur by chance from background distribution (false positive) for the oneoccurrence per sequence motif discovery problem pmid22738169 .
Chan et al. applied evolutionary computation techniques to the problem
pmid18065426 ; kcwong:EASE . Hughes et al. proposed a Gibbs sampling algorithm called AlignACE, to sample and evaluate different possible motif models using a priori log likelihood scores pmid10698627. Workman et al. have proposed a machine learning approach using artificial neural networks (ANNSpec)
pmid10902194 . Hertz et al. utilized the maximal information content principle to greedily search for a set of candidate sequences for building motif models (Consensus) pmid10487864 . Frith et al. have adopted simulated annealing approaches to perform multiple local alignment for motif model building (GLAM) pmid14704356. Ao et al. have used expectation maximization to determine DNA motif position weight matrices (Improbizer)
pmid15375261 . Bailey et al. have proposed MEME to optimize the expected value of a statistic related to the information content of motif models pmid7584439 . A parameter enumeration pipeline wrapping MEME (MUSI) was proposed for elucidating multiple specificity binding modes pmid22210894 . Eskin et al. employed a tree data structure to find composite weak motifs (MITRA) pmid12169566 . Thijs et al. have further improved the classic Gibbs sampling method and called it MotifSampler pmid11751219 . Van Helden et al. have proposed a counting algorithm to detect statistically significant motifs pmid9719638 . Regnier and Denise have proposed an exhaustive search algorithm (QuickScore) regnier04:_rare. Favorov et al. have utilized Markov Chain Monte Carlo to solve the problem in a Bayesian manner (SeSiMCMC)
pmid15728117 . Pavesi et al. have proposed an exhaustively enumerated and consensusbased method (Weeder) pmid15215380. Another exhaustive search algorithm to optimize zscores (YMF) has been proposed by Sinha and Tompa
pmid12824371 .Although different statistical techniques have been developed for the motif discovery problem, most of the existing methods aim at building motif models in the form of either a set of strings or a zeroorder PWM. Nonetheless, it is well known that nucleotide dependencies and indel operations exist within some TFBSs (a.k.a. motifs) pmid17308339 ; pmid18184687 . It is desirable to develop new methods which can capture and model such information.
Motif Discovery for A Set of Sequences with Quantitative Affinity Measurements
It has been pointed out that a fundamental bottleneck in TFBS identification is the lack of quantitative binding affinity data for a large portion of transcription factors. The advancement of new highthroughput technologies such as ChIPChip, ChIPSeq, and Protein Binding Microarray (PBM) has made it possible to determine the binding affinity of these TFs (i.e. each sequence can be associated with a binding intensity value) pmid19443739 . In particular, the PBM technology can enable us to enumerate all the possible kmers, providing an unprecedentedly high resolution binding site affinity landscape for each TF. In light of this deluge of quantitative affinity data, robust probabilistic methods were developed to take into account those quantitative affinity data. Seed and Wobble has been proposed as a seedbased approach using rank statistics pmid16998473 . RankMotif++ was proposed to maximize the log likelihood of their probabilistic model of binding preferences pmid17646348 . MatrixREDUCE was proposed to perform forward variable selections to minimize the sum of squared deviations pmid16317069 . MDScan was proposed to combine two search strategies together, namely word enumeration and positionspecific weight matrix updating pmid12101404 . PREGO was proposed to maximize the Spearman rank correlation between the predicted and the actual binding intensities pmid16809671 . Notably, Wong et al. have proposed and developed a hidden Markov model approach to learn the dependence between adjacent nucleotide positions rigorously; they also show that their method (kmerHMM) can deduce multiple binding modes for a given TF pmid23814189 .
Note that this paradigm is a generalization from the motif discovery for a set of sequences with binary measurements. For example, SeedSearch Barash:2001:SHA:645906.673098 and DME pmid15668401 . In other words, it also includes the discriminative motif discovery method in which a set of motifcontaining sequences and a set of background sequences are given as the input since we can assign a value of 1 to each motifcontaining sequence and 0 to each background sequence.
Motif Discovery for A Set of Orthologous Sequences
It is generally acknowledged that by comparing evolutionarily related DNA or protein sequences, functionally important sequences or motifs can be revealed by such comparison. A functional motif is assumed to be more conserved across different species than the background sequences 15534694 . By incorporating the evolutionary conservation with sequencespecific DNA binding affinity, different methods have been proposed. Moses et al. have proposed an extension of MEME to take the sequence evolution into account probabilistically pmid14992514 . Kellis et al. have proposed a spaced hexamer enumeration approach to identify conserved motifs pmid15285895 . FootPrinter has been proposed as a substringparsimony based approach using dynamic programming to find statistically enriched motif substrings pmid12015878 . A Gibbs sampling approach named PhyloGibbs has also been proposed pmid16477324 .
Prediction (Motif Search)
After a motif model has been found, it is always desirable to apply it to search for motif instances over a given sequence (e.g. ChIPSeq peak sequences). Some basic search methods have been developed to search motif instances over a sequence. Nonetheless, those methods do not have sufficient motif model complexity to distinguish false positives from true positives over a long sequence (e.g. 100k bp) pmid15131651 . To cope with that, some improvements have been made. In general, most of them utilize the biological information beyond the motif sequence specificity to augment the motif model complexity insufficiency. In particular, multiple motif information and evolutionary conservation have been readily adopted to improve the discovery accuracy mosesSurvey .
Basic Searches
Likelihood Ratio
Given a sequence of length and a PWM of the motif of width , we scan with a window of width such that the subsequences which likelihood ratio score is higher than a prespecified threshold are considered the instances (hits) of the motif . Mathematically, a subsequence is considered as a motif instance (hit) if and only if the following condition is satisfied:
where is the th nucleotide among {A,C,G,T} and is the Iverson bracket. Nonetheless, different motifs may have different likelihood score distributions. It is difficult to set a single and fixed threshold which can work for all the motifs. To solve the problem, one can normalize the score to the interval [0,1] based on the maximal and minimal scores as follows:
Posterior Ratio
If we know the prior probability of motif occurrence , it can also be incorporated into the scoring function in a posterior manner mosesSurvey . Mathematically, given a sequence , motif model (including PFM and PWM ), and background distribution , we can compute the posterior ratio as follows:
(6)  
(7)  
(8)  
(9)  
(10)  
(11) 
It can be observed that the posterior ratio can be computed from the likelihood ratio by simply adding the logarithm of the prior probability ratio.
Pvalue
Given the previous scoring functions, it is not easy to set a threshold since they are just ratios. For example, if in the above example, it just means the likelihood that the sequence is generated by the motif model is higher than the background and vice versa. To justify it in a meaningful way, Pvalue distribution can be calculated from a motif model. Given a motif PWM of width , an exhaustive search can be applied to traverse all the possible sequences of width . Nonetheless, it takes time complexity for the DNA alphabet . Interestingly, if the PWM is of zeroorder, we can exploit the column independence assumption and apply dynamic programming to calculate the exact Pvalue distribution in time complexity pmid2720468 . In practice, the empirical Pvalue distribution may also be used.
Nonetheless, the specificity of a PWM of width is still not high if it is applied to a very long sequence of length . Mathematically, even if we just assign the best match as the hit, hits are still expected (e.g. If and , hits are expected), assuming that the sequence is uniform in background nucleotide distribution. To solve the problem, people have spent efforts on incorporating more biological information to improve the motif search.
Incorporating Multiple Motif Information
To improve motif search, multiple motif information can be incorporated. Multiple motif sites are usually clustered together, resulting in higher signaltonoise ratios which can be easier to be detected than alone. If multiple sites of the same motif are clustered together within a short distance, it is called homotypic clustering pmid12670999 . On the other hand, if multiple sites of different motifs are clustered together within a short distance, it is called heterotypic clustering.
To exploit the additional clustering signals beyond sequence specificity, MAST was proposed to multiply the Pvalues of multiple motif matches (hits) together, which has demonstrated superior performance in sequence homology search than the other two methods proposed in the same study pmid9672829 . CISANALYST was proposed as a sliding window approach to predict the windows which have at least motif matches (hits) with pvalues pmid11805330 . Sinha et al. have proposed a probabilistic model, Stubb, to efficiently detect clusters of binding sites (i.e. cisregulatory modules) over genomic scales using maximum likelihood estimation pmid16845069 . To determine the window size parameter, a window size adjustment procedure has been used in ClusterBuster to find clusters of motif matches pmid12824389 . Segal et al. have also derived an expectation maximization algorithm to model the clusters of motif matches as probabilistic graphical models pmid12855470 . Recently, Hermann et al. have proposed an integrative system (icisTarget) to combine the highthroughput next generation sequencing data with motif matches to provide accurate motif cluster search pmid22718975 . Notably, Zhou and Wong have shown that it is possible to search for clusters of motifs in a de novo way (i.e. without any given motif model and information) pmid15297614 .
Incorporating Evolutionary Conservation
Another approach to improve motif search is to incorporate evolutionary conservation. The rationale behind that is similar to that behind phylogenetic motif discovery which we have described in a previous section. Deleterious mutations will be removed from population by negative selection. To make use of that fact, we could imply that a true motif match should be more conserved across closely related species (For example, chimpanzee and mouse) than background sequences 15534694 . For instance, a windowing approach with several thresholds for motif matches and conservation, ConSite, was proposed by Sandelin et al. pmid15215389 . Nonetheless, it is limited to pairwise analysis. rVISTA is a similar approach pmid16888351 using the Match program for motif matching in TRANSFAC TRANSFAC06 . Bayesian Branch Length Score (BBLS) was proposed as a evolutionary conservation score without relying on any multiple sequence alignment pmid19017655 . A parsimonious method for finding statistically significant kmers with dynamic programming was proposed (FoorPrinter) pmid11997340 . Notably, Moses et al. proposed a comprehensive probabilistic model to search for motif instances with efficient pvalue estimation (MONKEY) pmid15575972 .
To search for novel motif instances, there are programs aimed at searching for motif matches without any given motif model and information. For instance, Ovcharenko et al. have used likelihood ratio tests to distinguish conserved regions from the background pmid15173121 . Siepel et al. have demonstrated an approach in identifying conserved regions using hidden Markov models called PhastCons pmid16024819 .
Incorporating Both Approaches
Both motif clustering information and evolutionary conservation were demonstrated beneficial to motif search. Since they are independent of each other, it is straightforward to combine them. Philippakis et al. have proposed a method to combine both types of information (i.e. motif clustering and evolutionary conservation), achieving good performance on experimentally verified datasets pmid15759656 . MONKEY has been extended by Warner et al. to exploit the motif clustering information to predict motif clusters (PhylCRM) pmid18311145 . It has been reported that the misalignment errors of the input reference sequences from other species could affect the quality of phylogenetic footprinting for motif search. Thus statistical alignments have been used to assist motif search in EMMA pmid19293946 . Stub has also been extended to StubMS to take multiple species conservation into account using a HMM phylogenetic model pmid12855472 . Notably, a unified probabilistic framework which integrates multiple sequence alignment with binding site predictions, MORPH, was proposed by the same group pmid17997594 . Its effectiveness has been demonstrated and verified in an independent comparison study pmid21152003 .
2.2 Genomewide DNA Binding Pattern Discovery
Chromatin immunoprecipitation (ChIP) followed by highthroughput sequencing (ChIPSeq) measures the genomewide occupancy of transcription factors in vivo. In a typical ChIPSeq study, the first step is to call the peaks, i.e. determining the precise location in the genome where the TF binds. A number of peak calling tools have been developed; for instance, modelbased analysis of ChIPSeq data (MACS) was proposed to model the shift size of ChIPSeq tags and local biases to improve its peakcalling accuracy pmid18798982 . Spp is another method with a strong focus on background signal correction pmid19029915 . PeakSeq is a twopass strategy method. The first pass accounts for the sequence mappability while the second pass is to filter out statistically insignificant regions comparing to controls pmid19122651 . CisGenome refines peak boundaries and uses a conditional binomial model to identify peak regions pmid18978777 . However, recent benchmark studies suggest that their predicted peaks are distinct from each other pmid20628599 ; pmid20017957 .
Different combinations of DNAbinding protein occupancies may result in a gene being expressed in different tissues or at different developmental stages. To fully understand a gene’s function, it is essential to develop unsupervised learning models on multiple ChIPSeq profiles to decipher the combinatorial regulatory mechanisms by multiple transcription factors.
Since multiple transcription factors often work in cis regulatory modules to confer complex gene regulatory programs, it is necessary to develop models on multiple ChIPSeq datasets to decipher the combinatorial DNAbinding mechanism. In the following, we briefly review some of the previous works in this area. Gerstein et al. used pairwise peak overlapping patterns to construct a human regulatory network pmid22955619
. Xie et al. proposed self organizing map methods to visualize the colocalization of DNAbinding proteins
pmid24243024 . Giannopoulou et al. proposed a nonnegative matrix factorization to elucidate the clustering of DNAbinding proteins pmid23554462 . Zeng and colleagues proposed jMOSAiCS to discover histone modification patterns across multiple ChIPSeq datasets pmid23844871 . Ferguson et al. have described a hierarchical Bayes approach to integrate multiple ChIPSeq libraries to improve DNA binding event predictions. In particular, they have applied the method to histone ChIPSeq libraries and predicted the gene locations associated with the expected pathways ferguson2012new . Mahony et al. also proposed a mixture model (MultiGPS) to detect differential binding enrichment of a DNAbinding protein in different cell lines, which can improve the protein’s DNA binding location predictions (i.e. Cdx2 protein in their study) mahony2014integrated . On the other hand, Chen et al. proposed a statistical framework (MMChIP) based on MACS to perform an integrative analysis of multiple ChIP datasets to predict ChIPenriched regions with known motifs for a given DNAbinding protein (i.e. ER and CTCF proteins in their study) chen2011mm. On the other hand, Ji et al. proposed a differential principal component analysis method on ChIPSeq to perform unsupervised pattern discovery and statistical inference to identify differential proteinDNA interactions between two biological conditions
ji2013differential . Guo et al. described a generative probabilistic model (GEM) for high resolution DNA binding site discovery from ChIP data guo2012high . Interestingly, that model combines ChIP signals and DNA motif discovery together to achieve precise predictions of the DNA binding locations of a DNAbinding protein. The authors have further demonstrated how GEM can be applied to reveal spatially constrained transcription factor binding site pairs on a genome.Despite the success of the methods described above, to fully understand a gene’s function, it is essential to develop probabilistic models on multiple ChIPSeq profiles to decipher the genomewide combinatorial patterns of DNAbinding protein occupancy. Unfortunately, the majority of the previous work usually focused on largescale clustering of called peaks, which is an intuitive and straightforward approach. However such approaches have two limitations, as (i) peakcalling ignores the contributions from weak bindings of TFs, and (ii) pairwise analysis ignores the complex combinatorial binding pattern among the TFs. Thus an unsupervised learning model called SignalSpider has been proposed to directly analyze multiple normalized ChIPSeq signal profiles on all the promoter and enhancer regions quantitatively so that weak bindings can be taken into account wong2014signalspider ; pmid22955978 . Especially, its computational complexity has been carefully designed to scale with the increasing ChIPSeq data (i.e. linear complexity). With such a linear complexity, the method (SignalSpider) has been successfully applied to more than 100 ChIPSeq profiles in an integrated way, revealing different genomewide DNAbinding modules across the entire human genome (hg19) wong2014signalspider .
3 Unsupervised Learning for inferring microRNA regulatory network
While transcription factors (TFs) are the major transcriptional regulator proteins, microRNA (miRNA), a small 22 nucleotide noncoding RNA species, has been shown to play a crucial role in posttranscriptional and/or translational regulation Bartel:2009fh . Since the 1993 discovery of the first miRNA let7 in worms, a vast amount of studies have been dedicated to functionally characterizing miRNAs with a special emphasis on their roles in cancer. While TFs can serve either as a transcriptional activator or as a repressor, miRNAs are primarily known to confer mRNA degradation and/or translational repression by forming imperfect basepair with the target sites primarily at the 3 untranslated regions of the messenger RNAs Guo:2010kh . While miRNAs are typically 22 nt long, several experimental studies combined with computational methods Lewis:2003uz ; Doench:2004ct ; Kiriakidou:2004kc ; Burgler:2005ge ; Lewis:2005cb ; Alexiou:2009jq ; Yue:2009jt have shown that only the first six or seven consecutive nucleotides starting at the second nucleotide from the 5 end of the miRNA are the most crucial determinants for target site recognition (Figure 2). Accordingly, the 6mer or 7mer close to the 5 region of the miRNA is termed as the “seed” region or seed. miRNAs that share common seeds belong to an miRNA family as they potentially target a vastly common set of mRNAs. Moreover, the target sites at the 3UTR WatsonCrick (WC) pairing with the miRNA seed are preferentially more conserved within mammalian or among all of the vertebrate species Lewis:2003uz . In humans, more than one third of the genes harbour sites under selective pressure to maintain their pairing to the miRNA seeds Lewis:2003uz ; Grimson:2007cy . An important variation around this seedtarget pairing scheme was discovered by Lewis et al. (2005), where the target site is flanked by a conserved adenosine ‘A’ facing to the first nucleotide of the targeting miRNA Lewis:2005cb .
The dynamics of the miRNA regulatory network are implicated in various phenotypic changes including embryonic development and many other complex diseases Song:2006hm ; Croce:2009ff . Although abnormal miRNA expression can sometimes be taken as a stronger indicator of carcinoma in clinical samples than aberrant mRNA expression Ramaswamy:2001hc ; Lu:2005hx , the system level mechanistic effects are usually unclear. A single miRNA can potentially target 400 distinct genes, and there are thousands of distinct endogenous miRNAs in the human genome. Thus, miRNAs are likely involved in virtually all biological processes and pathways including carcinogenesis. However, functional characterizing miRNAs hinges on the accurate identification of their mRNA targets, which has been a challenging problem due to imperfect basepairing and conditionspecific miRNA regulatory dynamics. In this section, we discuss the current stateofart approaches in referring miRNA or miRNAmediated transcriptional regulatory network. Table 1 summarizes these methods. As we will see, each method is established through an effective unsupervised learning model by exploiting the static sequencebased information pertinent to the prior knowledge of miRNA targeting and/or the dynamic information of miRNA activities implicated by the recently available large data compendia, which interrogate genomewide expression profiles of miRNAs and/or mRNAs across various cell conditions.
mRNA5[ORF]  
Method  Algorithm  Section  Ref 

PicTar  Hidden Markov Model  3.1  Krek:2005er 
TargetScore  Variational Bayesian Mixture Model  3.2  Li:2013wk 
GroupMiR  Nonparametric Bayesian with Indian Buffet Process  3.3  Le:2011wx 
SNMNMF  Constrained nonnegative matrix factorization  3.4  Zhang:2011ce 
Mirsynergy  Deterministic overlapping neighbourhood expansion  3.5  Li:2014fa 
PicTar: Probabilistic identification of combinations of Target sites; GroupMiR: Group MiRNA target prediction; SNMNMF: Sparse Networkregularized Multiple Nonnegative Matrix Factorization; PIMiM: Protein Interactionbased MicroRNA Modules.
3.1 PicTar
PicTar (Probabilistic identification of combinations of Target sites) is one of the few models that rigorously considers the combinatorial miRNA regulations on the same target 3UTR Krek:2005er . As an overview, PicTar first prefilters target sites by their conservation across select species. However, the fundamental framework of PicTar is based on hidden Markov model (HMM) with a maximum likelihood (ML) approach, which is built on the logics of several earlier works from Siggia group Durbin:1998wz ; Bussemaker:2000dl ; Sinha:2003ui ; Rajewsky:2002tt . Among these works, PicTar was most inspired by “Ahab”, an HMMbased program developed (by the same group) to predict the combinatorial TF binding sites (TFBS) Rajewsky:2002tt . Although PicTar has been successfully applied to three studies on vertebrates Krek:2005er (where the original methodology paper was described), fly Grun:2005ix , and worm Lall:2006hh (where some improvements were described), the description of the core HMM algorithm of PicTar is rather brief. Here we will lay out the detailed technicality of the algorithm based on the information collected from several related works Durbin:1998wz ; Bussemaker:2000dl ; Sinha:2003ui ; Rajewsky:2002tt ; MBishop:2006uc , which will help highlight its strengths, limitations, and possible future extensions.
Let be a 3UTR sequence, the length of , and the target sites for miRNA “word” of length , the transition probability of the occurrence of miRNA , and the transition probability for the background of length , which is simply estimated from the fraction of A, U, G, C (i.e., Markov model of order 0) either in with length 300 nt or from all query UTRs. To simplify notation, the background letters are treated as a special word so that and . Thus, can be represented by multiple different ways of concatenating the segments corresponding to either miRNA target sites or background. The goal is to obtain at any arbitrary nucleotide position of the 3UTR sequence
that is the last position of the word, where is the model parameters controlling emission probabilities (see below).Following Markov’s assumption, is proportional to the products of the probabilities before and after position , which can be computed in time by ForwardBackward algorithm as described below. Formally,
(12)  
(13)  
(14)  
(15) 
where is the likelihood of sequence or the objective function to be maximized in the ML framework, and can be represented in recursion forms and computed via forward and backward algorithm, respectively in time for words. Formally, the forward algorithm is derived as follows:
(16)  
(17)  
(18)  
(19)  
(20)  
(21) 
where is the emission probability assumed known (see below) and is the transition probability to word . Note that is position independent. To start the recursion, .
The backward algorithm is similarly derived:
(22)  
(23)  
(24)  
(25) 
To start the backward recursion, , which is simply the frequency of word in the 3UTR sequence .
Given the emission probabilities, ’s (transition probability) are the only parameters that need to be set in order to maximize . Following the ML solution,
(26) 
where the posterior is calculated by Eq (15), which is in turn computed by forwardbackward algorithm. Finally, the likelihood (objective) function is evaluated by a simple forward pass to the end of the sequence:
(27)  
(28) 
Together, the optimization of in PicTar is performed using BaumWelch algorithm for ExpectationMaximization (EM) as summarized in Algorithm 1. Finally, the PicTar score is defined as a log ratio () of the ML over background likelihood :
(29) 
where is the likelihood when only background is considered.
As previously mentioned, the emission probabilities in PicTar are assumed known. In Ahab for modelling TFBS, or is based on the position frequency matrix (PFM): , where is the normalized frequency of the nucleotide at the position of the PFM. However, miRNAs do not have PFM. The original PicTar arbitrarily sets to be 0.8 if there is perfect seedmatch at 17 or 28 nt positions of the miRNA 5 end AND the free binding energy as estimated by RNAhybrid is no less than 33% of the optimal binding energy of the entire miRNA sequence to the UTR Lorenz:2011it ; otherwise is set to 0.2 divided by for imperfect seed matches with only 1 mismatch allowed, provided it is above 66% of the optimal binding energy. Thus, the setting highly disfavours imperfect seed match. The later version of PicTar changes the emission probability calculation to be the total number of occurrences in conserved 3UTR sites divided by the total number of sites 3UTR Lall:2006hh . The setting appears to improve the sensitive/specificity of the model but makes it more dependent on the crossspecies conservation, potentially prone to false negatives (for the nonconserved but functional sites).
The major advantage of PicTar over other simpler methods is that the coordinate actions of the miRNAs (synergistic in case of optimally spaced sites or antagonistic in case of overlapping binding sites) are naturally captured within the emission and transition probabilities. For instance, the as the joint ML of multiple miRNA target sites will be higher than the linear sum of individual miRNA target sites (i.e., synergistic effects). Longer 3UTR will score less than shorter 3UTR if both contain the same number of target sites. PicTar demonstrated a comparable signal:noise ratio relative to TargetScan and was compared favourably with some of the earlier published methods based on several surveys Alexiou:2009jq ; Yue:2009jt ; Sethupathy:2006df ; Li:2013wk . When applied to vertebrates, PicTar identified roughly 200 genes per miRNA, which is a rather conservative estimate compared to the recent findings by TargetScan with a conserved targeting scoring approach Friedman:2009km . When applied to C. elegans (worm), PicTar identified 10% of the C. elegans genes that are under conserved miRNA regulation.
Nonetheless, PicTar has three important limitations. First, PicTar does not consider the correlation between miRNA target sites since is essentially position independent. This is perhaps largely due to the increased model complexity when considering all pairwise transition probabilities between miRNAs and background since there will be parameters to model (as apposed to only ). Supported by Grimson:2007cy , however, the specific spatial arrangement of the target sites may be functionally important. In particular, the optimal distance between miRNA sites was estimated as 8 to
40 nt based on transfection followed by regression analysis
Grimson:2007cy . Although unsupported by experimental evidence, the ordering of some specific target sites may be also important. For instance, target site must be located before target site (for the same or different miRNA) to achieve optimal synergistic repression. The model that takes into account the spatial correlation between motifs is called the hcHMM (history conscious HMM) implemented in a program called Stubb for detecting TF binding sites (TFBS) rather than miRNA target predictions Sinha:2006bn .Second, the ML approach is prone to local optimal especially for long UTRs or many coordinated miRNA actions considered simultaneously (i.e., many ’s). An alternative HMM formulation is to impose Bayesian priors on the HMM parameters MacKay97ensemblelearning . In particular, Wu:2008fq
demonstrated such Bayesian formalism of HMM in modelling combinatorial TFBS. In their model socalled “module HMM”, the transition probabilities is assumed to follow a Dirichlet distribution with hyperparameters
, where corresponds to the background, to the TFs, and to the background inside of cisregulatory module. Inference is performed via Markov Chain Monte Carlo (MCMC) procedure or Gibbs sampling in particular. Briefly, a forward pass and backward pass are run to generate marginal probabilities at each nucleotide position. Starting at the end of the sequence, hidden states are sampled at each position based on the marginals until reaching the front of the sequence. Given the hidden states, the hyperparameters for Dirichlet distribution of the transition probabilities are then updated by simply counting the occurrences of each state. The posteriors of the hidden states at each nucleotide position are inferred as the averaged number of times the states are sampled in 1000 samplings, and the states with the maximum a posteriori (MAP) are chosen. The model is demonstrated to perform better than Ahab and Stubb (which are the basis for PicTar) in TFBS predictions but have not yet been adapted to miRNA target predictions.Third, since data for expression profiling of mRNAs and miRNAs by microarrays or RNAseq is now rather abundant (e.g., Babak:2004ec , GSE1738; GSE31568, Keller:2011jk ; GSE40499, Meunier:2012dv from GEO) or ENCODE (GSE24565, Djebali:2013hc ) or TCGA CancerGenomeAtlasResearchNetwork:2008gr ; Weinstein:2013jp , the combinatorial regulation needs to be revisited by taking into account whether or not the cooperative miRNAs are indeed expressed in vivo and/or the expression correlation between mRNA and miRNA. In particular, the emission probabilities need to be redefined to integrate both sequencebased and expressionbased information.
3.2 A probabilistic approach to explore human miRNA target repertoire by integrating miRNAoverexpression data and sequence information
One of the most direct way to query the targets of a given miRNA is by transfecting the miRNA into a cell and examine the expression changes of the cognate target genes Lim:2005cd . Presumably, a bona fide target will exhibit decreased expression upon the miRNA transfection. In particular, overexpression of miRNA coupled with expression profiling of mRNA by either microarray or RNAseq has proved to be a promising approach Lim:2005cd ; Arvey:2010ev . Consequently, genomewide comparison of differential gene expression holds a new promise to elucidate the global impact of a specific miRNA regulation without solely relying on evolutionary conservation. However, miRNA transfection is prone to offtarget effects. For instance, overexpressing a miRNA may not only repress the expression of its direct targets but also cause a cascading repression of indirect targets of the affected transcription activators. To improve prediction accuracy for direct miRNA targets, this chapter describes a novel model called TargetScore that integrates expression change due to miRNA overexpression and sequence information such as context score Grimson:2007cy ; Garcia:2011ec and other orthogonal sequencebased features such as conservation Friedman:2009km into a probabilistic score.
In one of our recent papers, we described a novel probabilistic method for miRNA target prediction problem by integrating miRNAoverexpression data and sequencebased scores from other prediction methods Li:2013wk
. Briefly, each score feature is considered as an independent observed variable, which is the input to a Variational BayesianGaussian Mixture Model (VBGMM). We chose a Bayesian over a maximum likelihood approach to avoid overfitting. Specifically, given expression foldchange (due to miRNA transfection), we use a threecomponent VBGMM to infer downregulated targets accounting for genes with little or positive foldchange (due to offtarget effects
Khan:2009ki ). Otherwise, twocomponent VBGMM is applied to unsigned sequence scores. The parameters for the VBGMM are optimized using Variational Bayesian ExpectationMaximization (VBEM) algorithm. The mixture component with the largest absolute means of observed negative foldchange or sequence score is associated with miRNA targets and denoted as “target component”. The other components correspond to the “background component”. It follows that inferring miRNAmRNA interactions is equivalent to inferring the posterior distribution of the target component given the observed variables. The targetScore is computed as the sigmoidtransformed foldchange weighted by the averaged posteriors of target components over all of the features.Bayesian mixture model
Assuming there are genes, we denote as the expression foldchange () or sequence scores (). Thus, for sets of sequence scores, . To simplify the following equations, we use to represent one of the independent variables without loss of generality. To infer target genes for a miRNA given , we need to obtain the posterior distribution of the latent variable , where =3 (=2) for modelling signed (unsigned) scores such as logarithmic foldchanges (sequence scores).
We follow the standard Bayesian GMM based on MBishop:2006p485 (p474482) with only minor modifications. Although univariate GMM () is applied to each variable separately, we implemented and describe the following formalism as a more general multivariate GMM, allowing modeling the covariance matrices. Briefly, the latent variables are sampled at probabilities (mixing coefficient), that follow a Dirichlet prior with hyperparameters . To account for the relative frequency of targets and nontargets for any miRNA, we set the (associated with the target component) to and other , where (by default). Assuming
follows a Gaussian distribution
, where (precision matrix) is the inverse covariance matrix, together follow a GaussianWishart prior , where the hyperparameters .Variational Bayesian Expectation Maximization
Let . The marginal log likelihood can be written in terms of lower bound
(first term) and KullbackLeibler divergence
(second term):(30) 
where is a proposed distribution for , which does not have a closed form distribution. Because is a constant, maximizing implies minimizing . The general optimal solution is the expectation of variable w.r.t other variables, . In particular, we define . The expectations for the three terms (at log scale), namely , have the same forms as the initial distributions due to the conjugacy of the priors. However, they require evaluation of the parameters , which in turn all depend on the expectations of or the posterior of interest:
(31) 
where . The interdependence between the expectations and model parameters falls naturally into an EM framework, namely VBEM. Briefly, we first initialize the model parameters based on priors and randomly sample data points . At the iteration, we evaluate (31) using the model parameters (VBE step) and update the model parameters using (31) (VBM step). The EM iteration terminates when improves by less than (default). Please refer to MBishop:2006p485 for more details.
TargetScore
We define the targetScore as an integrative probabilistic score of a gene being a target (meaning that for the target component ) of a miRNA:
(32) 
where , is the posterior in (31).
TargetScore demonstrates superior statistical power compared to existing methods in predicting validated miRNA targets in various human celllines. Moreover, the confidence targets from TargetScore exhibit comparable protein downregulation and are more significantly enriched for Gene Ontology terms. Using TargetScore, we explored oncomironcogenes network and predicted several potential cancerrelated miRNAmessenger RNA interactions. TargetScore is available at Bioconductor http://www.bioconductor.org/packages/devel/bioc/html/TargetScore.html.
Networkbased methods to detect miRNA regulatory modules
Although targets of individual miRNAs are significantly enriched for certain biological processes Papadopoulos:2009kt ; Tsang:2010jv , it is also likely that multiple miRNAs are coordinated together to synergistically regulate one or more pathways Krek:2005er ; Boross:2009gx ; Xu:2011kl . Indeed, despite their limited number (2578 mature miRNAs in human genome, miRBase V20, Kozomara:2014he ), miRNAs may be in charge of more evolutionarily robust and potent regulatory effects through coordinated collective actions. The hypothesis of miRNA synergism is also parsimonious or biologically plausible because the number of possible combinations of the 2578 human miRNAs is extremely large, enough to potentially react to virtually countless environmental changes. Intuitively, if a group of (miRNA) workers perform similar tasks together, then removing a single worker will not be as detrimental as assigning each worker a unique task Boross:2009gx .
Several related methods have been developed to study miRNA synergism. Some early methods were based on pairwise overlaps Shalgi:2007fe or scorespecific correlation Xu:2011kl between predicted target sites of any given two (coexpressed) miRNAs. For instance, Shalgi et al. (2007) devised an overlapping scoring scheme to account for differential 3UTR lengths of the miRNA targets, which may otherwise bias the results if standard hypergeometric test was used Shalgi:2007fe . Methods beyond pairwise overlaps have also been described. These methods considered not only the sequencebased miRNAtarget site information but also the respective miRNAmRNA expression correlation (MiMEC) across various conditions to detect miRNA regulatory modules (MiRMs).
For instance, Joung et al. (2007) developed a probabilistic search procedure to separately sample from the mRNA and miRNA pools candidate module members with probabilities proportional to their overall frequency of being chosen as the “fittest”, which was determined by their target sites and MiMEC relative to the counterparts Joung:2007kj . The algorithm found only the single best MiRM, which varied depending on the initial mRNA and miRNA set. Other networkbased methods using either the sequence information only or using m/miRNA expression profiles only as a filter for a more diseasefocused network construction on only the differentially expressed (DE) m/miRNAs. For instance, Peng et al.(2006) employed an enumeration approach to search for maximal biclique on DE m/miRNAs to discover complete bipartite subgraphs, where every miRNA is connected with every mRNA Peng:2009js . The approach operated on unweighted edges only, which required discretizing miRNAmRNA expression correlation. Also, maximal biclique does not necessarily imply functional MiRMs and vice versa.
The following subsections review in details three recently developed network methods (Table 1) to detect MiRMs. Despite distinct unsupervised learning frameworks, all three methods exploit the widely available paired m/miRNA expression profiles to improve upon the accuracy of earlier developed (sequencebased) network approaches.
3.3 GroupMiR: inferring miRNA and mRNA group memberships with Indian Buffet Process
The expressionbased methods reviewed elsewhere Yue:2009jt were essentially designed to explain the expression of each mRNA in isolation using a subset of the miRNA expression in a linear model with a fixed set of parameters. However, the same mRNAs (miRNAs) may interact with different sets of miRNAs (mRNAs) in different pathways. The exact number of pathways is unknown and may grow with an increase of size or quality of the training data. Thus, it is more natural to infer the number of common features shared among different groups of miRNAs and mRNAs. Accordingly, Le et al. (2011) proposed a powerful alternative model called GroupMiR (Group MiRNA target prediction) Le:2011wx . As an overview, GroupMiR first explored the latent binary features or memberships possessed within mRNAs, miRNAs, or shared between mRNAs and miRNAs on a potentially infinite binary feature space empowered by a nonparametric Bayesian (NBP) formalism. Thus, the number of features was inferred rather than determined arbitrarily. Importantly, the feature assignment took into account the prior information for miRNA and mRNA targeting relationships, obtained from sequencebased target prediction tools such as TargetScan or PicTar. Based on the shared memberships, mRNAs and/or miRNAs formed groups (or clubs). The same miRNAs (mRNAs) could possess multiple memberships and thus belong to multiple groups each corresponding to a latent feature. This was also biologically plausible since a miRNA (mRNA) may participate in several biological processes. Similar to GenMiR++ Huang:2007em
, GroupMiR then performed a Bayesian linear regression on each mRNA expression using
all miRNA expression but placing more weight on the expression of miRNAs that shared one or more common features with that mRNA.Specifically, the framework of GroupMiR was based on a recently developed general nonparametric Bayesian prior called the Indian Buffet Process (IBP) Griffiths:2005tp (which was later on proved to be equivalent to Beta process Thibaux:2007wn ). As the name suggests, IBP can be understood from an analogy of a type of an ‘Indian buffet’ as follows. A finite number of customers or objects form a line to enter one after another a buffet comprised of dishes or features. Each customer samples dishes selected by previous customers, and Poisson() new dishes, where is a model parameter. The choices of customers on the dishes are expressed in an binary matrix . A leftorder function maps a binary matrix to a leftordered binary matrix with columns (i.e., dishes) sorted from left to right by decreasing order of and breaking ties in favour of customers who enter the buffet earlier. This process defines an exchangeable distribution on equivalence class comprising all of the that have the same leftordered binary matrix regardless of the order the customers enter the buffet (i.e., row order) or the dish order (i.e., column order).
Before reviewing the IBP derivation, we need to establish some notations Griffiths:2005tp . () denotes the history of feature at object , which is encoded by a single decimal number. At object , for instance, a feature has one of four histories encoded by corresponding to all of the four possible permutations of choices for objects 1 and 2: . Accordingly, for objects, there are histories for each feature and histories excluding the history of all zeros (i.e., ). Additionally, denotes the number of features possessing the same history , for all features with , and for all features for which . Thus, . It is easy to see that binary matrices belong to an equivalence class if and only if they have the same history profile for each feature . The cardinality () of an equivalence class is the number of all of the binary matrices with the same history profile:
(33) 
As shown below, Eq (33) is essential in order to establish the closeformed solution of IBP prior when leads to an infinite feature space or infinite number of columns in . After establishing the above properties, the central steps in deriving the IBP prior used in GroupMiR is reviewed below. I will focus on some steps neglected from the original work and refer the reader to the full derivation when appropriate. As in the original papers, we first derive IBP on a finite number of latent features and then take the limit making use of Eq (33).
Let be an binary matrix, where for mRNAs and miRNAs, and is the number of latent features. Assuming the binary value in for each feature is sampled from and are conditionally independent given
, the joint distribution of
is then:(34) 
where follows a Beta prior with :
(35) 
where is a Beta function. To take into account the prior information between miRNA and mRNA targeting from sequencebased predictions, GroupMiR incorporated in Eq (35) an weight matrix :
(36) 
where interaction within mRNAs and within miRNAs were set to zeros and interaction between mRNA and miRNA followed the scoring matrix obtained from a quantitative sequencebased predictions. In particular, MiRanda scores were used in their paper. Thus, is either 0 or defined as a pairwise potential of interactions between mRNA and miRNA . The modified was then defined as:
(37) 
where and the partition function were defined as:
(38)  
(39) 
The marginal probability of is derived by integrating out as follows:
(40)  
(41)  
(42)  
(43)  
(44)  
(45) 
where in Eq (43) is the sum over all , and (45) directly follows the definition of Beta function. However, when since the probability of sampling a specific binary matrix from an infinite number of matrices is 0. Instead, the inference was performed over the equivalence class with the number of equivalent matrices defined above:
(46)  
(47)  
(48) 
where
(49) 
A more elaborate derivation of (48) was described in the Appendix from Le:2011wx and omitted here. Additionally, the authors also showed that when or equivalently for all histories , then Eq (48) reduces to the original IBP introduced in Griffiths:2005tp , which is thus a special case of the weighted IBP in GroupMiR.
Given the IBP prior Eq (48), the generative process for corresponding to an existing feature (where ) was derived as follows:
(50)  
(51)  
Comments
There are no comments yet.