Unsupervised Learning in Genome Informatics

08/03/2015 ∙ by Ka-Chun Wong, et al. ∙ UNIVERSITY OF TORONTO City University of Hong Kong 0

With different genomes available, unsupervised learning algorithms are essential in learning genome-wide biological insights. Especially, the functional characterization of different genomes is essential for us to understand lives. In this book chapter, we review the state-of-the-art unsupervised learning algorithms for genome informatics from DNA to MicroRNA. DNA (DeoxyriboNucleic Acid) is the basic component of genomes. A significant fraction of DNA regions (transcription factor binding sites) are bound by proteins (transcription factors) to regulate gene expression at different development stages in different tissues. To fully understand genetics, it is necessary of us to apply unsupervised learning algorithms to learn and infer those DNA regions. Here we review several unsupervised learning methods for deciphering the genome-wide patterns of those DNA regions. MicroRNA (miRNA), a class of small endogenous non-coding RNA (RiboNucleic acid) species, regulate gene expression post-transcriptionally by forming imperfect base-pair with the target sites primarily at the 3' untranslated regions of the messenger RNAs. Since the 1993 discovery of the first miRNA let-7 in worms, a vast amount of studies have been dedicated to functionally characterizing the functional impacts of miRNA in a network context to understand complex diseases such as cancer. Here we review several representative unsupervised learning frameworks on inferring miRNA regulatory network by exploiting the static sequence-based information pertinent to the prior knowledge of miRNA targeting and the dynamic information of miRNA activities implicated by the recently available large data compendia, which interrogate genome-wide expression profiles of miRNAs and/or mRNAs across various cell conditions.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Since the 1990s, the whole genomes of a large number of species have been sequenced by their corresponding genome sequencing projects. In 1995, the first free-living organism Haemophilus influenzae was sequenced by the Institute for Genomic Research pmid7542800 . In 1996, the first eukaryotic genome (Saccharomyces cerevisiase) was completely sequenced yeastGenome . In 2000, the first plant genome, Arabidopsis thaliana, was also sequenced by Arabidopsis Genome Initiative plantGenome . In 2004, the Human Genome Project (HGP) announced its completion pmid15496913 . Following the HGP, the Encyclopedia of DNA Elements (ENCODE) project was started, revealing massive functional putative elements on the human genome in 2011 pmid22955616 . The drastically decreasing cost of sequencing enables the 1000 Genomes Project to be carried out, resulting in an integrated map of genetic variation from 1,092 human genomes published in 2012 pmid23128226 . The massive genomic data generated by those projects impose an unforeseen challenge for large-scale data analysis at the scale of gigabytes or even terabytes.

Computational methods are essential in analyzing the massive genomic data. They are collectively known as bioinformatics or computational biology; for instance, motif discovery compSurvey helps us distinguish real signal subsequence patterns from background sequences. Multiple sequence alignment blast can be used to study the similarity between multiple sequences. Protein structure prediction proteinstructurepredictoin ; wong2010protein can be applied to predict the 3D tertiary structure from an amino acid sequence. Gene network inference geneNetwork are the statistical methods to infer gene networks from correlated data (e.g. microarray data). Promoter prediction promoterPrediction help us annotate the promoter regions on a genome. Phylogenetic tree inference phylogeneticInference can be used to study the hierarchical evolution relationship between different species. Drug scheduling drugScheduling ; wong2012evolutionary can help solve the clinical scheduling problems in an effective manner. Although the precision of those computational methods is usually lower than the existing wet-lab technology, they can still serve as useful preprocessing tools to significantly narrow search spaces. Thus prioritized candidates can be selected for further validation by wet-lab experiments, saving time and funding. In particular, unsupervised learning methods are essential in analyzing the massive genomic data where the ground truth is limited for model training. Therefore, we describe and review several unsupervised learning methods for genome informatics in this chapter.

2 Unsupervised Learning for DNA

In human and other eukaryotes, gene expression is primarily regulated by the DNA binding of various modulatory transcription factors (TF) onto cis-regulatory DNA elements near genes. Binding of different combinations of TFs may result in a gene being expressed in different tissues or at different developmental stages. To fully understand a gene’s function, it is essential to identify the TFs that regulate the gene and the corresponding TF binding sites (TFBS). Traditionally, these regulatory sites were determined by labor-intensive experiments such as DNA footprinting or gel-shift assays. Various computational approaches have been developed to predict TF binding sites in silico. Detailed comparisons can be found in the survey by Tompa et al. Tompa . TFBS are relatively short (10-20 bp) and highly degenerate sequence motifs, which makes their effective identification a computationally challenging task. A number of high-throughput experimental technologies were developed recently to determine protein-DNA binding such as protein binding microarray (PBM) pmid16998473 , chromatin immunoprecipitation (ChIP) followed by microarray or sequencing (ChIP-Chip or ChIP-Seq) chipchip ; chipseq , microfluidic affinity analysis pmid20802496 , and protein microarray assays pmid19879846 ; pmid16785442 .

On the other hand, it is expensive and laborious to experimentally identify TF-TFBS sequence pairs, for example, using DNA footprinting Footprint , gel electrophoresis Gel_shift , and SELEX pmid2200121 . The technology of Chromatin immunoprecipitation (ChIP) chipchip ; chipseq measures the binding of a particular TF to the nucleotide sequences of co-regulated genes on a genome-wide scale in vivo, but at low resolution. Further processing are needed to extract precise TFBSs pmid12101404 . Protein Binding Microarray (PBM) was developed to measure the binding preference of a protein to a complete set of k-mers in vitro pmid16998473 . The PBM data resolution is unprecedentedly high, comparing with the other traditional techniques. The DNA k-mer binding specificities of proteins can even be determined in a single day. It has also been shown to be largely consistent with those generated by in vivo genome-wide location analysis (ChIP-chip and ChIP-seq) pmid16998473 .

To store and organize the precious data, databases have been created. TRANSFAC is one of the largest databases for regulatory elements including TFs, TFBSs, weight matrices of the TFBSs, and regulated genes TRANSFAC06 . JASPAR is a comprehensive collection of TF DNA-binding preferences JASPAR2010 . Other annotation databases are also available (e.g. Pfam Pfam2004 , UniProbe pmid21037262 , ScerTF ScerTF , FlyTF FlyTF , YeTFaSCo pmid22102575 , and TFcat TFcat ). Notably, with the open-source and open-access atmosphere wide-spreads on the Internet in recent years, a database called ORegAnno appeared in 2008 ORegAnno . It is an open-access community-driven database and literature curation system for regulatory annotation. The ENCODE consortium has also released a considerable amount of ChIP-Seq data for different DNA-binding proteins pmid22955616 .

In contrast, unfortunately, it is still difficult and time-consuming to extract the high-resolution 3D protein-DNA (e.g. TF-TFBS ) complex structures with X-Ray Crystallography xray or Nuclear Magnetic Resonance (NMR) spectroscopic analysis NMR . As a result, there is strong motivation to have unsupervised learningl methods based on existing abundant sequence data, to provide testable candidates with high confidence to guide and accelerate wet-lab experiments. Thus unsupervised learning methods are proposed to provide insights into the DNA binding specificities of transcription factors from the existing abundant sequence data.

2.1 DNA Motif Discovery and Search

Transcription Factor Binding Sites (TFBSs) are represented in DNA motif models to capture its sequence degeneracy Wong:2011:GLP:2093590.2093604 . They are described in the following sections.

Representation (DNA Motif Model)

There are several motif models proposed. For example, consensus string representation, a set of motif instance strings, count matrix, position frequency matrix (PFM), and position weight matrix (PWM). Among them, the most popular motif models are the matrix ones. They are the count matrix, PFM, and PWM. In particular, the most common motif model is the zero-order PWM which has been shown to be related to the average protein-DNA binding energy in the experimental and statistical mechanics study pmid3612791 . Nonetheless, it assumes independence between different motif positions. A recent attempt has been made to generalize PWM but the indel operations between different nucleotide positions are still challenging alllevelcomplexity

. Although the column dependence and indel operations could be modeled by Hidden Markov Model (HMM) simultaneously, the number of training parameters is increased quadratically. There is a dilemma between accuracy and model complexity.

Count Matrix

The count matrix representation is the de facto standard adopted in databases. In the count matrix representation, a DNA motif of width is represented as a -by- matrix . The th column of corresponds to the th position of the motif, whereas the th row of correspond to the th biological character. In the context of DNA sequence, we have 4 characters A,C,G,T. is the occurring frequency of the biological character at the th position. For example, the count matrix of the SOX9 protein (JASPAR ID:MA0077.1 and UniProt ID:P48436) is tabulated in the following matrix form. The motif width is 9 so we have a matrix here. The corresponding sequence logo is also depicted in Figure 1.

Figure 1: DNA Motif Sequence Logo for SOX9 (JASPAR ID:MA0077.1 and UniProt ID:P48436). The vertical axis measures the information content, while the horizontal axis denotes the positions. The consensus string is DAACAATRG, following the standard IUPAC nucleotide code.

Notably, adenine has not been found at the 4th position, resulting in a zero value. It leads to an interesting scenario. Is adenine really not found at that position or the sample size (the number of binding sites we have found so far for SOX9) is too small that the sites having adenine at that position have not been verified experimentally yet ? To circumvent the problem, people usually add pseudo counts to the matrix which is justified from the use of prior probability in statistics

citeulike:163532 ; pmid19106141

. Such techniques are also found in natural language computing and machine learning.

Position Frequency Matrix (PFM)

In practice, a count matrix is usually converted to PFM, and thus a zero-order PWM for scanning a long sequence. The dimension and layout of count matrix is exactly the same as those of the corresponding PFM and zero-order PWM. Their main difference is the element type. For count matrix, each element is simply a count. For PFM, each element is a Maximum Likelihood Estimate (MLE) parameter. For zero-order PWM, each element is a weight.

To derive a PFM from a count matrix, , maximum likelihood estimation (MLE) is used mosesSurvey . Mathematically, we aim at maximizing the likelihood function . In addition, we impose a parameter normalization constraint for each position. It is added to the likelihood function with a Lagrange multiplier , resulting in a new log likelihood function:

By taking its partial derivatives to zero, it has been shown that . The MLE parameter definition is quite intuitive. It is simply the occurring fraction of a nucleotide at the same position. For example, given the previous SOX9 count matrix , we can convert it to a PFM as follows:

where a pseudocount = 1 is added to each element of .

We can observe that the most invariant positions of the SOX9 motif are the 4th and 6th position. At the 4th position, cytosines have been found most of the times while guanines and thymines have just been found few times.

Position Weight Matrix (PWM)

To scan a long sequence for motif matches using a PFM , we need to derive a PWM first so that the background distribution can be taken into account. As aforementioned, each element of PWM is a weight. Each weight can be viewed as a preference score. In practice, it is usually defined as the log likelihood ratio between the motif model and background model. Mathematically, where is the occurring fraction of the th nucleotide in all the background sequences such that, given a subsequence of the same width as the PWM (width ), we can compute a score by summation only:


where is the numeric index for the nucleotide at th position of the subsequence . For example, given the previous SOX9 PFM , we can convert it to a PWM as follows:

where we have assumed that the background distribution is uniform (i.e. for ) for illustration purposes.

Learning (Motif Discovery)

In general, motif discovery aims at building motif models (e.g. PFM) from related sequences. Nonetheless, there is a variety of motif discovery methods in different biological settings. From a computing perspective, they can be classified into several paradigms by its input data types:

  1. A Set of Sequences

  2. A Set of Sequences with Quantitative Measurements

  3. A Set of Orthologous Sequences

Motif Discovery for A Set of Sequences

The most classical one is de novo motif discovery which just takes a set of sequences as the inputs. The set of sequences is extracted such that a common transcription factor is believed to bind to them, assuming that motif models (e.g. consensus substrings) can be found from those sequences. For example, the promoter and enhancer sequences of the genes co-regulated by a common transcription factor or the sequence regions around the next generation parallel sequencing peaks called for a common transcription factor. Theoretically, Zia and Moses have proved a theoretical upper bound on the p-value which at least one motif with a specific information content occur by chance from background distribution (false positive) for the one-occurrence per sequence motif discovery problem pmid22738169 .

Chan et al. applied evolutionary computation techniques to the problem

pmid18065426 ; kcwong:EASE . Hughes et al. proposed a Gibbs sampling algorithm called AlignACE, to sample and evaluate different possible motif models using a priori log likelihood scores pmid10698627

. Workman et al. have proposed a machine learning approach using artificial neural networks (ANN-Spec)

pmid10902194 . Hertz et al. utilized the maximal information content principle to greedily search for a set of candidate sequences for building motif models (Consensus) pmid10487864 . Frith et al. have adopted simulated annealing approaches to perform multiple local alignment for motif model building (GLAM) pmid14704356

. Ao et al. have used expectation maximization to determine DNA motif position weight matrices (Improbizer)

pmid15375261 . Bailey et al. have proposed MEME to optimize the expected value of a statistic related to the information content of motif models pmid7584439 . A parameter enumeration pipeline wrapping MEME (MUSI) was proposed for elucidating multiple specificity binding modes pmid22210894 . Eskin et al. employed a tree data structure to find composite weak motifs (MITRA) pmid12169566 . Thijs et al. have further improved the classic Gibbs sampling method and called it MotifSampler pmid11751219 . Van Helden et al. have proposed a counting algorithm to detect statistically significant motifs pmid9719638 . Regnier and Denise have proposed an exhaustive search algorithm (QuickScore) regnier04:_rare

. Favorov et al. have utilized Markov Chain Monte Carlo to solve the problem in a Bayesian manner (SeSiMCMC)

pmid15728117 . Pavesi et al. have proposed an exhaustively enumerated and consensus-based method (Weeder) pmid15215380

. Another exhaustive search algorithm to optimize z-scores (YMF) has been proposed by Sinha and Tompa

pmid12824371 .

Although different statistical techniques have been developed for the motif discovery problem, most of the existing methods aim at building motif models in the form of either a set of strings or a zero-order PWM. Nonetheless, it is well known that nucleotide dependencies and indel operations exist within some TFBSs (a.k.a. motifs) pmid17308339 ; pmid18184687 . It is desirable to develop new methods which can capture and model such information.

Motif Discovery for A Set of Sequences with Quantitative Affinity Measurements

It has been pointed out that a fundamental bottleneck in TFBS identification is the lack of quantitative binding affinity data for a large portion of transcription factors. The advancement of new high-throughput technologies such as ChIP-Chip, ChIP-Seq, and Protein Binding Microarray (PBM) has made it possible to determine the binding affinity of these TFs (i.e. each sequence can be associated with a binding intensity value) pmid19443739 . In particular, the PBM technology can enable us to enumerate all the possible k-mers, providing an unprecedentedly high resolution binding site affinity landscape for each TF. In light of this deluge of quantitative affinity data, robust probabilistic methods were developed to take into account those quantitative affinity data. Seed and Wobble has been proposed as a seed-based approach using rank statistics pmid16998473 . RankMotif++ was proposed to maximize the log likelihood of their probabilistic model of binding preferences pmid17646348 . MatrixREDUCE was proposed to perform forward variable selections to minimize the sum of squared deviations pmid16317069 . MDScan was proposed to combine two search strategies together, namely word enumeration and position-specific weight matrix updating pmid12101404 . PREGO was proposed to maximize the Spearman rank correlation between the predicted and the actual binding intensities pmid16809671 . Notably, Wong et al. have proposed and developed a hidden Markov model approach to learn the dependence between adjacent nucleotide positions rigorously; they also show that their method (kmerHMM) can deduce multiple binding modes for a given TF pmid23814189 .

Note that this paradigm is a generalization from the motif discovery for a set of sequences with binary measurements. For example, SeedSearch Barash:2001:SHA:645906.673098 and DME pmid15668401 . In other words, it also includes the discriminative motif discovery method in which a set of motif-containing sequences and a set of background sequences are given as the input since we can assign a value of 1 to each motif-containing sequence and 0 to each background sequence.

Motif Discovery for A Set of Orthologous Sequences

It is generally acknowledged that by comparing evolutionarily related DNA or protein sequences, functionally important sequences or motifs can be revealed by such comparison. A functional motif is assumed to be more conserved across different species than the background sequences 15534694 . By incorporating the evolutionary conservation with sequence-specific DNA binding affinity, different methods have been proposed. Moses et al. have proposed an extension of MEME to take the sequence evolution into account probabilistically pmid14992514 . Kellis et al. have proposed a spaced hexamer enumeration approach to identify conserved motifs pmid15285895 . FootPrinter has been proposed as a substring-parsimony based approach using dynamic programming to find statistically enriched motif substrings pmid12015878 . A Gibbs sampling approach named PhyloGibbs has also been proposed pmid16477324 .

Prediction (Motif Search)

After a motif model has been found, it is always desirable to apply it to search for motif instances over a given sequence (e.g. ChIP-Seq peak sequences). Some basic search methods have been developed to search motif instances over a sequence. Nonetheless, those methods do not have sufficient motif model complexity to distinguish false positives from true positives over a long sequence (e.g. 100k bp) pmid15131651 . To cope with that, some improvements have been made. In general, most of them utilize the biological information beyond the motif sequence specificity to augment the motif model complexity insufficiency. In particular, multiple motif information and evolutionary conservation have been readily adopted to improve the discovery accuracy mosesSurvey .

Basic Searches
Likelihood Ratio

Given a sequence of length and a PWM of the motif of width , we scan with a window of width such that the subsequences which likelihood ratio score is higher than a pre-specified threshold are considered the instances (hits) of the motif . Mathematically, a subsequence is considered as a motif instance (hit) if and only if the following condition is satisfied:

where is the th nucleotide among {A,C,G,T} and is the Iverson bracket. Nonetheless, different motifs may have different likelihood score distributions. It is difficult to set a single and fixed threshold which can work for all the motifs. To solve the problem, one can normalize the score to the interval [0,1] based on the maximal and minimal scores as follows:

Posterior Ratio

If we know the prior probability of motif occurrence , it can also be incorporated into the scoring function in a posterior manner mosesSurvey . Mathematically, given a sequence , motif model (including PFM and PWM ), and background distribution , we can compute the posterior ratio as follows:


It can be observed that the posterior ratio can be computed from the likelihood ratio by simply adding the logarithm of the prior probability ratio.


Given the previous scoring functions, it is not easy to set a threshold since they are just ratios. For example, if in the above example, it just means the likelihood that the sequence is generated by the motif model is higher than the background and vice versa. To justify it in a meaningful way, P-value distribution can be calculated from a motif model. Given a motif PWM of width , an exhaustive search can be applied to traverse all the possible sequences of width . Nonetheless, it takes time complexity for the DNA alphabet . Interestingly, if the PWM is of zero-order, we can exploit the column independence assumption and apply dynamic programming to calculate the exact P-value distribution in time complexity pmid2720468 . In practice, the empirical P-value distribution may also be used.

Nonetheless, the specificity of a PWM of width is still not high if it is applied to a very long sequence of length . Mathematically, even if we just assign the best match as the hit, hits are still expected (e.g. If and , hits are expected), assuming that the sequence is uniform in background nucleotide distribution. To solve the problem, people have spent efforts on incorporating more biological information to improve the motif search.

Incorporating Multiple Motif Information

To improve motif search, multiple motif information can be incorporated. Multiple motif sites are usually clustered together, resulting in higher signal-to-noise ratios which can be easier to be detected than alone. If multiple sites of the same motif are clustered together within a short distance, it is called homotypic clustering pmid12670999 . On the other hand, if multiple sites of different motifs are clustered together within a short distance, it is called heterotypic clustering.

To exploit the additional clustering signals beyond sequence specificity, MAST was proposed to multiply the P-values of multiple motif matches (hits) together, which has demonstrated superior performance in sequence homology search than the other two methods proposed in the same study pmid9672829 . CIS-ANALYST was proposed as a sliding window approach to predict the windows which have at least motif matches (hits) with pvalues pmid11805330 . Sinha et al. have proposed a probabilistic model, Stubb, to efficiently detect clusters of binding sites (i.e. cis-regulatory modules) over genomic scales using maximum likelihood estimation pmid16845069 . To determine the window size parameter, a window size adjustment procedure has been used in ClusterBuster to find clusters of motif matches pmid12824389 . Segal et al. have also derived an expectation maximization algorithm to model the clusters of motif matches as probabilistic graphical models pmid12855470 . Recently, Hermann et al. have proposed an integrative system (i-cisTarget) to combine the high-throughput next generation sequencing data with motif matches to provide accurate motif cluster search pmid22718975 . Notably, Zhou and Wong have shown that it is possible to search for clusters of motifs in a de novo way (i.e. without any given motif model and information) pmid15297614 .

Incorporating Evolutionary Conservation

Another approach to improve motif search is to incorporate evolutionary conservation. The rationale behind that is similar to that behind phylogenetic motif discovery which we have described in a previous section. Deleterious mutations will be removed from population by negative selection. To make use of that fact, we could imply that a true motif match should be more conserved across closely related species (For example, chimpanzee and mouse) than background sequences 15534694 . For instance, a windowing approach with several thresholds for motif matches and conservation, ConSite, was proposed by Sandelin et al. pmid15215389 . Nonetheless, it is limited to pair-wise analysis. rVISTA is a similar approach pmid16888351 using the Match program for motif matching in TRANSFAC TRANSFAC06 . Bayesian Branch Length Score (BBLS) was proposed as a evolutionary conservation score without relying on any multiple sequence alignment pmid19017655 . A parsimonious method for finding statistically significant k-mers with dynamic programming was proposed (FoorPrinter) pmid11997340 . Notably, Moses et al. proposed a comprehensive probabilistic model to search for motif instances with efficient p-value estimation (MONKEY) pmid15575972 .

To search for novel motif instances, there are programs aimed at searching for motif matches without any given motif model and information. For instance, Ovcharenko et al. have used likelihood ratio tests to distinguish conserved regions from the background pmid15173121 . Siepel et al. have demonstrated an approach in identifying conserved regions using hidden Markov models called PhastCons pmid16024819 .

Incorporating Both Approaches

Both motif clustering information and evolutionary conservation were demonstrated beneficial to motif search. Since they are independent of each other, it is straightforward to combine them. Philippakis et al. have proposed a method to combine both types of information (i.e. motif clustering and evolutionary conservation), achieving good performance on experimentally verified datasets pmid15759656 . MONKEY has been extended by Warner et al. to exploit the motif clustering information to predict motif clusters (PhylCRM) pmid18311145 . It has been reported that the misalignment errors of the input reference sequences from other species could affect the quality of phylogenetic footprinting for motif search. Thus statistical alignments have been used to assist motif search in EMMA pmid19293946 . Stub has also been extended to StubMS to take multiple species conservation into account using a HMM phylogenetic model pmid12855472 . Notably, a unified probabilistic framework which integrates multiple sequence alignment with binding site predictions, MORPH, was proposed by the same group pmid17997594 . Its effectiveness has been demonstrated and verified in an independent comparison study pmid21152003 .

2.2 Genome-wide DNA Binding Pattern Discovery

Chromatin immunoprecipitation (ChIP) followed by high-throughput sequencing (ChIP-Seq) measures the genome-wide occupancy of transcription factors in vivo. In a typical ChIP-Seq study, the first step is to call the peaks, i.e. determining the precise location in the genome where the TF binds. A number of peak calling tools have been developed; for instance, model-based analysis of ChIP-Seq data (MACS) was proposed to model the shift size of ChIP-Seq tags and local biases to improve its peak-calling accuracy pmid18798982 . Spp is another method with a strong focus on background signal correction pmid19029915 . PeakSeq is a two-pass strategy method. The first pass accounts for the sequence mappability while the second pass is to filter out statistically insignificant regions comparing to controls pmid19122651 . CisGenome refines peak boundaries and uses a conditional binomial model to identify peak regions pmid18978777 . However, recent benchmark studies suggest that their predicted peaks are distinct from each other pmid20628599 ; pmid20017957 .

Different combinations of DNA-binding protein occupancies may result in a gene being expressed in different tissues or at different developmental stages. To fully understand a gene’s function, it is essential to develop unsupervised learning models on multiple ChIP-Seq profiles to decipher the combinatorial regulatory mechanisms by multiple transcription factors.

Since multiple transcription factors often work in cis regulatory modules to confer complex gene regulatory programs, it is necessary to develop models on multiple ChIP-Seq datasets to decipher the combinatorial DNA-binding mechanism. In the following, we briefly review some of the previous works in this area. Gerstein et al. used pair-wise peak overlapping patterns to construct a human regulatory network pmid22955619

. Xie et al. proposed self organizing map methods to visualize the co-localization of DNA-binding proteins

pmid24243024 . Giannopoulou et al. proposed a non-negative matrix factorization to elucidate the clustering of DNA-binding proteins pmid23554462 . Zeng and colleagues proposed jMOSAiCS to discover histone modification patterns across multiple ChIP-Seq datasets pmid23844871 . Ferguson et al. have described a hierarchical Bayes approach to integrate multiple ChIP-Seq libraries to improve DNA binding event predictions. In particular, they have applied the method to histone ChIP-Seq libraries and predicted the gene locations associated with the expected pathways ferguson2012new . Mahony et al. also proposed a mixture model (MultiGPS) to detect differential binding enrichment of a DNA-binding protein in different cell lines, which can improve the protein’s DNA binding location predictions (i.e. Cdx2 protein in their study) mahony2014integrated . On the other hand, Chen et al. proposed a statistical framework (MM-ChIP) based on MACS to perform an integrative analysis of multiple ChIP datasets to predict ChIP-enriched regions with known motifs for a given DNA-binding protein (i.e. ER and CTCF proteins in their study) chen2011mm

. On the other hand, Ji et al. proposed a differential principal component analysis method on ChIP-Seq to perform unsupervised pattern discovery and statistical inference to identify differential protein-DNA interactions between two biological conditions

ji2013differential . Guo et al. described a generative probabilistic model (GEM) for high resolution DNA binding site discovery from ChIP data guo2012high . Interestingly, that model combines ChIP signals and DNA motif discovery together to achieve precise predictions of the DNA binding locations of a DNA-binding protein. The authors have further demonstrated how GEM can be applied to reveal spatially constrained transcription factor binding site pairs on a genome.

Despite the success of the methods described above, to fully understand a gene’s function, it is essential to develop probabilistic models on multiple ChIP-Seq profiles to decipher the genome-wide combinatorial patterns of DNA-binding protein occupancy. Unfortunately, the majority of the previous work usually focused on large-scale clustering of called peaks, which is an intuitive and straightforward approach. However such approaches have two limitations, as (i) peak-calling ignores the contributions from weak bindings of TFs, and (ii) pair-wise analysis ignores the complex combinatorial binding pattern among the TFs. Thus an unsupervised learning model called SignalSpider has been proposed to directly analyze multiple normalized ChIP-Seq signal profiles on all the promoter and enhancer regions quantitatively so that weak bindings can be taken into account wong2014signalspider ; pmid22955978 . Especially, its computational complexity has been carefully designed to scale with the increasing ChIP-Seq data (i.e. linear complexity). With such a linear complexity, the method (SignalSpider) has been successfully applied to more than 100 ChIP-Seq profiles in an integrated way, revealing different genome-wide DNA-binding modules across the entire human genome (hg19) wong2014signalspider .

3 Unsupervised Learning for inferring microRNA regulatory network

While transcription factors (TFs) are the major transcriptional regulator proteins, microRNA (miRNA), a small 22 nucleotide noncoding RNA species, has been shown to play a crucial role in post-transcriptional and/or translational regulation Bartel:2009fh . Since the 1993 discovery of the first miRNA let-7 in worms, a vast amount of studies have been dedicated to functionally characterizing miRNAs with a special emphasis on their roles in cancer. While TFs can serve either as a transcriptional activator or as a repressor, miRNAs are primarily known to confer mRNA degradation and/or translational repression by forming imperfect base-pair with the target sites primarily at the 3 untranslated regions of the messenger RNAs Guo:2010kh . While miRNAs are typically 22 nt long, several experimental studies combined with computational methods Lewis:2003uz ; Doench:2004ct ; Kiriakidou:2004kc ; Burgler:2005ge ; Lewis:2005cb ; Alexiou:2009jq ; Yue:2009jt have shown that only the first six or seven consecutive nucleotides starting at the second nucleotide from the 5 end of the miRNA are the most crucial determinants for target site recognition (Figure 2). Accordingly, the 6mer or 7mer close to the 5 region of the miRNA is termed as the “seed” region or seed. miRNAs that share common seeds belong to an miRNA family as they potentially target a vastly common set of mRNAs. Moreover, the target sites at the 3UTR Watson-Crick (WC) pairing with the miRNA seed are preferentially more conserved within mammalian or among all of the vertebrate species Lewis:2003uz . In humans, more than one third of the genes harbour sites under selective pressure to maintain their pairing to the miRNA seeds Lewis:2003uz ; Grimson:2007cy . An important variation around this seed-target pairing scheme was discovered by Lewis et al. (2005), where the target site is flanked by a conserved adenosine ‘A’ facing to the first nucleotide of the targeting miRNA Lewis:2005cb .

The dynamics of the miRNA regulatory network are implicated in various phenotypic changes including embryonic development and many other complex diseases Song:2006hm ; Croce:2009ff . Although abnormal miRNA expression can sometimes be taken as a stronger indicator of carcinoma in clinical samples than aberrant mRNA expression Ramaswamy:2001hc ; Lu:2005hx , the system level mechanistic effects are usually unclear. A single miRNA can potentially target 400 distinct genes, and there are thousands of distinct endogenous miRNAs in the human genome. Thus, miRNAs are likely involved in virtually all biological processes and pathways including carcinogenesis. However, functional characterizing miRNAs hinges on the accurate identification of their mRNA targets, which has been a challenging problem due to imperfect base-pairing and condition-specific miRNA regulatory dynamics. In this section, we discuss the current state-of-art approaches in referring miRNA or miRNA-mediated transcriptional regulatory network. Table 1 summarizes these methods. As we will see, each method is established through an effective unsupervised learning model by exploiting the static sequence-based information pertinent to the prior knowledge of miRNA targeting and/or the dynamic information of miRNA activities implicated by the recently available large data compendia, which interrogate genome-wide expression profiles of miRNAs and/or mRNAs across various cell conditions.

Figure 2: Canonical miRNA Watson-Crick base pairing to the 3UTR of the mRNA target site. The most critical region is a 6mer site termed as the “seed” occurs at the 2-7 position of the 5 end of the miRNA Lewis:2003uz . Three other variations centring at the 6mer seed are also known to be (more) conserved: 7mer-m8 site, a seed match + a Watson-Crick  match to miRNA nucleotide 8; 7mer-t1A site, a seed match + a downstream A in the 3UTR; 8mer, a seed match + both m8 and t1A. The site efficacy has also been proposed in the order of 8mer 7mer-m8 7mer- A1 6mer Friedman:2009km ; Grimson:2007cy . The abbreviations are: ORF, open reading frame; (NNNNN), the additional nucleotides to the shortest 19 nt miRNA; [AN], A or other nucleotides; Poly(A), polyadenylated tail.
Method Algorithm Section Ref
PicTar Hidden Markov Model 3.1 Krek:2005er
TargetScore Variational Bayesian Mixture Model 3.2 Li:2013wk
GroupMiR Nonparametric Bayesian with Indian Buffet Process 3.3 Le:2011wx
SNMNMF Constrained nonnegative matrix factorization 3.4 Zhang:2011ce
Mirsynergy Deterministic overlapping neighbourhood expansion 3.5 Li:2014fa

PicTar: Probabilistic identification of combinations of Target sites; GroupMiR: Group MiRNA target prediction; SNMNMF: Sparse Network-regularized Multiple Nonnegative Matrix Factorization; PIMiM: Protein Interaction-based MicroRNA Modules.

Table 1: Unsupervised learning methods reviewed in this chapter

3.1 PicTar

PicTar (Probabilistic identification of combinations of Target sites) is one of the few models that rigorously considers the combinatorial miRNA regulations on the same target 3UTR Krek:2005er . As an overview, PicTar first pre-filters target sites by their conservation across select species. However, the fundamental framework of PicTar is based on hidden Markov model (HMM) with a maximum likelihood (ML) approach, which is built on the logics of several earlier works from Siggia group Durbin:1998wz ; Bussemaker:2000dl ; Sinha:2003ui ; Rajewsky:2002tt . Among these works, PicTar was most inspired by “Ahab”, an HMM-based program developed (by the same group) to predict the combinatorial TF binding sites (TFBS) Rajewsky:2002tt . Although PicTar has been successfully applied to three studies on vertebrates Krek:2005er (where the original methodology paper was described), fly Grun:2005ix , and worm Lall:2006hh (where some improvements were described), the description of the core HMM algorithm of PicTar is rather brief. Here we will lay out the detailed technicality of the algorithm based on the information collected from several related works Durbin:1998wz ; Bussemaker:2000dl ; Sinha:2003ui ; Rajewsky:2002tt ; MBishop:2006uc , which will help highlight its strengths, limitations, and possible future extensions.

Let be a 3UTR sequence, the length of , and the target sites for miRNA “word” of length , the transition probability of the occurrence of miRNA , and the transition probability for the background of length , which is simply estimated from the fraction of A, U, G, C (i.e., Markov model of order 0) either in with length 300 nt or from all query UTRs. To simplify notation, the background letters are treated as a special word so that and . Thus, can be represented by multiple different ways of concatenating the segments corresponding to either miRNA target sites or background. The goal is to obtain at any arbitrary nucleotide position of the 3UTR sequence

the posterior probability

that is the last position of the word, where is the model parameters controlling emission probabilities (see below).

Following Markov’s assumption, is proportional to the products of the probabilities before and after position , which can be computed in time by Forward-Backward algorithm as described below. Formally,


where is the likelihood of sequence or the objective function to be maximized in the ML framework, and can be represented in recursion forms and computed via forward and backward algorithm, respectively in time for words. Formally, the forward algorithm is derived as follows:


where is the emission probability assumed known (see below) and is the transition probability to word . Note that is position independent. To start the recursion, .

The backward algorithm is similarly derived:


To start the backward recursion, , which is simply the frequency of word in the 3UTR sequence .

Given the emission probabilities, ’s (transition probability) are the only parameters that need to be set in order to maximize . Following the ML solution,


where the posterior is calculated by Eq (15), which is in turn computed by forward-backward algorithm. Finally, the likelihood (objective) function is evaluated by a simple forward pass to the end of the sequence:


Together, the optimization of in PicTar is performed using Baum-Welch algorithm for Expectation-Maximization (EM) as summarized in Algorithm 1. Finally, the PicTar score is defined as a log ratio () of the ML over background likelihood :


where is the likelihood when only background is considered.

Initialize and
E-step: Forward recursion (): compute by Eq (21) Backward recursion (): compute by Eq (25)
M-step: Update by Eq (26)
Evaluate likelihood: Compute by Eq (28)
Repeat EM steps until increases by less than a cutoff
Algorithm 1 Baum-Welch HMM algorithm in PicTar Krek:2005er

As previously mentioned, the emission probabilities in PicTar are assumed known. In Ahab for modelling TFBS, or is based on the position frequency matrix (PFM): , where is the normalized frequency of the nucleotide at the position of the PFM. However, miRNAs do not have PFM. The original PicTar arbitrarily sets to be 0.8 if there is perfect seed-match at 1-7 or 2-8 nt positions of the miRNA 5 end AND the free binding energy as estimated by RNAhybrid is no less than 33% of the optimal binding energy of the entire miRNA sequence to the UTR Lorenz:2011it ; otherwise is set to 0.2 divided by for imperfect seed matches with only 1 mismatch allowed, provided it is above 66% of the optimal binding energy. Thus, the setting highly disfavours imperfect seed match. The later version of PicTar changes the emission probability calculation to be the total number of occurrences in conserved 3UTR sites divided by the total number of sites 3UTR Lall:2006hh . The setting appears to improve the sensitive/specificity of the model but makes it more dependent on the cross-species conservation, potentially prone to false negatives (for the non-conserved but functional sites).

The major advantage of PicTar over other simpler methods is that the coordinate actions of the miRNAs (synergistic in case of optimally spaced sites or antagonistic in case of overlapping binding sites) are naturally captured within the emission and transition probabilities. For instance, the as the joint ML of multiple miRNA target sites will be higher than the linear sum of individual miRNA target sites (i.e., synergistic effects). Longer 3UTR will score less than shorter 3UTR if both contain the same number of target sites. PicTar demonstrated a comparable signal:noise ratio relative to TargetScan and was compared favourably with some of the earlier published methods based on several surveys Alexiou:2009jq ; Yue:2009jt ; Sethupathy:2006df ; Li:2013wk . When applied to vertebrates, PicTar identified roughly 200 genes per miRNA, which is a rather conservative estimate compared to the recent findings by TargetScan with a conserved targeting scoring approach Friedman:2009km . When applied to C. elegans (worm), PicTar identified 10% of the C. elegans genes that are under conserved miRNA regulation.

Nonetheless, PicTar has three important limitations. First, PicTar does not consider the correlation between miRNA target sites since is essentially position independent. This is perhaps largely due to the increased model complexity when considering all pairwise transition probabilities between miRNAs and background since there will be parameters to model (as apposed to only ). Supported by Grimson:2007cy , however, the specific spatial arrangement of the target sites may be functionally important. In particular, the optimal distance between miRNA sites was estimated as 8 to

40 nt based on transfection followed by regression analysis

Grimson:2007cy . Although unsupported by experimental evidence, the ordering of some specific target sites may be also important. For instance, target site must be located before target site (for the same or different miRNA) to achieve optimal synergistic repression. The model that takes into account the spatial correlation between motifs is called the hcHMM (history conscious HMM) implemented in a program called Stubb for detecting TF binding sites (TFBS) rather than miRNA target predictions Sinha:2006bn .

Second, the ML approach is prone to local optimal especially for long UTRs or many coordinated miRNA actions considered simultaneously (i.e., many ’s). An alternative HMM formulation is to impose Bayesian priors on the HMM parameters MacKay97ensemblelearning . In particular, Wu:2008fq

demonstrated such Bayesian formalism of HMM in modelling combinatorial TFBS. In their model so-called “module HMM”, the transition probabilities is assumed to follow a Dirichlet distribution with hyperparameters

, where corresponds to the background, to the TFs, and to the background inside of cis-regulatory module. Inference is performed via Markov Chain Monte Carlo (MCMC) procedure or Gibbs sampling in particular. Briefly, a forward pass and backward pass are run to generate marginal probabilities at each nucleotide position. Starting at the end of the sequence, hidden states are sampled at each position based on the marginals until reaching the front of the sequence. Given the hidden states, the hyperparameters for Dirichlet distribution of the transition probabilities are then updated by simply counting the occurrences of each state. The posteriors of the hidden states at each nucleotide position are inferred as the averaged number of times the states are sampled in 1000 samplings, and the states with the maximum a posteriori (MAP) are chosen. The model is demonstrated to perform better than Ahab and Stubb (which are the basis for PicTar) in TFBS predictions but have not yet been adapted to miRNA target predictions.

Third, since data for expression profiling of mRNAs and miRNAs by microarrays or RNA-seq is now rather abundant (e.g., Babak:2004ec , GSE1738; GSE31568, Keller:2011jk ; GSE40499, Meunier:2012dv from GEO) or ENCODE (GSE24565, Djebali:2013hc ) or TCGA CancerGenomeAtlasResearchNetwork:2008gr ; Weinstein:2013jp , the combinatorial regulation needs to be revisited by taking into account whether or not the co-operative miRNAs are indeed expressed in vivo and/or the expression correlation between mRNA and miRNA. In particular, the emission probabilities need to be redefined to integrate both sequence-based and expression-based information.

3.2 A probabilistic approach to explore human miRNA target repertoire by integrating miRNA-overexpression data and sequence information

One of the most direct way to query the targets of a given miRNA is by transfecting the miRNA into a cell and examine the expression changes of the cognate target genes Lim:2005cd . Presumably, a bona fide target will exhibit decreased expression upon the miRNA transfection. In particular, overexpression of miRNA coupled with expression profiling of mRNA by either microarray or RNA-seq has proved to be a promising approach Lim:2005cd ; Arvey:2010ev . Consequently, genome-wide comparison of differential gene expression holds a new promise to elucidate the global impact of a specific miRNA regulation without solely relying on evolutionary conservation. However, miRNA transfection is prone to off-target effects. For instance, overexpressing a miRNA may not only repress the expression of its direct targets but also cause a cascading repression of in-direct targets of the affected transcription activators. To improve prediction accuracy for direct miRNA targets, this chapter describes a novel model called TargetScore that integrates expression change due to miRNA overexpression and sequence information such as context score Grimson:2007cy ; Garcia:2011ec and other orthogonal sequence-based features such as conservation Friedman:2009km into a probabilistic score.

In one of our recent papers, we described a novel probabilistic method for miRNA target prediction problem by integrating miRNA-overexpression data and sequence-based scores from other prediction methods Li:2013wk

. Briefly, each score feature is considered as an independent observed variable, which is the input to a Variational Bayesian-Gaussian Mixture Model (VB-GMM). We chose a Bayesian over a maximum likelihood approach to avoid overfitting. Specifically, given expression fold-change (due to miRNA transfection), we use a three-component VB-GMM to infer down-regulated targets accounting for genes with little or positive fold-change (due to off-target effects

Khan:2009ki ). Otherwise, two-component VB-GMM is applied to unsigned sequence scores. The parameters for the VB-GMM are optimized using Variational Bayesian Expectation-Maximization (VB-EM) algorithm. The mixture component with the largest absolute means of observed negative fold-change or sequence score is associated with miRNA targets and denoted as “target component”. The other components correspond to the “background component”. It follows that inferring miRNA-mRNA interactions is equivalent to inferring the posterior distribution of the target component given the observed variables. The targetScore is computed as the sigmoid-transformed fold-change weighted by the averaged posteriors of target components over all of the features.

Bayesian mixture model

Assuming there are genes, we denote as the expression fold-change () or sequence scores (). Thus, for sets of sequence scores, . To simplify the following equations, we use to represent one of the independent variables without loss of generality. To infer target genes for a miRNA given , we need to obtain the posterior distribution of the latent variable , where =3 (=2) for modelling signed (unsigned) scores such as logarithmic fold-changes (sequence scores).

We follow the standard Bayesian GMM based on MBishop:2006p485 (p474-482) with only minor modifications. Although univariate GMM () is applied to each variable separately, we implemented and describe the following formalism as a more general multivariate GMM, allowing modeling the covariance matrices. Briefly, the latent variables are sampled at probabilities (mixing coefficient), that follow a Dirichlet prior with hyperparameters . To account for the relative frequency of targets and non-targets for any miRNA, we set the (associated with the target component) to and other , where (by default). Assuming

follows a Gaussian distribution

, where (precision matrix) is the inverse covariance matrix, together follow a Gaussian-Wishart prior , where the hyperparameters .

Variational Bayesian Expectation Maximization

Let . The marginal log likelihood can be written in terms of lower bound

(first term) and Kullback-Leibler divergence

(second term):


where is a proposed distribution for , which does not have a closed form distribution. Because is a constant, maximizing implies minimizing . The general optimal solution is the expectation of variable w.r.t other variables, . In particular, we define . The expectations for the three terms (at log scale), namely , have the same forms as the initial distributions due to the conjugacy of the priors. However, they require evaluation of the parameters , which in turn all depend on the expectations of or the posterior of interest:


where . The inter-dependence between the expectations and model parameters falls naturally into an EM framework, namely VB-EM. Briefly, we first initialize the model parameters based on priors and randomly sample data points . At the iteration, we evaluate (31) using the model parameters (VB-E step) and update the model parameters using (31) (VB-M step). The EM iteration terminates when improves by less than (default). Please refer to MBishop:2006p485 for more details.


We define the targetScore as an integrative probabilistic score of a gene being a target (meaning that for the target component ) of a miRNA:


where , is the posterior in (31).

TargetScore demonstrates superior statistical power compared to existing methods in predicting validated miRNA targets in various human cell-lines. Moreover, the confidence targets from TargetScore exhibit comparable protein downregulation and are more significantly enriched for Gene Ontology terms. Using TargetScore, we explored oncomir-oncogenes network and predicted several potential cancer-related miRNA-messenger RNA interactions. TargetScore is available at Bioconductor http://www.bioconductor.org/packages/devel/bioc/html/TargetScore.html.

Network-based methods to detect miRNA regulatory modules

Although targets of individual miRNAs are significantly enriched for certain biological processes Papadopoulos:2009kt ; Tsang:2010jv , it is also likely that multiple miRNAs are coordinated together to synergistically regulate one or more pathways Krek:2005er ; Boross:2009gx ; Xu:2011kl . Indeed, despite their limited number (2578 mature miRNAs in human genome, miRBase V20, Kozomara:2014he ), miRNAs may be in charge of more evolutionarily robust and potent regulatory effects through coordinated collective actions. The hypothesis of miRNA synergism is also parsimonious or biologically plausible because the number of possible combinations of the 2578 human miRNAs is extremely large, enough to potentially react to virtually countless environmental changes. Intuitively, if a group of (miRNA) workers perform similar tasks together, then removing a single worker will not be as detrimental as assigning each worker a unique task Boross:2009gx .

Several related methods have been developed to study miRNA synergism. Some early methods were based on pairwise overlaps Shalgi:2007fe or score-specific correlation Xu:2011kl between predicted target sites of any given two (co-expressed) miRNAs. For instance, Shalgi et al. (2007) devised an overlapping scoring scheme to account for differential 3UTR lengths of the miRNA targets, which may otherwise bias the results if standard hypergeometric test was used Shalgi:2007fe . Methods beyond pairwise overlaps have also been described. These methods considered not only the sequence-based miRNA-target site information but also the respective miRNA-mRNA expression correlation (MiMEC) across various conditions to detect miRNA regulatory modules (MiRMs).

For instance, Joung et al. (2007) developed a probabilistic search procedure to separately sample from the mRNA and miRNA pools candidate module members with probabilities proportional to their overall frequency of being chosen as the “fittest”, which was determined by their target sites and MiMEC relative to the counterparts Joung:2007kj . The algorithm found only the single best MiRM, which varied depending on the initial mRNA and miRNA set. Other network-based methods using either the sequence information only or using m/miRNA expression profiles only as a filter for a more disease-focused network construction on only the differentially expressed (DE) m/miRNAs. For instance, Peng et al.(2006) employed an enumeration approach to search for maximal bi-clique on DE m/miRNAs to discover complete bipartite subgraphs, where every miRNA is connected with every mRNA Peng:2009js . The approach operated on unweighted edges only, which required discretizing miRNA-mRNA expression correlation. Also, maximal bi-clique does not necessarily imply functional MiRMs and vice versa.

The following subsections review in details three recently developed network methods (Table 1) to detect MiRMs. Despite distinct unsupervised learning frameworks, all three methods exploit the widely available paired m/miRNA expression profiles to improve upon the accuracy of earlier developed (sequence-based) network approaches.

3.3 GroupMiR: inferring miRNA and mRNA group memberships with Indian Buffet Process

The expression-based methods reviewed elsewhere Yue:2009jt were essentially designed to explain the expression of each mRNA in isolation using a subset of the miRNA expression in a linear model with a fixed set of parameters. However, the same mRNAs (miRNAs) may interact with different sets of miRNAs (mRNAs) in different pathways. The exact number of pathways is unknown and may grow with an increase of size or quality of the training data. Thus, it is more natural to infer the number of common features shared among different groups of miRNAs and mRNAs. Accordingly, Le et al. (2011) proposed a powerful alternative model called GroupMiR (Group MiRNA target prediction) Le:2011wx . As an overview, GroupMiR first explored the latent binary features or memberships possessed within mRNAs, miRNAs, or shared between mRNAs and miRNAs on a potentially infinite binary feature space empowered by a nonparametric Bayesian (NBP) formalism. Thus, the number of features was inferred rather than determined arbitrarily. Importantly, the feature assignment took into account the prior information for miRNA and mRNA targeting relationships, obtained from sequence-based target prediction tools such as TargetScan or PicTar. Based on the shared memberships, mRNAs and/or miRNAs formed groups (or clubs). The same miRNAs (mRNAs) could possess multiple memberships and thus belong to multiple groups each corresponding to a latent feature. This was also biologically plausible since a miRNA (mRNA) may participate in several biological processes. Similar to GenMiR++ Huang:2007em

, GroupMiR then performed a Bayesian linear regression on each mRNA expression using

all miRNA expression but placing more weight on the expression of miRNAs that shared one or more common features with that mRNA.

Specifically, the framework of GroupMiR was based on a recently developed general nonparametric Bayesian prior called the Indian Buffet Process (IBP) Griffiths:2005tp (which was later on proved to be equivalent to Beta process Thibaux:2007wn ). As the name suggests, IBP can be understood from an analogy of a type of an ‘Indian buffet’ as follows. A finite number of customers or objects form a line to enter one after another a buffet comprised of dishes or features. Each customer samples dishes selected by previous customers, and Poisson() new dishes, where is a model parameter. The choices of customers on the dishes are expressed in an binary matrix . A left-order function maps a binary matrix to a left-ordered binary matrix with columns (i.e., dishes) sorted from left to right by decreasing order of and breaking ties in favour of customers who enter the buffet earlier. This process defines an exchangeable distribution on equivalence class comprising all of the that have the same left-ordered binary matrix regardless of the order the customers enter the buffet (i.e., row order) or the dish order (i.e., column order).

Before reviewing the IBP derivation, we need to establish some notations Griffiths:2005tp . () denotes the history of feature at object , which is encoded by a single decimal number. At object , for instance, a feature has one of four histories encoded by corresponding to all of the four possible permutations of choices for objects 1 and 2: . Accordingly, for objects, there are histories for each feature and histories excluding the history of all zeros (i.e., ). Additionally, denotes the number of features possessing the same history , for all features with , and for all features for which . Thus, . It is easy to see that binary matrices belong to an equivalence class if and only if they have the same history profile for each feature . The cardinality () of an equivalence class is the number of all of the binary matrices with the same history profile:


As shown below, Eq (33) is essential in order to establish the close-formed solution of IBP prior when leads to an infinite feature space or infinite number of columns in . After establishing the above properties, the central steps in deriving the IBP prior used in GroupMiR is reviewed below. I will focus on some steps neglected from the original work and refer the reader to the full derivation when appropriate. As in the original papers, we first derive IBP on a finite number of latent features and then take the limit making use of Eq (33).

Let be an binary matrix, where for mRNAs and miRNAs, and is the number of latent features. Assuming the binary value in for each feature is sampled from and are conditionally independent given

, the joint distribution of

is then:


where follows a Beta prior with :


where is a Beta function. To take into account the prior information between miRNA and mRNA targeting from sequence-based predictions, GroupMiR incorporated in Eq (35) an weight matrix :


where interaction within mRNAs and within miRNAs were set to zeros and interaction between mRNA and miRNA followed the scoring matrix obtained from a quantitative sequence-based predictions. In particular, MiRanda scores were used in their paper. Thus, is either 0 or defined as a pairwise potential of interactions between mRNA and miRNA . The modified was then defined as:


where and the partition function were defined as:


The marginal probability of is derived by integrating out as follows:


where in Eq (43) is the sum over all , and (45) directly follows the definition of Beta function. However, when since the probability of sampling a specific binary matrix from an infinite number of matrices is 0. Instead, the inference was performed over the equivalence class with the number of -equivalent matrices defined above:




A more elaborate derivation of (48) was described in the Appendix from Le:2011wx and omitted here. Additionally, the authors also showed that when or equivalently for all histories , then Eq (48) reduces to the original IBP introduced in Griffiths:2005tp , which is thus a special case of the weighted IBP in GroupMiR.

Given the IBP prior Eq (48), the generative process for corresponding to an existing feature (where ) was derived as follows: