Source materials for theses and selected study assignments
Understanding the functional organization of genetic information is a major challenge in modern biology. Following the initial publication of the human genome sequence in 2001, advances in high-throughput measurement technologies and efficient sharing of research material through community databases have opened new views on the study of living organisms and the structure of life. In this thesis, novel computational strategies have been developed to investigate a key functional layer of genetic information, the human transcriptome, which regulates the function of living cells through protein synthesis. The key contributions of the thesis are general exploratory tools for high-throughput data analysis that have provided new insights into cell-biological networks, cancer mechanisms and other aspects of genome function. A central challenge in functional genomics is that high-dimensional genomic observations are associated with high levels of complex and largely unknown sources of variation. By combining statistical evidence across multiple measurement sources and the wealth of background information in genomic data repositories, it has been possible to resolve some of the uncertainties associated with individual observations and to identify functional mechanisms that could not be detected based on individual measurement sources. Statistical learning and probabilistic models provide a natural framework for such modeling tasks. Open source implementations of the key methodological contributions have been released to facilitate further adoption of the developed methods by the research community.
This thesis consists of an overview and of the following publications which are referred to in the text by their Roman numerals.
Laura L. Elo, Leo Lahti, Heli Skottman, Minna Kyläniemi, Riitta Lahesmaa, and Tero Aittokallio. Integrating probe-level expression changes across generations of Affymetrix arrays. Nucleic Acids Research, 33(22):e193, 2005.
Leo Lahti, Laura L. Elo, Tero Aittokallio, and Samuel Kaski. Probabilistic analysis of probe reliability in differential gene expression studies with short oligonucleotide arrays. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 8(1):217–225, 2011.
Leo Lahti, Juha E.A. Knuuttila, and Samuel Kaski. Global modeling of transcriptional responses in interaction networks. Bioinformatics, 26(21):2713–2720, 2010.
Leo Lahti, Samuel Myllykangas, Sakari Knuutila, and Samuel Kaski. Dependency detection with similarity constraints. In Tülay Adali, Jocelyn Chanussot, Christian Jutten, and Jan Larsen, editors, Proceedings of the 2009 IEEE International Workshop on Machine Learning for Signal Processing XIX, pages 89–94. IEEE, Piscataway, NJ, 2009.
Janne Sinkkonen, Janne Nikkilä, Leo Lahti, and Samuel Kaski. Associative clustering. In Boulicaut, Esposito, Giannotti, and Pedreschi, editors, Machine Learning: ECML 2004 (Proceedings of the 15th European Conference on Machine Learning), Lecture Notes in Computer Science 3201, pages 396–406. Springer, Berlin, 2004.
Samuel Kaski, Janne Nikkilä, Janne Sinkkonen, Leo Lahti, Juha E.A. Knuuttila, and Cristophe Roos. Associative clustering for exploring dependencies between functional genomics data sets. IEEE/ACM Transactions on Computational Biology and Bioinformatics: Special Issue on Machine Learning for Bioinformatics – Part 2, 2(3):203–216, 2005.
The publications in this thesis have been a joint effort of all authors; key contributions by the author of this thesis are summarized below.
Publication 1 introduces a novel analysis strategy to improve the accuracy and reproducibility of the measurements in genome-wide transcriptional profiling studies. A central part of the approach is the utilization of side information in external genome sequence databases. The author participated in the design of the study, suggested the utilization of external sequence data and implemented it, and participated in preparing the manuscript.
Publication 2 provides a probabilistic framework for probe-level gene expression analysis. The model combines statistical power across multiple microarray experiments, and is shown to outperform widely-used preprocessing methods in differential gene expression analysis. The model provides tools to assess probe performance, which can potentially help to improve probe and microarray design. The author had a major role in designing the study. The author derived the formulation, implemented the model, performed the probe-level experiments, and coordinated the manuscript preparation. The author prepared an accompanying open source implementation which has been published in BioConductor, a reviewed open source repository for computational biology algorithms.
Publication 3 introduces a novel approach for organism-wide modeling of transcriptional activity in genome-wide interaction networks. The method provides tools to analyze large collections of genome-wide transcriptional profiling data. The author had a major role in designing the study. The author implemented the algorithm, performed the experiments, and coordinated the manuscript preparation. The author participated in and supervised the preparation of an accompanying open source implementation in BioConductor.
Publication 4 introduces a regularized dependency modeling framework with particular applications in cancer genomics. The author had a major role in formulating the biomedical modeling task, and in designing the study. The theoretical model was jointly developed by the author and S. Kaski. The author derived and implemented the model, carried out the experiments, and coordinated the manuscript preparation. The author supervised and participated in the preparation of an accompanying open source implementation in BioConductor.
Publication 5 introduces the associative clustering principle, which is a novel data integration framework for dependency detection with direct applications in functional genomics. The author participated in implementation of the method, had the main responsibility in designing and performing the functional genomics experiments, as well as participated in preparing the manuscript.
Publication 6 contains the most extensive treatment of the associative clustering principle. In addition to presenting detailed theoretical considerations, this work introduces new sensitivity analysis of the results, and provides a comprehensive validation in bioinformatics case studies. The author participated in designing the experiments, performed the comparative functional genomics experiments and technical validation, as well as participated in preparing the manuscript.
In this thesis boldface symbols are used to denote matrices and vectors. Capital boldface symbols (X) signify matrices and lowercase boldface symbols (x) column vectors. Normal lowercase symbols (x) indicate scalar variables.
X, Y  Data matrices
x, y  Data samples, vectors in ℝ^d
x, y  Scalars in ℝ
p(x)  Probability or probability density of x
||·||  Norm of a matrix or vector
I(X; Y)  Mutual information between random variables X and Y
Beta(α, β)  Beta distribution with parameters α and β
Dir(α)  Dirichlet distribution with parameter vector α
Inv-Gamma(α, β)  Inverse Gamma distribution with parameters α and β
EM  Expectation–Maximization algorithm
KL-divergence  Kullback–Leibler divergence
MCMC  Markov chain Monte Carlo
This thesis introduces computational strategies for genome- and organism-wide analysis of the human transcriptome. The thesis provides novel tools (i) to increase the reliability of high-throughput microarray measurements by combining statistical evidence from genome sequence databases and across multiple microarray experiments, (ii) to model context-specific transcriptional activation patterns of genome-scale interaction networks across the normal human body by using background information of genetic interactions to guide the analysis, and (iii) to integrate measurements of the human transcriptome with other layers of genomic information through novel dependency modeling techniques for co-occurring data sources. The three strategies address widely recognized challenges in functional genomics (Collins et al., 2003; Troyanskaya, 2005).
Obtaining reliable measurements is the crucial starting point for any data analysis task. The first contribution of this thesis is to develop computational strategies that utilize side information in genomic sequence and microarray data collections in order to reduce noise and improve the quality of high-throughput observations. Publication 1 introduces a probe-level strategy for microarray preprocessing, where updated genomic sequence databases are used to remove erroneously targeted probes and thereby reduce measurement noise. The work is extended in Publication 2, which introduces a principled probabilistic framework for probe-level analysis. A generative model for probe-level observations combines evidence across multiple experiments, and allows the estimation of probe performance directly from microarray measurements. The model detects a large number of unreliable probes contaminated by known probe-level error sources, as well as many poorly performing probes where the source of contamination is unknown and could not be controlled based on existing probe-level information. The model provides a principled framework to incorporate prior information of probe performance. The introduced algorithms outperform widely used alternatives in differential gene expression studies.
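The generative model itself is beyond the scope of a short example, but the key intuition, that probes whose signals vary inconsistently across experiments should contribute less to the gene-level estimate, can be sketched with a simple precision-weighted summarization. This is an illustrative simplification with made-up numbers, not the published model:

```python
import statistics

def probe_weights(probe_signals):
    """Estimate per-probe reliability weights from variance across arrays.

    probe_signals: list of lists, one inner list per probe holding that
    probe's (log-scale) signal across a set of arrays.
    Returns weights proportional to inverse variance (precision).
    """
    precisions = [1.0 / max(statistics.variance(s), 1e-9) for s in probe_signals]
    total = sum(precisions)
    return [p / total for p in precisions]

def summarize(probe_signals):
    """Precision-weighted gene-level expression estimate for each array."""
    w = probe_weights(probe_signals)
    n_arrays = len(probe_signals[0])
    return [sum(w[j] * probe_signals[j][i] for j in range(len(w)))
            for i in range(n_arrays)]

# Two concordant probes and one erratic probe across three arrays;
# the erratic probe receives the smallest weight.
signals = [[5.0, 6.0, 7.0], [5.1, 6.1, 7.1], [2.0, 9.0, 4.0]]
w = probe_weights(signals)
print(w)
print(summarize(signals))
```

In the published model the probe-specific noise variances are latent variables inferred jointly with the true expression signal, rather than plug-in sample variances as above.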
A novel strategy for organism-wide analysis of transcriptional activity in genome-scale interaction networks in Publication 3 forms the second main contribution of this thesis. The method searches for local regions in a network exhibiting coordinated transcriptional response in a subset of conditions. Constraints derived from genomic interaction databases are used to focus the modeling on those parts of the data that are supported by known or potential interactions between the genes. Nonparametric inference is used to detect a number of physiologically coherent and reproducible transcriptional responses, as well as context-specific regulation of the genes. The findings provide a global view on transcriptional activity in cell-biological networks and functional relatedness between tissues.
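The published method is based on nonparametric probabilistic inference; as a much simpler illustration of how interaction constraints focus the analysis on linked genes, one can screen a known interaction network for edges whose endpoint genes are co-expressed across the measured conditions. The gene names, expression profiles, and threshold below are hypothetical:

```python
import math
import statistics

def pearson(x, y):
    """Pearson correlation between two equal-length profiles."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def coherent_edges(expr, edges, threshold=0.8):
    """Keep only network edges whose endpoint genes show coordinated
    expression; a crude stand-in for network-constrained modeling."""
    return [(g1, g2) for g1, g2 in edges
            if pearson(expr[g1], expr[g2]) >= threshold]

# Hypothetical expression profiles across five conditions and a toy
# interaction network; only the co-regulated pair survives the filter.
expr = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [1.1, 2.2, 2.9, 4.1, 5.2],
    "geneC": [5.0, 1.0, 4.0, 2.0, 3.0],
}
edges = [("geneA", "geneB"), ("geneA", "geneC")]
print(coherent_edges(expr, edges))  # -> [('geneA', 'geneB')]
```

The actual method goes further: it models subsets of conditions in which a network region responds coherently, rather than requiring correlation across all conditions.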
The third contribution of the thesis is to integrate measurements of the human transcriptome with other layers of genomic information. Novel dependency modeling techniques for co-occurrence data are used to reveal regularities and interactions which could not be detected in individual observations. The regularized dependency modeling framework of Publication 4 is used to detect associations between chromosomal mutations and transcriptional activity. Prior biological knowledge is used to constrain the latent variable model, and is shown to improve cancer gene detection performance. Associative clustering, introduced in Publications 5 and 6, provides tools to investigate evolutionary divergence of transcriptional activity.
Open source implementations of the key methodological contributions of this thesis have been released in order to guarantee wide access to the developed algorithmic tools and to comply with the emerging standards of transparency and reproducibility in computational science, where an increasing proportion of research details are embedded in code and data accompanying traditional publications (Boulesteix, 2010; Carey and Stodden, 2010; Ioannidis et al., 2009) and transparent sharing of these resources can form valuable contributions to public knowledge (Sommer, 2010; Sonnenburg et al., 2007; Stodden, 2010).
The thesis is organized as follows: Chapter 2 gives an overview of functional genomics, related measurement techniques, and genomic data resources. General methodological background, in particular on exploratory data analysis and the probabilistic modeling paradigm, is provided in Chapter 3. The methodological contributions of the thesis are presented in Chapters 4–6. Chapter 4 presents strategies to improve the reliability of high-throughput microarray measurements. Chapter 5 considers methods for organism-wide analysis of the transcriptome. Chapter 6 introduces two general-purpose algorithms for dependency modeling and applies them to investigating the functional effects of chromosomal mutations and the evolutionary divergence of transcriptional activity. The conclusions of the thesis are summarized in Chapter 7.
Cells are the fundamental building blocks of living organisms. All known life forms maintain a carbon-based cellular form that carries the genetic program (Alberts et al., 2002). Each cell carries a copy of the heritable genetic code, the genome. The human genome is divided into 23 pairs of chromosomes, located in the nucleus of the cell, as well as an additional mitochondrial genome. Chromosomes are long deoxyribonucleic acid (DNA) molecules in which the DNA is wrapped around histone molecules and packed into a compact chromatin structure. The genetic code in the DNA consists of four nucleotides: adenine (A), thymine (T), guanine (G), and cytosine (C). In ribonucleic acid (RNA), thymine is replaced by uracil (U). The ordering of the nucleotides carries the genetic information. Nucleic acid sequences have a specific base pairing property, where only A-T/U and G-C pairs can hybridize with each other. This leads to the well-known double-stranded structure of DNA, and forms the basis for cellular information processing. The central dogma of molecular biology (Crick, 1970) states that DNA encodes the information to construct proteins through the irreversible process of protein synthesis. This is a central paradigm in molecular biology, describing the functional organization of life at the cellular level.
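The complementary base pairing described above (A with T, G with C) determines the sequence of the opposite DNA strand. It can be illustrated with a short sketch that computes the reverse complement of a strand, using a toy string representation rather than a bioinformatics library:

```python
# Complementary base pairs in DNA: A<->T and G<->C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_complement(strand):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(strand))

print(reverse_complement("ATGC"))  # -> GCAT
```

Applying the function twice recovers the original sequence, reflecting the symmetry of the double-stranded structure.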
Genes are the basic units of genetic information. A gene is a sequence of DNA that contains the information to manufacture a protein or a set of related proteins. Genetic variation and the regulation of gene activity therefore have major phenotypic consequences. The regulatory region and the coding sequence are two key elements of a gene. The regulatory region controls gene activity, while the coding sequence carries the instructions for protein synthesis (Alberts et al., 2002). Interestingly, the concept of a gene remains controversial despite comprehensive identification of the protein-coding genes in the human genome and detailed knowledge of their structure and function (Pearson, 2006).
Proteins, encoded by the genes, are the key functional entities in the cell. They form cellular structures, and participate in cell signaling and functional regulation. Protein synthesis refers to the cell-biological process that converts genetic information into final functional protein products (Figure 2.1A). Key steps in protein synthesis include transcription, pre-mRNA splicing, and translation. In transcription, the double-stranded DNA is opened in the proximity of the gene sequence and the process is initiated on the regulatory region of the gene. The DNA sequence of the gene is then converted into a complementary pre-mRNA by a polymerase enzyme. The pre-mRNA sequence contains both protein-coding and non-coding segments, called exons and introns, respectively. In pre-mRNA splicing, the introns are removed and the exons are joined together to form mature messenger RNA (mRNA). A gene can encode multiple splice variants, corresponding to different exon definitions and their combinations; this is called alternative splicing. The mature mRNA is exported from the nucleus to the cell cytoplasm. In translation, the mRNA is converted into a corresponding amino acid sequence in the ribosomes, based on the universal genetic code that defines a mapping between nucleic acid triplets, so-called codons, and amino acids. The code is common to all known life forms. Each consecutive codon on the mRNA sequence corresponds to an amino acid, and the resulting sequence of amino acids constitutes a protein. In the final stage of protein synthesis, the amino acid sequence folds into a three-dimensional structure and undergoes post-translational modifications. The structural characteristics of a protein molecule ultimately determine its functional properties (Alberts et al., 2002).
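Translation, the codon-by-codon mapping from mRNA to amino acids, can be sketched with a toy translator. The codon table below is a deliberately tiny fragment of the universal genetic code, included only for illustration:

```python
# A small, illustrative fragment of the universal genetic code
# (codon -> one-letter amino acid; '*' marks a stop codon).
CODON_TABLE = {
    "AUG": "M",                          # methionine, the usual start codon
    "UUU": "F", "UUC": "F",              # phenylalanine
    "GGU": "G", "GGC": "G",              # glycine
    "UAA": "*", "UAG": "*", "UGA": "*",  # stop codons
}

def translate(mrna):
    """Translate an mRNA sequence codon by codon until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

print(translate("AUGUUUGGUUAA"))  # -> MFG
```

A real translator would use the full 64-codon table and locate the open reading frame first; the sketch assumes translation starts at position zero.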
Phenotypic changes can rarely be attributed to changes in individual genes; cell function is ultimately determined by the coordinated activation of genes and other biomolecular entities in response to changes in the cell-biological environment (Hartwell et al., 1999). Gene activity is regulated at all levels of protein synthesis and cellular processes. A major portion of the functional genome sequence, and many protein-coding genes themselves, participate in this regulatory system (Lauffenburger, 2000).
Epigenetic regulation refers to chemical and structural modifications of chromosomal DNA, the chromatin, for instance through methylation, acetylation, and other histone-binding molecules. Such modifications affect the packing of the DNA molecule around histones in the cell nucleus, and their combinatorial regulation controls access to the gene sequences (Gibney and Nolan, 2010). Epigenetic changes are believed to be heritable and they constitute a major source of variation at the individual and population level (Johnson and Tricker, 2010). Transcriptional regulation is the next major regulatory layer in protein synthesis. So-called transcription factor proteins can regulate the transcription rate by binding to control elements in the gene regulatory region in a combinatorial fashion. Post-transcriptional modifications then regulate pre-mRNA splicing. Up to 95% of human multi-exon genes are estimated to have alternative splice variants (Pan et al., 2008). Consequently, a variety of related proteins can be encoded by a single gene, which contributes to the structural and functional diversity of cell function (Stetefeld and Ruegg, 2005). Several mechanisms then affect mRNA degradation rates. For instance, micro-RNAs, small 21-25-nucleotide sequences, can inactivate specific mRNA transcripts through complementary base pairing, leading to mRNA degradation or prevention of translation. Finally, post-translational modifications, protein degradation, and other mechanisms affect the three-dimensional structure and life cycle of a protein. The proteins participate in further cell-biological processes, which are in continuous interaction and form complex functional networks that regulate the life processes of an organism (Alberts et al., 2002).
Understanding of the structure and functional organization of the genome is rapidly accumulating with developing genome-scanning technologies and computational methods. This section provides an overview of the key structural and functional layers of the human genome.
The genome is a dynamic structure, organized and regulated at multiple levels of resolution from individual nucleotide base pairs to complete chromosomes (Figure 2.1B; Brown (2006)). A major portion of heritable variation between individuals has been attributed to differences in the genomic DNA sequence. Traditionally, the main genetic variation was believed to arise from small point mutations, so-called single-nucleotide polymorphisms (SNPs), in protein-coding DNA. It has since been increasingly recognized that structural variation of the genome makes a remarkable contribution to genetic variation. Structural variation is observed at all levels of organization from single-nucleotide polymorphisms to large chromosomal rearrangements, including deletions, insertions, duplications, copy-number variants, inversions and translocations of genomic regions (Feuk et al., 2006; Sharp et al., 2006). Such modifications can directly and indirectly influence transcriptional activity and contribute to human diversity and health (Collins et al., 2003; Hurles et al., 2008).
The draft DNA sequence of the complete human genome was published in 2001 (International Human Genome Sequencing Consortium, 2001; Venter et al., 2001). The human genome contains three billion base pairs and approximately 20,000-25,000 protein-coding genes (International Human Genome Sequencing Consortium, 2004). The protein-coding exons comprise less than 1.5% of the human genome sequence. Approximately 5% of the human genome sequence has been conserved in evolution for more than 200 million years, including the majority of protein-coding genes (The ENCODE Project Consortium, 2007; Mouse Genome Sequencing Consortium, 2002). Half of the genome consists of highly repetitive sequences. The genome sequence contains structural elements such as centromeres and telomeres, repetitive and mobile elements (Prak and Kazazian Jr., 2000), retroelements (Bannert and Kurth, 2004), and non-coding, non-repetitive DNA (Collins et al., 2003). The functional role of intergenic DNA, which forms 75% of the genome, is to a large extent unknown (Venter et al., 2001). Recent evidence suggests that the three-dimensional organization of the chromosomes, which is to a large extent regulated by the intergenic DNA and appears to be under active selection, can have a remarkable regulatory role (Lieberman-Aiden et al., 2009; Parker et al., 2009). Comparison of the human genome with other organisms, such as the mouse (Mouse Genome Sequencing Consortium, 2002), can highlight important evolutionary differences between species. For a comprehensive review of the structural properties of the human genome, see Brown (2006).
In protein synthesis, the gene sequence is transcribed into pre-mRNA, which is then further modified into mature messenger RNA and transported to the cytoplasm. An average cell contains over 300,000 mRNA molecules, and the mRNA concentrations, or expression levels, of individual genes vary according to Zipf's law, a power-law distribution where most genes are expressed at low concentrations, perhaps only one or a few copies of the mRNA per cell on average, and a small number of genes are highly expressed, potentially with thousands of copies per cell (see Carninci, 2009; Furusawa and Kaneko, 2003). Cell-biological processes are reflected at the transcriptional level. Transcriptional activity varies by cell type, environmental conditions and time, and different collections of genes are active in different contexts. Gene expression, or mRNA expression, refers to the expression level of an mRNA transcript at a particular physiological condition and time point. In addition to the protein-coding mRNA molecules that are the main target of analysis in this thesis, the cell contains a variety of other functional and non-functional RNA transcripts, for instance micro-RNAs, ribosomal RNA and transfer-RNA molecules (Carninci, 2009; Johnson et al., 2005).
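The Zipf-law rank-abundance relation described above, where the k-th most abundant transcript has an expression level roughly proportional to 1/k, can be sketched numerically. The gene count and mRNA pool size below are illustrative round numbers:

```python
def zipf_abundances(n_genes, total_mrna=300_000):
    """Allocate a cellular mRNA pool across genes under Zipf's law:
    the k-th most expressed gene gets a share proportional to 1/k."""
    shares = [1.0 / k for k in range(1, n_genes + 1)]
    norm = sum(shares)
    return [total_mrna * s / norm for s in shares]

counts = zipf_abundances(10_000)
top_100 = sum(counts[:100])
print(f"top 1% of genes hold {100 * top_100 / 300_000:.0f}% of the mRNA pool")
```

Under this toy allocation roughly half of the mRNA pool is concentrated in the top 1% of genes, matching the qualitative picture of a few highly expressed genes dominating the transcriptome.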
The transcriptome refers to the complete collection of mRNA sequences of an organism. This is a central functional layer of the genome that regulates protein production in the cells, with a significant role in creating genetic variation (Jordan et al., 2005). According to current estimates, up to 90% of the eukaryotic genome can be transcribed (Consortium, 2005; Gagneur et al., 2009). The protein-coding mRNA transcripts are translated into proteins at ribosomes during protein synthesis.
The proteome refers to the collection of protein products of an organism. The proteome is a main functional layer of the genome. Since the final protein products carry out a major portion of the actual cell functions, techniques for monitoring the concentrations of all proteins and their modified forms in a cell simultaneously would significantly help to improve the understanding of cellular systems (Collins et al., 2003). However, sensitive, reliable and cost-efficient genome-wide screening techniques for measuring protein expression are currently not available. Therefore genome-wide measurements of mRNA expression levels are often used as an indirect estimate of protein activity.
In addition to the DNA, RNA and proteins, the cell contains a variety of other small molecules. The extreme functional diversity of living organisms emerges from the complex network of interactions between these biomolecular entities (Barabási and Oltvai, 2004; Hartwell et al., 1999). Understanding these networks and their functional properties is crucial for understanding cell function (Collins et al., 2003; Schadt, 2009). However, the systemic properties of the interactome are poorly characterized and understood, due to the complexity of biological phenomena and incomplete information concerning the interactions. Cell-biological processes are inherently modular (Hartwell et al., 1999; Ihmels et al., 2002; Lauffenburger, 2000), and they exhibit complex cross-talk between pathways (Li et al., 2008). In modular systems, small changes can have significant regulatory effects (Espinosa-Soto and Wagner, 2010).
Systematic observations from the various functional and regulatory layers of the genome are needed to understand cell-biological systems. Efficient sharing and integration of genomic information resources through digital media has enabled large-scale investigations that no single institution could afford. The public human genome sequencing project (International Human Genome Sequencing Consortium, 2001) is a prime example of such a project. Results from genome-wide transcriptional profiling studies are routinely deposited to public repositories (Barrett et al., 2009; Parkinson et al., 2009). Sharing of original data is increasingly accepted as the scientific norm, often following explicit data release policies. The establishment of large-scale databases and standards for representing biological information supports the efficient use of these resources (Bammler et al., 2005; Brazma et al., 2006). A continuously increasing array of genomic information is available in these databases, concerning aspects of genomic variability across individuals, disease states, and species (Brent, 2008; Church, 2005; Cochrane and Galperin, 2010; G10KCOS consortium, 2009; The Cancer Genome Atlas Research Network, 2008).
During the human genome project and preceding sequencing projects DNA sequence reads were among the first sources of biological data that were collected in large-scale public repositories, such as GenBank (Benson et al., 2010). GenBank contains comprehensive sequence information of genomic DNA and RNA for a number of organisms, as well as a variety of information concerning the genes, non-coding regions, disease associations, variation and other genomic features. Online analysis tools, such as the Ensembl Genome browser (Flicek et al., 2010), facilitate efficient use of these annotation resources. Next-generation sequencing technologies provide rapidly increasing sequencing capacity to investigate sequence variation between individuals, populations and disease states (Ledford, 2010; McPherson, 2009). In particular, the human and mouse transcriptome sequence collections at the Entrez Nucleotide database of GenBank are utilized in this thesis, in Publications 1 and 2.
Gene expression measurement provides a snapshot of mRNA transcript levels in a cell population at a specific time and condition, reflecting the activation patterns of the various cell-biological processes. While gene expression measurements provide only an indirect view of cellular processes, their wide availability provides a unique resource for investigating gene co-regulation on a genome- and organism-wide scale. Versatile collections of microarray data in public repositories, such as the Gene Expression Omnibus (GEO; Barrett et al. (2009)) and ArrayExpress (Parkinson et al., 2009), are available for human and model organisms, and they contain valuable information of cell function (Consortium, 2005; DeRisi et al., 1997; Russ and Futschik, 2010; Zhang et al., 2004).
Several techniques are available for quantitative and highly parallel measurement of mRNA expression, allowing the expression levels of tens of thousands of mRNA transcripts to be measured simultaneously (Bradford et al., 2010). Microarrays are routinely used for this purpose, and transcriptional profiling is currently the main high-throughput technique used to investigate gene function at genome- and organism-wide scale (Gershon, 2005; Yauk et al., 2004). Increasing amounts of transcriptional profiling data are also being produced by sequencing-based methods (Carninci, 2009). A main difference between the microarray- and sequencing-based techniques is that gene expression arrays are designed to measure predefined mRNA transcripts, whereas sequencing-based methods do not require prior information of the measured sequences, and enable de novo discovery of expressed transcripts (Bradford et al., 2010; 't Hoen et al., 2008). Large-scale microarray repositories currently provide the most mature tools for data processing and retrieval, and form the main source of transcriptome data in this thesis.
Microarray technology is based on the base pairing property of nucleic acid sequences, where the DNA or RNA sequences in a sample bind to complementary nucleotide sequences on the array; this is called hybridization. The measurement process begins with the collection of cell samples and isolation of the sample mRNA. The isolated mRNA is converted to cDNA, labeled with specific marker molecules, and hybridized on complementary probe sequences on the array. The array surface may contain hundreds of thousands of spots, each containing specific probe sequences designed to uniquely match particular mRNA sequences. The hybridization level reflects the target mRNA concentration in the sample, and it is estimated by measuring the intensity of light emitted by the label molecules with a laser scanner. Short oligonucleotide arrays (Lockhart et al., 1996) are among the most widely used microarray technologies, and they are the main source of mRNA expression data in this thesis. Short oligonucleotide arrays utilize multiple, typically 10-20, probes for each transcript target, which bind to different regions of the same transcript sequence. The use of several 25-nucleotide probes for each target leads to more robust estimates of transcript activity. Each probe is expected to hybridize uniquely with its intended target, and the detected hybridization level is used as a measure of the activity of the transcript. A short oligonucleotide array measures absolute expression levels of the mRNA sequences; relative differences between conditions can be investigated afterwards by comparing these measurements. A standard whole-genome array typically measures 20,000-50,000 unique transcript sequences, so a single microarray experiment can produce hundreds of thousands of raw observations.
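Since each transcript is targeted by 10-20 probes, the probe-level intensities must be summarized into a single expression value per transcript and array. The widely used preprocessing methods (such as RMA) are considerably more involved, but a minimal sketch of the core idea, a robust log-scale median across the probes of one probe set, looks like this. The intensity values are hypothetical:

```python
import math
import statistics

def probeset_expression(intensities):
    """Summarize one probe set: log2-transform the probe intensities and
    take the median across the probes targeting the same transcript."""
    logs = [math.log2(v) for v in intensities]
    return statistics.median(logs)

# Hypothetical raw intensities for one transcript on one array;
# one probe cross-hybridizes and reports a spuriously high signal,
# but the median is barely affected by the outlier.
probes = [210, 190, 205, 198, 2500, 201, 195, 208]
print(round(probeset_expression(probes), 2))
```

The robustness to a single aberrant probe is exactly why multiple probes per target are used; Publication 2 replaces the fixed median with probe-specific reliability estimates learned across experiments.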
Comparison and integration of individual microarray experiments is often challenging due to remarkable experimental variation between the experiments. Common standards have been developed to advance the comparison and integration (Brazma et al., 2001, 2006). Carefully controlled integrative datasets, so-called gene expression atlases, contain thousands of genome-wide measurements of transcriptional activity across diverse conditions in a directly comparable format. Examples of such data collections include GeneSapiens (Kilpinen et al., 2008), the human gene expression atlas of the European Bioinformatics Institute (Lukk et al., 2010), as well as the NCI-60 cell line panel (Scherf et al., 2000). Integrative analysis of large and versatile transcriptome collections can provide a holistic view of transcriptional activity of the various cell-biological processes, and opens up possibilities to discover previously uncharacterized cellular mechanisms that contribute to human health and disease.
Microarray techniques can also be used to study other functional aspects of the genome, including epigenetics and micro-RNA regulation, chromosomal aberrations and polymorphisms, alternative splicing, as well as transcription factor binding (Butte, 2002; Hoheisel, 2006). For instance, chromosomal aberrations can be measured with the array comparative genome hybridization method (aCGH; Pinkel and Albertson 2005), which is based on hybridization of DNA sequences on the array surface. Copy number changes are a particular type of chromosomal aberration and a major mechanism of cancer development and progression. Copy number alterations can cause changes in gene and micro-RNA expression, and ultimately in cell-biological processes (Beroukhim et al., 2010). A public repository of copy number measurement data is provided for instance by the CanGEM database (Scheinin et al., 2008). In Publication 4, microarray measurements of DNA copy number changes are integrated with transcriptional profiling data to discover potential cancer genes for further biomedical analysis.
Curated information concerning cell-biological processes is valuable in both experimental design and validation of computational studies (Blake, 2004). Representation of dynamic biochemical reactions in their full richness is a challenging task that goes beyond a mere listing of biochemical events; a variety of proteins and other compounds interact in a hierarchical manner through various molecular mechanisms (Hartwell et al., 1999; Przytycka et al., 2010). Standardized database formats such as BioPAX (BioPAX workgroup, 2005) and SBML (Strömbäck and Lambrix, 2005) advance the accumulation of highly structured biological knowledge and the automated analysis of such data. A large body of information concerning cell-biological processes is available in public repositories. The most widely used annotation resources include the Gene Ontology (GO) database (Ashburner et al., 2000) and the KEGG pathway database (Kanehisa et al., 2010). The GO database provides functional annotations for genes and can be used for instance to detect enrichment of certain functional categories among the key findings from computational analysis, as in Publication 6, where enrichment analysis is used for both validation and interpretation purposes. Pathways are more structured representations of cellular processes and interactions between molecular entities. Such prior information can be used to guide computational modeling, as in Publication 3, where pathway information derived from the KEGG pathway database is used to guide organism-wide discovery and analysis of transcriptional response patterns.
The collective knowledge about genome organization and function is constantly updated and refined by improved measurement techniques and accumulation of data (Sebat, 2007). This can alter the analysis and interpretation of results from large-scale genomic screens. For instance, evolving gene and transcript definitions are known to significantly affect microarray interpretation. Probe design in microarray technology relies on sequence annotations that may have changed significantly after the original array design. Reinterpretation of microarray data based on updated probe annotations has been shown to improve the accuracy and comparability of microarray results (Dai et al., 2005; Hwang et al., 2004; Mecham et al., 2004b). Bioinformatics studies routinely take updated genome versions and builds into account in new analyses. The constantly refined state of biological knowledge highlights the need to account for this uncertainty in computational analyses. In Publications 1 and 2, explicit computational strategies that are robust against evolving transcript definitions are developed for microarray data analysis.
High-throughput genetic screens are inherently noisy. Controlling all potential sources of variation in the measurement process is increasingly difficult when automated measurement techniques can produce millions of data points in a single experiment, concerning extremely complex living systems that are to a large extent poorly understood.
Noise arises from both technical and biological sources (Butte, 2002), and systematic variation between laboratories, measurement batches and measurement platforms has to be taken into account when combining the results across individual studies (Heber and Sick, 2006; MAQC Consortium, 2006). Moreover, genomic knowledge is constantly evolving, which can potentially change the interpretation of previous experiments (see e.g. Dai et al., 2005). The various sources of noise and uncertainty in microarray studies are discussed in more detail in Chapter 4.
High dimensionality of the data and small sample size form another challenge for the analysis of high-throughput functional genomics data. Tens of thousands of transcripts can be measured simultaneously in a single microarray experiment, which greatly exceeds the number of available samples in most biomedical studies. Small sample sizes leave considerable uncertainty in the analyses; few observations contain very limited information concerning the complex and high-dimensional phenomena and the potential interactions between different parts of the system. Overfitting of the models and the problem of multiple testing form considerable challenges in such situations. While automated analysis methods can generate thousands of hypotheses concerning the system, prioritizing the findings and characterizing uncertainty in the predictions become central issues in the analysis. The curse of dimensionality, coupled with the high levels of noise in functional genomics studies, therefore poses particular challenges for computational modeling (Saeys et al., 2007).
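To make the multiple testing problem concrete, the following sketch implements the standard Benjamini-Hochberg procedure for controlling the false discovery rate across thousands of simultaneous tests; the function name and example p-values are illustrative only, not part of the thesis software.

```python
def benjamini_hochberg(pvalues, fdr=0.05):
    """Return a boolean 'reject' list controlling the false discovery rate."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    reject = [False] * m
    # Find the largest rank whose sorted p-value satisfies p <= rank/m * fdr.
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank / m * fdr:
            k = rank
    # Reject all hypotheses up to and including that rank.
    for rank, i in enumerate(order, start=1):
        if rank <= k:
            reject[i] = True
    return reject

# Six hypothetical transcript-level p-values from a differential expression screen.
pvals = [0.001, 0.008, 0.039, 0.041, 0.27, 0.62]
print(benjamini_hochberg(pvals))  # → [True, True, False, False, False, False]
```

Only the two smallest p-values survive correction here, illustrating how the procedure prioritizes findings when many hypotheses are tested at once.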
The challenges in controlling the various sources of uncertainty have led to remarkable problems in reproducing microarray results (Ioannidis et al., 2009), but maturing technology and the development of common standards and analytical procedures are constantly improving the reliability of high-throughput screens (Allison et al., 2006; Reimers, 2010; MAQC Consortium, 2006). The models developed in this thesis combine statistical evidence across related experiments to improve the reliability of the analysis and to increase modeling power. Generative probabilistic models provide a rigorous framework for handling noise and uncertainty in the data and models.
Genomic variation between individuals makes a considerable and to a large extent unknown contribution to health and disease susceptibility. Large-scale characterization of the variability between individuals and populations is expected to elucidate genomic mechanisms associated with disease, as well as to lead to the discovery of novel medical treatments. High-throughput genomics can provide new tools to understand disease mechanisms (Braga-Neto and Marques, 2006; Lage et al., 2008), to ’hack the genome’ (Evanko, 2006) to treat diseases (Volinia et al., 2010), and to guide personalized therapies that take into account the individual variability in sensitivity and responses to treatments (Church, 2005; Downward, 2006; Foekens et al., 2008; Ocana and Pandiella, 2010; van ’t Veer and Bernards, 2008). Disease signatures are potentially robust across tissues and experiments (Dudley et al., 2009; Hu et al., 2006). Genomic screens have revealed new disease subtypes (Bhattacharjee et al., 2001), and led to the discovery of various diagnostic (Lee et al., 2008; Su et al., 2009; Tibshirani et al., 2002) and prognostic (Beer et al., 2002) biomarkers. Diseases cause coordinated changes in gene activity through biomolecular networks (Cabusora et al., 2005). Integration of chemical, genomic and pharmacological functional genomics data can also help to predict new drug targets and responses (Lamb et al., 2006; Yamanishi et al., 2010). Genomic mutations can also affect genome function and cause diseases (Taylor et al., 2008). Cancer is an example of a prevalent genomic disease. Boveri (1914) discovered that cancer cells have chromosomal imbalances, and since then the understanding of genomic changes associated with cancer has continuously improved (Stratton et al., 2009; Wunderlich, 2007). For instance, many human micro-RNA genes are located at cancer-associated genomic regions and are functionally altered in cancers (see Calin and Croce, 2006).
Genomic changes also affect transcriptional activity of the genes (Myllykangas et al., 2008). Publication 4 introduces a novel computational approach for screening cancer-associated DNA mutations with functional implications by genome-wide integration of chromosomal aberrations and transcriptional activity.
This chapter has provided an overview of central modeling challenges and research topics in functional genomics. In the following chapters, particular methodological approaches are introduced to solve research tasks in large-scale analysis of the human transcriptome. In particular, methods are introduced to increase the reliability of high-throughput measurements, to model large-scale collections of transcriptome data and to integrate transcriptional profiling data with other layers of genomic information. The next chapter provides general methodological background for these studies.
Understanding requires generalization beyond particular observations. While empirical observations contain information about the underlying process that generated the data, a major challenge in computational modeling is that empirical data is always finite and contains only limited information about the system. Traditional statistical models are based on careful hypothesis formulation and systematic collection of data to support or reject a given hypothesis. However, successful hypothesis formulation may require substantial prior knowledge. When minimal knowledge of the system is available, there is a need for exploratory methods that can recognize complex patterns and extract features from empirical data in an automated way (Baldi and Brunak, 1999). This is a central challenge in computational biology, where the investigated systems are extremely complex and contain large amounts of poorly characterized and uncontrolled sources of variation. Moreover, the data on genomic systems is often very limited and incomplete. General-purpose algorithms that can learn relevant features from the data with minimal assumptions are therefore needed, and they provide valuable tools in functional genomics studies. Classical examples of such exploratory methods include clustering, classification and visualization techniques. The extracted features can provide hypotheses for more detailed experimental testing and reveal new, unexpected findings. In this work, general-purpose exploratory tools are developed for central modeling tasks in functional genomics.
Let us start by defining some of the basic concepts and terminology. Data set in this thesis refers to a finite collection of observations, or samples. In experimental studies, as in biology, a sample typically refers to the particular object of study, for instance a patient or a tissue sample. In computational studies, sample refers to a numerical observation, or a subset of observations, represented by a numerical feature vector. Each element of the feature vector describes a particular feature of the observation. Given $p$ features and $n$ samples, the data set is presented as a matrix $\mathbf{X} \in \mathbb{R}^{p \times n}$, where each column vector represents a sample and each row corresponds to a particular feature. The features can represent for instance different experimental conditions, time points, or particular summaries of the observations. This is the general structure of the observations investigated in this work.
The observations are modeled in terms of probability densities; the samples are modeled as independent instances of a random variable. A central modeling task is to characterize the underlying probability density of the observations, $p(\mathbf{x})$. This defines a topology in the sample space and provides the basis for generalization beyond empirical observations. As explained in more detail in Section 3.2, the models are formulated in terms of observations $\mathbf{X}$, model parameters $\theta$, and latent variables $\mathbf{Z}$ that are not directly observed, but characterize the underlying process that generated the data.
Ultimately, all models describe relationships between objects. Similarity is therefore a key concept in data analysis; the basis for characterizing the relations, for summarizing the observations, and for predicting future events. Measures of similarity can be defined for different classes of objects such as feature vectors, data sets, or random variables. Similarity in general is a vague concept. Euclidean distance, induced by the Euclidean metric, is a common (dis-)similarity measure for multivariate observations. Correlation is a standard choice for univariate random variables. Mutual information is an information-theoretic measure of statistical dependency between two random variables, characterizing the decrease in the uncertainty concerning the realization of one variable, given the other one. The uncertainty of a random variable is measured in terms of entropy, defined for a continuous variable as $H(X) = -\int p(x) \log p(x) \, dx$ (Shannon, 1948). The mutual information between two random variables is then given by $I(X;Y) = H(X) - H(X \mid Y)$ (see e.g. Gelman et al., 2003). The Kullback-Leibler divergence, or KL-divergence, is a closely related non-symmetric dissimilarity measure for probability distributions, defined as $D_{KL}(p \,\|\, q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx$ (see e.g. Bishop, 2006). Mutual information between two random variables can alternatively be formulated as the KL-divergence between their joint density and the product of their independent marginal densities, which gives the connection $I(X;Y) = D_{KL}(p(x,y) \,\|\, p(x)p(y))$. Mutual information and KL-divergence are central information-theoretic measures of dependency employed in the models of this thesis.
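For discrete distributions, these quantities can be computed directly from the definitions above. The following sketch (helper names are illustrative) computes the KL-divergence and mutual information of small discrete distributions, using the connection between mutual information and the KL-divergence of the joint density from the product of the marginals.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) for discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mutual_information(joint):
    """I(X;Y) computed as KL(joint || product of marginals)."""
    px = [sum(row) for row in joint]              # marginal of X (rows)
    py = [sum(col) for col in zip(*joint)]        # marginal of Y (columns)
    return sum(
        pxy * math.log(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )

# Independent binary variables: joint equals the product of marginals, so I = 0.
independent = [[0.25, 0.25], [0.25, 0.25]]
# Perfectly dependent binary variables: I(X;Y) = H(X) = log 2.
dependent = [[0.5, 0.0], [0.0, 0.5]]
print(mutual_information(independent))  # → 0.0
print(mutual_information(dependent))    # → log 2 ≈ 0.693
```

The second example shows the scale-free nature of mutual information: it measures shared uncertainty in nats, independently of how the variables are labeled or scaled.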
It is important to notice that measures of similarity are inherently coupled to the statistical representation of data and to the goals of the analysis; different representations can reveal different relationships between observations. For instance, the Euclidean distance is sensitive to scaling of the features; representation in natural or logarithmic scale, or with different units can potentially lead to very different analysis results. Not all measures are equally sensitive; mutual information can naturally detect non-linear relationships, and it is invariant to the scale of the variables. On the other hand, estimating mutual information is computationally demanding.
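The sensitivity of the Euclidean distance to feature scaling can be demonstrated in a few lines; the example values are illustrative only.

```python
import math

def euclidean(a, b):
    """Plain Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two hypothetical expression profiles over two features on very different scales.
a, b = [1.0, 100.0], [2.0, 300.0]
print(euclidean(a, b))  # dominated entirely by the large-scale second feature

# Log-transforming the second feature changes the picture completely.
a_log = [a[0], math.log(a[1])]
b_log = [b[0], math.log(b[1])]
print(euclidean(a_log, b_log))  # both features now contribute comparably
```

The same pair of samples can thus appear very distant or quite similar depending solely on the chosen representation, which is why preprocessing decisions are inseparable from the choice of similarity measure.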
Preprocessing refers to computational techniques for selecting, scaling and transforming the data into a suitable form for further analysis. Feature selection has a central role in data analysis, and it is implicitly present in all analysis tasks through the selection of the investigated features.
There are no universally optimal stand-alone feature selection techniques, since the problem is inherently entangled with the analysis task, and multiple equally optimal feature sets may be available for instance in classification or prediction tasks (Guyon and Elisseeff, 2003; Saeys et al., 2007). Successful feature selection can reduce the dimensionality of the data with minimal loss of relevant information, and focus the analysis on particular features. This can reduce model complexity, which is expected to yield more efficient, generalizable and interpretable models. Feature selection is particularly important in genome-wide profiling studies, where the dimensionality of the data is large compared to the number of available samples, and only a small number of features are relevant for the studied phenomenon. This is also known as the large p, small n problem (West, 2003). Advanced feature selection techniques can take into account dependencies between the features, consider weighted combinations of them, and can be designed to interact with the more general modeling task, as for instance in the nearest shrunken centroids classifier of Tibshirani et al. (2002). The constrained subspace clustering model of Publication 3 can be viewed as a feature selection procedure, where high-dimensional genomic observations are decomposed into distinct feature subsets, each of which reveals different relationships between the samples. In Publication 4, identification of maximally informative features between two data sets forms a central part of a regularized dependency modeling framework. In Publications 3-4 the procedure and representations are motivated by biological reasoning and analysis goals.
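As a baseline contrast to the advanced, task-coupled methods discussed above, the simplest filters rank features independently of any downstream model. The following sketch (function name illustrative; not one of the thesis methods) keeps the k features with the highest empirical variance, a common first step when p greatly exceeds n.

```python
def top_variance_features(X, k):
    """Rank features (rows of X) by empirical variance; return the top-k indices."""
    def variance(row):
        m = sum(row) / len(row)
        return sum((x - m) ** 2 for x in row) / len(row)
    order = sorted(range(len(X)), key=lambda i: variance(X[i]), reverse=True)
    return sorted(order[:k])

# Three hypothetical transcripts measured across three samples.
X = [[1.0, 1.0, 1.0],    # constant, uninformative feature
     [0.0, 5.0, 10.0],   # high-variance feature
     [2.0, 2.1, 1.9]]    # near-constant feature
print(top_variance_features(X, 1))  # → [1]
```

Such filters ignore dependencies between features entirely, which is precisely the limitation that the weighted and model-coupled approaches above are designed to overcome.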
Exploratory data analysis refers to the use of computational techniques to summarize and visualize data in order to facilitate the generation of new hypotheses for further study when the search space would otherwise be prohibitively large (Tukey, 1977). The analysis strategy takes the observations as the starting point for discovering interesting regularities and novel research hypotheses for poorly characterized large-scale systems without prior knowledge. The analysis can then proceed from general observations of the data toward confirmatory data analysis, more detailed investigations and hypotheses that can be tested in independent data sets with standard statistical procedures. Exploratory data analysis differs from traditional hypothesis testing, where the hypothesis is given. Light-weight exploratory tools are particularly useful with large data sets when prior knowledge of the system is minimal. Standard exploratory approaches in computational biology include for instance clustering, classification and visualization techniques (Evanko, 2010; Polanski and Kimmel, 2007).
Cluster analysis refers to a versatile family of methods that partition data into internally homogeneous groups of similar data points, often while simultaneously minimizing the similarity between distinct clusters. Clustering techniques enable class discovery from the data. This differs from classification, where the target is to assign new observations into known classes. The partitions provided by clustering can be nested, partially overlapping or mutually exclusive, and many clustering methods generalize the partitioning to cover previously unseen data points (Jain and Dubes, 1988). Clustering can provide compressed representations of the data based on a shared parametric representation of the observations within each cluster, as for instance in K-means or Gaussian mixture modeling (see e.g. Bishop, 2006). Certain clustering approaches, such as hierarchical clustering (see e.g. Hastie et al., 2009), apply recursive schemes that partition the data into internally homogeneous groups without providing a parametric representation of the clusters. Cluster structure can also be discovered by linear algebraic operations on the distance matrices, as for instance in spectral clustering. The different approaches often have close theoretical connections. Clustering in general is an ill-defined concept that refers to a set of related but mutually incompatible objectives (Ben-David and Ackerman, 2008; Kleinberg, 2002). Cluster analysis has been tremendously popular in computational biology, and a comprehensive review of the different applications is beyond the scope of this thesis. It has been observed, for instance, that genes with related functions often have similar expression profiles and cluster together, suggesting that clustering can be used to formulate hypotheses concerning the function of previously uncharacterized genes (DeRisi et al., 1997; Eisen et al., 1998), or to discover novel cancer subtypes with biomedical implications (Sørlie et al., 2001).
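The K-means approach mentioned above can be sketched in a few lines of Lloyd's algorithm; the deterministic initialization and toy data are illustrative only (practical implementations would use k-means++-style initialization and convergence checks).

```python
def kmeans(points, k, iters=20):
    """Minimal Lloyd's algorithm: points are tuples; returns (centroids, labels)."""
    centroids = list(points[:k])  # naive deterministic initialization
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        labels = [min(range(k),
                      key=lambda c: sum((p - q) ** 2
                                        for p, q in zip(pt, centroids[c])))
                  for pt in points]
        # Update step: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(xs) / len(members)
                                     for xs in zip(*members))
    return centroids, labels

data = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),   # cluster near the origin
        (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]   # cluster near (5, 5)
centroids, labels = kmeans(data, k=2)
print(labels)  # → [0, 0, 0, 1, 1, 1]
```

Each cluster is summarized by its centroid, a shared parametric representation of its members, which is exactly the compression property the text describes.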
Visualization techniques are another widely used exploratory approach in computational biology. Visualizations can provide compact and intuitive summaries of complex, high-dimensional observations on a lower-dimensional display, for instance by linear projection methods such as principal component analysis, or by explicitly optimizing a lower-dimensional representation as in the self-organizing map (Kohonen, 1982). Visualization can provide the first step in investigating large data sets (Evanko, 2010).
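A PCA projection of the kind used for such displays can be computed directly from a singular value decomposition of the centered data. The sketch below (assuming NumPy is available; all names and the synthetic data are illustrative) projects samples onto the leading principal components and verifies that one dominant direction captures nearly all of the variance.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project samples (rows of X) onto the top principal components."""
    Xc = X - X.mean(axis=0)                          # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T                  # leading-component coordinates

rng = np.random.default_rng(0)
# 100 samples in 5 dimensions, with almost all variance along one direction.
z = rng.normal(size=(100, 1))
X = z @ np.ones((1, 5)) + 0.01 * rng.normal(size=(100, 5))
Y = pca_project(X, n_components=2)
print(Y.shape)                                       # → (100, 2)
# The first component should capture essentially all of the variance.
print(Y[:, 0].var() / Y.var(axis=0).sum())
```

The two projected coordinates could then be plotted as a scatter display, which is how such linear projections serve as a first exploratory view of a large data set.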
Statistical learning refers to computational models that can learn to recognize structure and patterns from empirical data in an automated way. Unsupervised and supervised models form two main categories of learning algorithms.
Unsupervised learning approaches seek compact descriptions of the data without prior knowledge. In probabilistic modeling, unsupervised learning can be formulated as the task of finding a probability distribution that describes the observed data and generalizes to new observations. This is also called density estimation. The parameter values of the model can be used to provide compact representations of the data. Examples of unsupervised analysis tasks include methods for clustering, visualization and dimensionality reduction. In cluster analysis, groups of similar observations are sought from the data. Dimensionality reduction techniques provide compact lower-dimensional representations of the original data, which is often useful for subsequent modeling steps. Not all regularities in the data are equally valuable, and assessing the relevance of the observed regularities is problematic in fully unsupervised analysis.
In supervised learning the task is to learn a function that maps the inputs to the desired outputs based on a set of training examples in a generalizable fashion, as in regression for continuous outputs and classification for discrete output variables. Supervised learning tasks are inherently asymmetric: the inference proceeds from inputs to outputs, and prior information about the modeling task is used to supervise the analysis; each training example includes a desired output of the model.
The models developed in this thesis can be viewed as unsupervised exploratory techniques. However, the distinction between supervised and unsupervised models is not strict, and the models in this thesis borrow ideas from both categories. The models in Publications 2-3 are unsupervised algorithms that utilize prior information derived from background databases to guide the modeling by constraining the solutions. However, since no desired outputs are available for these models, the modeling tasks differ from supervised analysis. The dependency modeling algorithms of Publications 4-6 have close theoretical connections to the supervised learning task. In contrast to supervised learning, the learning task in these algorithms is symmetric; modeling of the co-occurring data sets is unsupervised, but coupled. Each data set affects the modeling of the other data set in a symmetric fashion, and, in analogy to supervised learning, prediction can then proceed in either direction. Compared to supervised analysis tasks, the emphasis in the dependency detection algorithms introduced in this thesis is on the discovery and characterization of symmetric dependencies, rather than on the construction of asymmetric predictive models.
The main contributions of this thesis follow the generative probabilistic modeling paradigm. Generative probabilistic models describe the observed data in terms of probability distributions. This allows the calculation of expectations, variances and other standard summaries of the model parameters, and at the same time makes it possible to describe the independence assumptions, the relations between variables, and the uncertainty in the modeling process in an explicit manner. Measurements are regarded as noisy observations of the general, underlying processes; generative models are used to characterize the processes that generated the observations.
The first task in modeling is the selection of a model family: a set of potential formal representations of the data. As discussed in Section 3.2.2, the representations can also to some extent be learned from the data. The second task is to define the objective function, or cost function, which is used to measure the descriptive power of the models. The third task is to identify the optimal model within the model family that best describes the observed data with respect to the objective function. This is called learning or model fitting. The details of the modeling process are largely determined by the exact modeling task and the particular nature of the observations. The objectives of the modeling task are encoded in the selected model family, the objective function and, to some extent, the model fitting procedure. The model family determines the space of possible descriptions for the data and therefore has a major influence on the final solution. The objective function can be used to prefer simple models or other desirable aspects in the modeling process. The model fitting procedure affects the efficiency and accuracy of the learning process. For further information on these and related concepts, see Bishop (2006). A general overview of the probabilistic modeling framework is given in the remainder of this section.
Generative probabilistic models view the observations as random samples from an underlying probability distribution. The model defines a probability distribution over the feature space. The model can be parameterized by model parameters $\theta$ that specify a particular model within the model family. For convenience, we assume that the model family is given, and leave it out from the notation. In this thesis, the appropriate model families are selected based on biological hypotheses and analysis goals. Generative models allow efficient representation of dependencies between variables, independence assumptions and uncertainty in the inference (Koller and Friedman, 2009). Let us next consider central analysis tasks in generative modeling.
Classical probability distributions provide well-justified and convenient tools for probabilistic modeling, but in many practical situations the observed regularities in the data cannot be described with a single standard distribution. However, a sufficiently rich mixture of standard distributions can provide arbitrarily accurate approximations of the observed data. In mixture models, a set of distinct, latent processes, or components, is used to describe the observations. The task is to identify and characterize the components and their associations to the individual observations. The standard formulation assumes independent and identically distributed observations where each observation has been generated by exactly one component. In a standard mixture model the overall probability density of the data is modeled as a weighted sum of component distributions:

$$p(\mathbf{x}) = \sum_{c=1}^{C} \pi_c \, p(\mathbf{x} \mid \theta_c),$$

where the components are indexed by $c = 1, \ldots, C$. Each mixture component can have a different distributional form. The mixing proportion, or weight, and model parameters of each component are denoted by $\pi_c$ and $\theta_c$, respectively, with $\sum_c \pi_c = 1$. Many applications utilize convenient standard distributions, such as Gaussians, or other distributions from the exponential family. The mixture model can then be learned for instance with the EM algorithm described in Section 3.3.1.
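The EM fitting referred to above can be sketched for a two-component, one-dimensional Gaussian mixture; this is a textbook-style illustration (initialization scheme and names are illustrative, not the thesis implementation).

```python
import math
import random

def em_gmm_1d(data, iters=100):
    """EM for a two-component 1-D Gaussian mixture; returns (weights, means, sds)."""
    mu = [min(data), max(data)]        # crude initialization at the data extremes
    sigma = [1.0, 1.0]
    pi = [0.5, 0.5]
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each point.
        resp = []
        for x in data:
            w = [pi[c] / (sigma[c] * math.sqrt(2 * math.pi))
                 * math.exp(-0.5 * ((x - mu[c]) / sigma[c]) ** 2) for c in (0, 1)]
            s = sum(w)
            resp.append([wc / s for wc in w])
        # M-step: re-estimate weights, means and variances from responsibilities.
        for c in (0, 1):
            n_c = sum(r[c] for r in resp)
            pi[c] = n_c / len(data)
            mu[c] = sum(r[c] * x for r, x in zip(resp, data)) / n_c
            var = sum(r[c] * (x - mu[c]) ** 2 for r, x in zip(resp, data)) / n_c
            sigma[c] = math.sqrt(max(var, 1e-6))
    return pi, mu, sigma

random.seed(0)
data = ([random.gauss(0.0, 1.0) for _ in range(200)]
        + [random.gauss(8.0, 1.0) for _ in range(200)])
pi, mu, sigma = em_gmm_1d(data)
print(sorted(mu))  # component means recovered near 0 and 8
```

Each iteration alternates between soft component assignments (the latent variables) and parameter updates, which is the pattern the thesis relies on for mixture learning.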
In practice, the mixing proportions of the components are often unknown. The mixing proportions can be estimated from the data by treating them as standard model parameters to be fitted with a maximum likelihood (ML) estimate. However, this procedure is potentially prone to overfitting and local optima, i.e., it may learn to describe the training data well but fail to generalize to new observations. An alternative, probabilistic way to determine the weights is to treat the mixing proportions as latent variables with a prior distribution $p(\boldsymbol{\pi})$. A standard choice is a symmetric Dirichlet prior, with density $\mathrm{Dir}(\boldsymbol{\pi} \mid \boldsymbol{\alpha}) = \frac{\Gamma(\sum_i \alpha_i)}{\prod_i \Gamma(\alpha_i)} \prod_i \pi_i^{\alpha_i - 1}$, where the elements of the multivariate random variable $\boldsymbol{\pi}$ and the positive parameter vector $\boldsymbol{\alpha}$ are indexed by $i$, with $\pi_i \geq 0$ and $\sum_i \pi_i = 1$. This gives an equal prior weight for each component and guarantees the standard exchangeability assumption of the mixture component labels; a label determines cluster identity. Intuitively, exchangeability corresponds to the assumption that the analysis is invariant to the ordering of the data samples and mixture components. Compared to standard mixture models, this fully probabilistic treatment has increased computational complexity.
Further prior knowledge can be incorporated in the model by defining prior distributions for the other parameters of the mixture model. This can also be used to regularize the learning process to avoid overfitting. A typical prior distribution for the components of a Gaussian mixture model, parameterized by $(\mu, \sigma^2)$, is the normal-inverse-Gamma prior (see e.g. Gelman et al., 2003).
Interpreting the mixture components as clusters provides an alternative, probabilistic formulation of the clustering task. This has made probabilistic mixture models a popular choice in the analysis of functional genomics data sets that typically have high dimensionality but small sample size. Probabilistic analysis takes the uncertainties into account in a rigorous manner, which is particularly useful when the sample size is small. The number of mixture components is often unknown in practical modeling tasks, however, and has to be inferred based on the data. A straightforward solution can be obtained by employing a sufficiently large number of components in learning the mixture model, and selecting the components having non-zero weights as a post-processing step. An alternative, model-based treatment for learning the number of mixture components from the data is provided by infinite mixture models considered in Section 3.2.2.
The observed variables are often affected by latent variables that describe relevant structure in the model, but are not directly observed. The latent variable values can be, to some extent, inferred based on the observed variables. Combination of latent and observed variables allows the description of complex probability spaces in terms of simple component distributions and their relations. Use of simple component distributions can provide an intuitive and computationally tractable characterization of complex generative processes underlying the observations.
A generative latent variable model specifies the distributional form and relationships of the latent and observed variables. As a simple example, consider the probabilistic interpretation of principal component analysis (PCA), where the observations are modeled with the linear model $\mathbf{x} = \mathbf{W}\mathbf{z} + \boldsymbol{\varepsilon}$: a normally distributed latent variable $\mathbf{z} \sim N(\mathbf{0}, \mathbf{I})$ is transformed with the parameter matrix $\mathbf{W}$, and isotropic Gaussian noise $\boldsymbol{\varepsilon} \sim N(\mathbf{0}, \sigma^2 \mathbf{I})$ is assumed on the observations. More complex models can be constructed by analogous reasoning. A complete-data likelihood $p(\mathbf{X}, \mathbf{Z} \mid \theta)$ defines a joint density for the observed and latent variables. Only a subset of the variables in the model is typically of interest for the actual analysis task. For instance, the latent variables may be central for describing the generative process of the data, but their actual values may be irrelevant. Such variables are called nuisance variables. Their integration, or marginalization, provides probabilistic averaging over the potential realizations. Marginalization over the latent variables in the complete-data likelihood gives the likelihood $p(\mathbf{X} \mid \theta) = \int p(\mathbf{X}, \mathbf{Z} \mid \theta) \, d\mathbf{Z}$.
Marginalization over the latent variables collapses the modeling task to finding optimal values for the model parameters $\theta$, in a way that takes into account the uncertainty in latent variable values. This can reduce the number of variables in the learning phase, yield more straightforward and robust inferences, as well as speed up computation. However, marginalization may lead to analytically intractable integrals. As certain latent variables may be directly relevant, marginalization depends on the overall goals of the analysis and may cover only a subset of the latent variables. In this thesis latent variables are utilized for instance in Publication 3, which treats the sample-cluster assignments as discrete latent variables, as well as in Publication 4, where a regularized latent variable model is introduced to model dependencies between co-occurring observations.
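The probabilistic PCA example makes marginalization concrete: integrating out the latent variable gives the closed-form marginal $\mathbf{x} \sim N(\mathbf{0}, \mathbf{W}\mathbf{W}^T + \sigma^2 \mathbf{I})$. The sketch below (assuming NumPy; dimensions and parameter values are illustrative) samples from the generative process and checks that the sample covariance approaches this marginal covariance.

```python
import numpy as np

rng = np.random.default_rng(1)
d, q, n = 5, 2, 50000          # observed dim, latent dim, number of samples
W = rng.normal(size=(d, q))    # loading matrix (illustrative values)
sigma = 0.1                    # isotropic noise level

# Generative process: z ~ N(0, I_q), x = W z + eps, eps ~ N(0, sigma^2 I_d).
Z = rng.normal(size=(n, q))
X = Z @ W.T + sigma * rng.normal(size=(n, d))

# Marginalizing z analytically gives x ~ N(0, W W^T + sigma^2 I); for large n
# the empirical covariance of the samples should approach this matrix.
model_cov = W @ W.T + sigma**2 * np.eye(d)
sample_cov = np.cov(X, rowvar=False)
print(np.abs(sample_cov - model_cov).max())  # small for large n
```

Note that the marginal density involves only the parameters $\mathbf{W}$ and $\sigma$, so fitting can proceed without ever estimating the individual latent values, which is exactly the simplification marginalization buys.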
Finite mixture models and latent variable models require the specification of the model structure prior to the analysis. This can be problematic since, for instance, the number and distributional shape of the generative processes are unknown in many practical tasks. However, the model structure can also, to some extent, be learned from the data. Non-parametric models provide principled approaches for learning the model structure from the data. In contrast to parametric models, the number and use of the parameters in nonparametric models is flexible (see e.g. Hjort et al., 2010; Müller and Quintana, 2004). The infinite mixture of Gaussians, used as a part of the modeling process in Publication 3
, is an example of a non-parametric model where both the number of components and the mixture proportions of the component distributions are inferred from the data. Learning of Bayesian network structure is another example of nonparametric inference, where relations between the model variables are learned from the data (see e.g. Friedman, 2003). While more complex models can describe the training data more accurately, increasing model complexity needs to be penalized to avoid overfitting and to ensure generalizability of the model.
Nonparametric models provide flexible and theoretically principled approaches for data-driven exploratory analysis. However, the flexibility often comes with an increased computational cost, and the models are potentially more prone to overfitting than less flexible parametric models. Moreover, complex models can be difficult to interpret.
Many nonparametric probabilistic models are defined by using the theory of stochastic processes to impose priors over potential model structures. Stochastic processes can be used to define priors over function spaces. For instance, the Dirichlet process (DP) defines a distribution over distributions: if G is drawn from a Dirichlet process with concentration parameter α and base measure G₀ over the sample space, then for each finite partition (A₁, …, A_K) of the sample space, the vector (G(A₁), …, G(A_K)) follows the Dirichlet distribution Dir(αG₀(A₁), …, αG₀(A_K)). The Chinese Restaurant Process (CRP) provides an intuitive description of the Dirichlet process in the cluster analysis context. The CRP defines a prior distribution over the number of clusters and their size distribution. The CRP is a random process in which customers arrive in a restaurant that has an infinite number of tables. The process goes as follows: the first customer chooses the first table. Each subsequent customer selects a table based on the state of the restaurant after the previous customers have arrived. The new customer selects a previously occupied table j with a probability proportional to the number of customers n_j already seated at that table, or an empty table with a probability proportional to a concentration parameter α. The model prefers tables with a larger number of customers, and is analogous to clustering, where the customers and tables correspond to samples and clusters, respectively. This provides an intuitive prior distribution for clustering tasks. The prior prefers compact models with relatively few clusters, but the number of clusters is potentially infinite, and ultimately determined based on the data.
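The seating rule described above can be written as a short simulation. The following is an illustrative Python sketch, not code from the thesis; the function name and interface are hypothetical:

```python
import random

def chinese_restaurant_process(n_customers, alpha, rng=random.Random(0)):
    """Simulate table assignments under a CRP with concentration alpha."""
    tables = []       # tables[j] = number of customers seated at table j
    assignments = []  # table index chosen by each arriving customer
    for i in range(n_customers):
        # Occupied table j is chosen with probability proportional to n_j,
        # a new table with probability proportional to alpha.
        weights = tables + [alpha]
        r = rng.uniform(0, sum(weights))
        acc = 0.0
        for j, w in enumerate(weights):
            acc += w
            if r <= acc:
                break
        if j == len(tables):
            tables.append(1)   # open a new table
        else:
            tables[j] += 1     # join an existing table
        assignments.append(j)
    return assignments, tables
```

Running the simulation with a small α tends to produce few, large tables, mirroring the preference for compact clusterings.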
Infinite mixture models are a general class of nonparametric methods where the number of mixture components is determined in a data-driven manner; the number of components is potentially infinite (see e.g. Müller and Quintana, 2004; Rasmussen, 2000). An infinite mixture is obtained by letting the number of components tend to infinity in the finite mixture model of Equation 3.1 and replacing the Dirichlet distribution prior of the mixing proportions by a Dirichlet process. The formal probability distribution of the Dirichlet process can be intuitively derived with the so-called stick-breaking representation. Consider a unit-length stick and a stick-breaking process where each breakpoint is stochastically determined, following the beta distribution β_k ~ Beta(1, α), where α tunes the expected breaking point. The process can be viewed as consecutively breaking off portions of the unit-length stick to obtain an infinite sequence of stick lengths π_k = β_k ∏_{l&lt;k}(1 − β_l), with ∑_k π_k = 1 (Ishwaran and James, 2001). This defines a probability distribution over potential partitionings of the unit stick. A truncated stick-breaking representation considers only the first K elements. Using the stick lengths of this representation as a prior on the mixing proportions in Equation 3.1 assigns a prior on the number of mixture components and their mixing proportions that are ultimately learned from the observed data. The prior helps to find a compromise between increasing model complexity and the likelihood of the observations.
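The truncated stick-breaking construction can be sketched in a few lines. This is an illustrative Python sketch, not from the thesis; break points are drawn from Beta(1, α) and the resulting stick lengths serve as mixing proportions:

```python
import random

def stick_breaking_weights(alpha, k, rng=random.Random(0)):
    """Truncated stick-breaking: first k mixing proportions of a DP(alpha)."""
    weights = []
    remaining = 1.0  # length of the stick still unbroken
    for _ in range(k):
        beta = rng.betavariate(1.0, alpha)  # break point ~ Beta(1, alpha)
        weights.append(remaining * beta)    # piece broken off the stick
        remaining *= 1.0 - beta             # what is left for later pieces
    return weights
```

Smaller α concentrates the mass on the first few components; the truncated weights sum to slightly less than one, the remainder corresponding to the discarded tail.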
Traditional approaches used to determine the number of mixture components are based on objective functions that penalize increasing model complexity, for instance in certain variants of K-means or in spectral clustering (see e.g. Hastie et al., 2009). Other model selection criteria include cross-validation and comparison of the models in terms of their likelihood or various information-theoretic criteria that seek a compromise between model complexity and fit (see e.g. Gelman et al., 2003). However, the sample size may be insufficient for such approaches, and the models may lack a rigorous framework to account for uncertainties in the observations and model parameters. Modeling the uncertainty in the parameters while learning the model structure can lead to more robust inference in nonparametric probabilistic models, but it also adds inherent computational complexity to the learning process.
The term ’Bayesian’ refers to the interpretation of model parameters as random variables. The uncertainty over the parameter values, arising from limited empirical evidence, is described in terms of probability distributions. This is in contrast to the traditional view, where parameters have fixed values with no distribution and the uncertainty is ignored. The Bayesian approach leads to a learning task where the objective is to estimate the posterior distribution p(θ|X) of the model parameters θ, given the observations X. The posterior is given by Bayes’ rule (Bayes, 1763): p(θ|X) = p(X|θ)p(θ)/p(X).
The two key elements of the posterior are the likelihood and the prior. The likelihood p(X|θ) describes the probability of the observations, given the parameter values θ. The parameters can also characterize alternative model structures. The prior p(θ) encodes prior beliefs about the model and rewards solutions that match the prior assumptions or yield simpler models. Such regularizing properties can be particularly useful when training data is scarce and there is considerable uncertainty in the parameter estimates. With strong, informative priors, new observations have little effect on the posterior. In the limit of large sample size the posterior converges to the ordinary likelihood
. Bayesian inference provides a robust framework for taking the uncertainties into account when the data is scarce, as it often is in practical modeling tasks. Moreover, Bayes’ rule provides a formal framework for the sequential update of beliefs based on accumulating evidence. The prior predictive density p(X) is a normalizing constant, which is independent of the parameters and can often be ignored during model fitting.
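The sequential update of beliefs is easiest to see in a conjugate case. The following minimal sketch, not from the thesis, assumes a Beta prior on a binomial success probability: processing the evidence in batches gives the same posterior as processing it all at once.

```python
def update_beta(prior_a, prior_b, successes, failures):
    """Conjugate Bayesian update: Beta prior + binomial evidence -> Beta posterior."""
    return prior_a + successes, prior_b + failures

# Sequential updating of beliefs: each batch of evidence updates the
# current posterior, which serves as the prior for the next batch.
a, b = 1.0, 1.0                   # uniform Beta(1, 1) prior
a, b = update_beta(a, b, 3, 1)    # first batch of observations
a, b = update_beta(a, b, 2, 4)    # second batch
```

After both batches the posterior is Beta(6, 6), identical to updating the prior with all five successes and five failures in one step.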
The involved distributions can have complex non-standard forms and limited empirical data can only provide partial evidence regarding the different aspects of the data-generating process. Often only a subset of the parameters and other variables and their interdependencies can be directly observed. The Bayesian approach provides a framework for making inferences on the unobserved quantities through hierarchical models, where the probability distribution of each variable is characterized by higher-level parameters, so-called hyperparameters
. A similar reasoning can be used to model the uncertainty in the hyperparameters, until the uncertainties are modeled at an appropriate level of detail. Prior information can help to compensate for the lack of data on certain aspects of a model, and explicit noise models can characterize the uncertainty in the empirical observations. Distributions can also share parameters, which provides a basis for pooling evidence from multiple sources, as for instance in Publication 4. In many applications only a subset of the parameters in the model are of interest, and the modeling process can be considerably simplified by marginalizing over the less interesting nuisance variables to obtain an expectation over their potential values.
The Bayesian paradigm provides a principled framework for modeling the uncertainty at all levels of statistical inference, including the parameters, the observed and latent variables and the model structure; all information of the model is incorporated in the posterior distribution, which summarizes empirical evidence and prior knowledge, and provides a complete description of the expected outcomes of the data-generating process. When the data does not contain sufficient information to decide between the alternative model structures and parameter values, the Bayesian framework provides tools to take expectations over all potential models, weighted by their relative evidence.
A central challenge in the Bayesian analysis is that the models often include analytically intractable posterior distributions, and learning of the models can be computationally demanding. Widely-used approaches for estimating posterior distributions include Markov Chain Monte Carlo (MCMC) methods and variational learning. Stochastic MCMC methods provide a widely-used family of algorithms to estimate intractable distributions by drawing random samples from these distributions (see e.g. Gelman et al., 2003); a sufficiently large pool of random samples will converge to the underlying distribution, and sample statistics can then be used to characterize the distribution. However, sampling-based methods are computationally intensive and slow. In variational learning, considered in Section 3.3.1
, the intractable distributions are approximated by more convenient tractable distributions, which yields a faster learning procedure but potentially less accurate results. While analysis of the full posterior distribution provides a complete description of the uncertainties regarding the parameters, simplified summary statistics, such as the mean, variance, and quantiles of the posterior, can provide a sufficient characterization in many practical applications. They can be obtained, for instance, by summarizing the output of sampling-based or variational methods. Moreover, when the uncertainty in the results can be ignored, point estimates can provide simple, interpretable summaries that are often useful in further biomedical analysis, as for instance in Publication 2. Point estimates are single optimal values with no distribution. However, point estimates are not necessarily sufficient, for instance, in biomedical diagnostics and other prediction tasks, where different outcomes are associated with different costs and it may be crucial to assess the probabilities of the alternative outcomes. For further discussion on learning Bayesian models, see Section 3.3.1.
In this thesis the Bayesian approach provides a formal framework to perform robust inference based on incomplete functional genomics data sets and to incorporate prior information of the models in the analysis. The Bayesian paradigm can alternatively be interpreted as a philosophical position, where probability is viewed as a subjective concept (Cox, 1946), or considered a direct consequence of making rational decisions under uncertainty (Bernardo and Smith, 2000). For further concepts in model selection, comparison and averaging in the Bayesian analysis, see Gelman et al. (2003). For applications in computational biology, see Wilkinson (2007).
The final stage in probabilistic modeling is to learn the optimal statistical representation for the data, given the model family and the objective function. This section highlights central challenges and methodological issues in statistical learning.
Learning in probabilistic models often focuses on optimizing the model parameters θ. In addition, the posterior distribution of the latent variables, p(Z|X, θ), can be calculated. Estimating the latent variable values is called statistical inference
. In the Bayesian analysis, the model parameters can also be treated as latent variables with a prior probability density, in which case the distinction between model parameters and latent variables will disappear. A comprehensive characterization of the variables and their uncertainty would be achieved by estimating the full posterior distribution. However, this can be computationally very demanding, in particular when the posterior is not analytically tractable. The posterior is often approximated with stochastic or analytical procedures, such as stochastic MCMC sampling methods or variational approximations, and appropriate summary statistics. In many practical settings, it is sufficient to summarize the full posterior distribution with a point estimate. Point estimates do not characterize the uncertainties in the analysis result, but are often more convenient to interpret than full posterior distributions.
Various optimization algorithms are available for learning statistical models, given the learning procedure. Potential challenges in the optimization include computational complexity, the presence of local optima on complex probability density topologies, and unidentifiability of the models. Finding the global optimum of a complex model can be computationally demanding, and it can become intractable with increasing sample size. In unidentifiable models, the data does not contain sufficient information to choose between alternative models that have equal statistical evidence. Ultimately, the uncertainty in inference arises from the limited sample size and limited computational resources.
In the remainder of this section, let us consider more closely the particular learning procedures central to this thesis: point estimates and variational approximation, and the standard optimization algorithms used to learn such representations.
Assuming independent and identically distributed observations, the likelihood of the data, given the model parameters, is p(X|θ) = ∏_i p(x_i|θ). This provides a probabilistic measure of model fit and the objective function to maximize. Maximization of the likelihood with respect to θ yields a maximum likelihood (ML) estimate of the model parameters, and specifies an optimal model that best describes the data. This is a standard point estimate used in probabilistic modeling. Practical implementations typically operate on the log-likelihood, the logarithm of the likelihood function. As a monotone transformation, this yields the same optima, but has additional desirable properties: it factorizes the product into a sum and is less prone to numerical overflows during optimization.
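As an illustration of the log-likelihood as an objective function, consider i.i.d. Gaussian observations with known variance. This is a hypothetical sketch, not from the thesis; for this model the ML estimate of the mean is simply the sample mean:

```python
import math

def gaussian_log_likelihood(data, mu, sigma):
    """Sum of log-densities: log p(X | mu, sigma) for i.i.d. observations."""
    return sum(
        -0.5 * math.log(2 * math.pi * sigma ** 2)
        - (x - mu) ** 2 / (2 * sigma ** 2)
        for x in data
    )

def ml_estimate_mu(data):
    """For a Gaussian with known sigma, the ML estimate of mu is the sample mean."""
    return sum(data) / len(data)
```

Evaluating the log-likelihood at the sample mean gives a higher value than at any other candidate mean, which is what makes it the ML estimate.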
The maximum a posteriori (MAP) estimate additionally takes prior information of the model parameters into account. While the ML estimate maximizes the likelihood of the observations, the MAP estimate maximizes the posterior of the model parameters. The objective function to maximize is the logarithm of the unnormalized posterior, log p(X|θ) + log p(θ).
The prior is explicit in MAP estimation, and the model contains the ML estimate as a special case: assuming a large sample size, or a non-informative, uniform prior p(θ), the likelihood of the data will dominate and MAP estimation becomes equivalent to optimizing the likelihood p(X|θ), yielding the traditional ML estimate. The ML and MAP estimates are asymptotically consistent approximations of the posterior distribution, since the posterior converges to a point distribution with a large sample size. The computation and interpretation of point estimates is straightforward compared to the use of posterior distributions in the full Bayesian treatment. The differences between the ML and MAP estimates highlight the role of prior information in the modeling when training data is limited.
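The role of the prior can be made concrete in the conjugate Gaussian case, where the MAP estimate of the mean is a precision-weighted average of the prior mean and the sample mean; with growing sample size the data dominate and the ML estimate is recovered. An illustrative sketch, not from the thesis:

```python
def map_estimate_mu(data, prior_mu, prior_var, noise_var):
    """MAP estimate of a Gaussian mean under a Gaussian prior (conjugate case):
    a precision-weighted average of the prior mean and the sample mean."""
    n = len(data)
    sample_mean = sum(data) / n
    precision = 1.0 / prior_var + n / noise_var
    return (prior_mu / prior_var + n * sample_mean / noise_var) / precision
```

With ten observations centered at 4 and a standard normal prior centered at 0, the MAP estimate is pulled toward the prior (40/11 ≈ 3.64); with thousands of observations it converges to the sample mean.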
In certain modeling tasks the uncertainty in the model parameters needs to be taken into account; then point estimates are not sufficient. The uncertainty is characterized by the posterior distribution p(θ|X). However, posterior distributions are often intractable and need to be estimated by approximative methods. Variational approximations provide a fast and principled optimization scheme (see e.g. Bishop, 2006) that yields only approximative solutions, but can accelerate posterior inference by orders of magnitude compared to stochastic, sampling-based MCMC methods, which can in principle provide exact solutions, assuming infinite computational resources. The potential decrease in accuracy of variational approximations is often acceptable, given the gains in efficiency. A variational approximation characterizes the uncertainty in θ with a tractable distribution q(θ) that approximates the full, potentially intractable posterior p(θ|X).
Variational inference is formulated as an optimization problem where an intractable posterior distribution is approximated by a more easily tractable distribution by minimizing the Kullback–Leibler (KL) divergence between the two distributions. This can also be shown to maximize a lower bound of the marginal likelihood p(X), and subsequently the likelihood of the data, yielding an approximation of the overall model. The log-likelihood of the data can be decomposed into a sum of a lower bound L(q) and the KL divergence between the approximative and the exact posterior distributions: log p(X) = L(q) + D_KL(q(θ) ‖ p(θ|X)).
The KL divergence is non-negative, and equals zero if and only if the approximation and the exact distribution are identical. Therefore L(q) gives a lower bound for the log-likelihood in Equation 3.5. Minimization of the KL divergence with respect to q provides an analytically tractable approximation of the posterior; it also maximizes the lower bound L(q), since the log-likelihood is independent of q. The approximation typically assumes independent parameters and latent variables, yielding a factorized approximation based on tractable standard distributions. It is also possible to factorize the parameters and latent variables into further components. Variational approximations are used for efficient learning of infinite multivariate Gaussian mixture models in Publication 3.
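The central quantity here is the KL divergence; for discrete distributions it can be computed directly. An illustrative sketch, not from the thesis; note that the divergence is non-negative, zero only for identical distributions, and asymmetric in its arguments:

```python
import math

def kl_divergence(q, p):
    """D_KL(q || p) for discrete distributions given as probability lists."""
    # Terms with q_i = 0 contribute zero by convention (0 * log 0 = 0).
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)
```

The asymmetry is one reason variational approximations behave differently depending on which direction of the divergence is minimized.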
The EM algorithm is a general procedure for learning probabilistic latent variable models (Dempster et al., 1977), and a special case of variational inference. It provides an efficient procedure for finding point estimates of the model parameters in latent variable models. The objective of the EM algorithm is to maximize the marginal likelihood
of the observations with respect to the model parameters θ. Marginalization over the probability density of the latent variables provides an inference procedure that is robust to uncertainty in the latent variable values. The algorithm iterates between estimating the posterior of the latent variables and optimizing the model parameters (see e.g. Bishop, 2006). Given initial values of the model parameters, the expectation step evaluates the posterior density of the latent variables, p(Z|X, θ), keeping θ fixed. If the posterior is not analytically tractable, a variational approximation can be used to obtain a lower bound for the likelihood in Equation 3.7. The maximization step optimizes the model parameters with respect to the following objective function:
This is the expectation of the complete-data log-likelihood over the latent variable density obtained from the previous expectation step. The new parameter estimate is then the value of θ that maximizes this objective.
The expectation and maximization steps determine an iterative learning procedure where the latent variable density and the model parameters are updated in turn until convergence. The maximization step will also increase the target likelihood of Equation 3.7, but potentially with a considerably smaller computational cost (Dempster et al., 1977). In contrast to the marginal likelihood in Equation 3.7, the complete-data likelihood in Equation 3.8
is logarithmized before integration in the maximization step. When the joint distribution belongs to the exponential family, the logarithm cancels the exponential in the algebraic manipulations. This can considerably simplify the maximization step. When the likelihoods in the optimization are of suitable form, the iteration steps can be solved analytically, which can considerably reduce the required evaluations of the objective function. Convergence is guaranteed if the optimization can increase the likelihood at each iteration. However, the identification of a global optimum is not guaranteed by the EM algorithm.
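The iteration can be illustrated for a two-component, one-dimensional Gaussian mixture with unit variances. This is a deliberately simplified sketch, not one of the models used in the publications:

```python
import math

def em_gaussian_mixture(data, n_iter=50):
    """EM for a two-component 1-D Gaussian mixture with unit variances.
    E-step: posterior responsibilities of the latent component labels.
    M-step: re-estimate the component means and the mixing proportion."""
    mu1, mu2, pi1 = min(data), max(data), 0.5  # crude initialization
    for _ in range(n_iter):
        # E-step: responsibility of component 1 for each observation
        resp = []
        for x in data:
            d1 = pi1 * math.exp(-0.5 * (x - mu1) ** 2)
            d2 = (1 - pi1) * math.exp(-0.5 * (x - mu2) ** 2)
            resp.append(d1 / (d1 + d2))
        # M-step: update parameters given the responsibilities
        r1 = sum(resp)
        mu1 = sum(r * x for r, x in zip(resp, data)) / r1
        mu2 = sum((1 - r) * x for r, x in zip(resp, data)) / (len(data) - r1)
        pi1 = r1 / len(data)
    return mu1, mu2, pi1
```

On well-separated data the responsibilities quickly become near 0/1 and the means converge to the cluster centers; with overlapping components or poor initialization the algorithm may settle into a local optimum, as noted above.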
Incorporating prior information of the parameter values through Bayesian priors can be used to avoid overfitting and focus the modeling on particular features in the data, as in the regularized dependency modeling framework of Publication 4, where the EM algorithm is used to learn Gaussian latent variable models.
Optimization methods provide standard tools to implement selected learning procedures. Optimization algorithms are used to identify parameter values that minimize or maximize the objective function, either globally, or in local surroundings of the optimized value. Selection of optimization method depends on smoothness and continuity properties of the objective function, required accuracy, and available resources.
Gradient-based approaches optimize the objective function by assuming a smooth, continuous topology of the probability density, where setting the derivatives to zero yields local optima. If a closed-form solution is not available, it is often possible to estimate the gradient direction at a given point. Optimization can then proceed by updating the parameters along the gradient, gradually improving the objective function value in subsequent gradient ascent steps. So-called quasi-Newton methods use function values and gradients to characterize the optimized manifold, and to optimize the parameters along the approximated gradients. An appropriate step length is identified automatically based on the curvature of the objective function surface. The Broyden–Fletcher–Goldfarb–Shanno (BFGS) method (Broyden, 1970; Fletcher, 1970; Goldfarb, 1970; Shanno, 1970) is a quasi-Newton approach used for standard optimization tasks in this thesis.
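As a simplified illustration of gradient-based optimization, the following sketch implements plain gradient descent rather than BFGS (the quasi-Newton curvature approximation is omitted for brevity; names are hypothetical):

```python
def gradient_descent(grad, x0, lr=0.1, n_steps=200):
    """Plain gradient descent: repeatedly step against the gradient
    of the objective function until (approximate) convergence."""
    x = x0
    for _ in range(n_steps):
        x = x - lr * grad(x)  # fixed step length, unlike quasi-Newton methods
    return x
```

For the convex objective f(x) = (x − 3)², with gradient 2(x − 3), the iterates converge geometrically to the minimizer x = 3; BFGS would additionally adapt the step using an approximation of the curvature.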
Probabilistic models are formulated in terms of probability distributions over the sample space and parameter values. This forms the basis for generalization to new, unobserved events. A generalizable model can describe essential characteristics of the underlying process that generated the observations; a generalizable model is also able to characterize future observations. Overlearning, or overfitting refers to models that describe the training data well, but do not generalize to new observations. Such models describe not only the general processes underlying the observations, but also noise in the particular observations. Avoiding overfitting is a central aspect in modeling. Overlearning is particularly likely when training data is scarce. While overfitting could in principle be avoided by collecting more data, this is often not feasible since the cost of data collection can be prohibitively large.
Generalizability can be measured by investigating how accurately the model describes new observations. A standard approach is to split the data into a training set, used to learn the model, and a test set, used to measure model performance on unseen observations that were not used for training. In cross-validation the test is repeated with several different learning and test sets to assess the variability in the testing procedure. Cross-validation is used for instance in Publication 5 of this thesis. Bootstrap analysis (see, for instance, Efron and Tibshirani, 1994) is another widely used approach to measure model performance. The observed data is viewed as a finite realization of an underlying probability density. New samples from the underlying density are obtained by re-sampling the observed data points with replacement to simulate variability in the original data; observations from the more dense regions of the probability space become re-sampled more often than rare events. Each bootstrap sample resembles the probability density of the original data. Modeling multiple data sets obtained with the bootstrap helps to estimate the sensitivity of the model to variations in the data. Bootstrap is used to assess model performance in Publication 6.
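The bootstrap resampling scheme described above can be sketched as follows; an illustrative example, not from the thesis, where the statistic of interest is the sample mean:

```python
import random

def bootstrap_means(data, n_samples, rng=random.Random(0)):
    """Resample with replacement and record the statistic of interest
    (here the mean) for each bootstrap sample."""
    means = []
    for _ in range(n_samples):
        sample = [rng.choice(data) for _ in data]  # same size, with replacement
        means.append(sum(sample) / len(sample))
    return means
```

The spread of the recorded means estimates the sensitivity of the statistic to sampling variation in the original data.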
In general, increasing model complexity yields more flexible models, which have higher descriptive power but are, on the other hand, more likely to overfit. Therefore relatively simple models can often outperform more complex models in terms of generalizability. A compromise between simplicity and descriptive power can be obtained by imposing additional constraints or soft penalties in the modeling to prefer compact solutions, while at the same time retaining the descriptive power of the original, flexible model family. This is called regularization. Regularization is particularly important when the sample size is small, as demonstrated for instance in Publication 4, where explicit and theoretically principled regularization is achieved by setting appropriate priors on the model structure and parameter values. The priors will then affect the MAP estimate of the model parameters. One commonly used approach is to prefer sparse solutions that allow only a small number of the potential parameters to be employed at the same time to model the data (see e.g. Archambeau and Bach, 2008). A family of probabilistic approaches for balancing model fit against model complexity is provided by information-theoretic criteria (see e.g. Gelman et al., 2003). The Bayesian Information Criterion (BIC) is a widely used information criterion that introduces a penalty term on the number of model parameters to prefer simpler models. The log-likelihood of the data, given the model, is balanced by a measure of model complexity, (k/2) log n, in the final objective function log p(X|θ) − (k/2) log n, where k denotes the number of model parameters and n is the sample size of the investigated data set. The BIC has been criticized because it does not address changes in prior distributions, and its derivation is based on asymptotic considerations that hold only approximately with finite sample size (see e.g. Bishop, 2006). On the other hand, BIC provides a principled regularization procedure that is easy to implement.
In this thesis, the BIC has been used to regularize the algorithms in Publication 3.
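Under the sign convention where larger values are better, the BIC objective reduces to a one-line penalized log-likelihood. An illustrative sketch with hypothetical names:

```python
import math

def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion as a penalized log-likelihood:
    log p(X | model) - (k / 2) * log(n). Larger is better here."""
    return log_likelihood - 0.5 * n_params * math.log(n_obs)
```

A model with a slightly better fit but many more parameters can thus score worse than a simpler model, which is exactly the preference for compact solutions described above.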
After learning a probabilistic model, it is necessary to confirm the quality of the model and verify potential findings in further, independent experiments. Validation refers to a versatile set of approaches used to investigate model performance, as well as in model criticism, comparison and selection. Internal and external approaches provide two complementary categories for model validation. Internal validation refers to procedures to assess model performance based on training data alone. For instance, it is possible to estimate the sensitivity of the model to initialization, parameterization, and variations in the data, or convergence of the learning process. Internal analysis can help to estimate the weaknesses and generalizability of the model, and to compare alternative models. Bootstrap and cross-validation are widely used approaches for internal validation and the analysis of model performance (see e.g. Bishop, 2006). Bootstrap can provide information about the sensitivity of the results to sampling effects in the data. Cross-validation provides information about the model generalization performance and robustness by comparing predictions of the model to real outcomes. External validation approaches investigate model predictions and fit on new, independent data sets and experiments. Exploratory analysis of high-throughput data sets often includes massive multiple testing, and provides potentially thousands of automatically generated hypotheses. Only a small set of the initial findings can be investigated more closely by human intervention and costly laboratory experiments. This highlights the need to prioritize the results and assess the uncertainty in the models.
Measurement data obtained with novel high-throughput technologies comes with high levels of uncontrolled biological and technical variation. This is often called noise as it obscures the measurements, and adds potential bias and variance on the signal of interest. Biological noise is associated with natural biological variation between cell populations, cellular processes and individuals. Single-nucleotide polymorphisms, alternative splicing and non-specific hybridization add biological variation in the data (Dai et al., 2005; Zhang et al., 2005). More technical sources of noise in the measurement process include RNA extraction and amplification, experiment-specific variation, as well as platform- and laboratory-specific effects (Choi et al., 2003; MAQC Consortium, 2006; Tu et al., 2002).
A significant source of noise on gene expression arrays comes from individual probes that are designed to measure the activity of a given transcript in a biological sample. Figure 4.1A shows probe-level observations of differential gene expression for a collection of probes designed to target the same mRNA transcript. One of the probes is highly contaminated and likely to add unrelated variation to the analysis. A number of factors affect probe performance. For instance, it has been reported in Publication 1 and elsewhere (Hwang et al., 2004; Mecham et al., 2004b) that a large portion of microarray probes may target unintended mRNA sequences. Moreover, although the probes have been designed to hybridize uniquely with their intended mRNA target, considerable cross-hybridization of the probes with single-nucleotide polymorphisms (Dai et al., 2005; Sliwerska et al., 2007) and with other mRNAs of closely similar sequence (Zhang et al., 2005) has been reported; high-affinity probes with high GC-content may have a higher likelihood of cross-hybridization with nonspecific targets (Mei et al., 2003). Alternative splicing (MAQC Consortium, 2006) and mRNA degradation (Auer et al., 2003) may cause differences between probes targeting different positions of the gene sequence. Such effects contribute to probe-level contamination in a probe- and condition-specific manner. However, the sources of probe-level noise are still poorly understood (Irizarry et al., 2005; Li et al., 2005) despite their importance for expression analysis and probe design.
High levels of noise set specific challenges for analysis. Better understanding of the technical aspects of the measurement process will lead to improved analytical procedures and ultimately to more accurate biological results (Reimers, 2010). Publication 2 provides computational tools to investigate probe performance and the relative contributions of the various sources of probe-level contamination on short oligonucleotide arrays.
Preprocessing of the raw data obtained from the original measurements can help to reduce noise and improve comparability between microarray experiments. Preprocessing can be defined in terms of statistical transformations on the raw data, and this is a central part of data analysis in high-throughput studies. This section outlines the standard preprocessing steps for short oligonucleotide arrays, the main source of transcriptional profiling data in this thesis. However, the general concepts also apply to other microarray platforms (Reimers, 2010).
A number of preprocessing techniques for short oligonucleotide arrays have been introduced (Irizarry et al., 2006; Reimers, 2010). The standard preprocessing steps in microarray analysis include quality control, background correction, normalization and summarization.
Microarray quality control is used to identify arrays with notable experimental defects, and to remove them from subsequent analysis. The typical tests consider RNA degradation levels and a number of other summary statistics to guarantee that the array data is of reasonable quality. The arrays that pass the quality control are preprocessed further. Each array typically has spatial biases that vary smoothly across the array, arising from technical factors in the experiment. Background correction is used to detect and remove such spatial effects from the array data, and to provide a uniform background signal, enhancing the comparability of the probe-level observations between different parts of the array. Moreover, background correction can estimate the general noise level on the array; this helps to detect probes whose signal differs significantly from the background noise. Robust multi-array averaging (RMA) is one of the most widely used approaches for preprocessing short oligonucleotide array data (Irizarry et al., 2003a). The background correction in RMA is based on a global model for probe intensities: the observed intensity is modeled as a sum of an exponentially distributed signal component and Gaussian background noise, and background-corrected data is then obtained as the expectation of the signal, given the observed intensity. While background correction makes the observations comparable within an array, normalization is used to improve the comparability between arrays. Quantile normalization is a widely used method that forces all arrays to follow the same empirical intensity distribution (see e.g. Bolstad et al., 2003). Quantile normalization makes the measurements across different arrays comparable, assuming that the overall distribution of mRNA concentration is approximately the same in all cell populations. This has proven to be a feasible assumption in transcriptional profiling studies. As always, there are exceptions.
For instance, human brain tissues have systematic differences in gene expression compared to other organs. On short oligonucleotide arrays, a number of probes target the same transcript. In the final summarization step, the individual probe-level observations of each target transcript are summarized into a single estimate of transcript activity. Standard algorithmic implementations are available for each preprocessing step.
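To make the normalization step concrete, quantile normalization can be sketched in a few lines of Python. This is a minimal illustration that maps each array (column) onto the mean empirical distribution; tie handling and the other preprocessing steps are omitted:

```python
import numpy as np

def quantile_normalize(X):
    """Force every array (column) of X to follow the same empirical
    intensity distribution, here the mean of the sorted columns.

    X: (n_probes, n_arrays) matrix of intensities.
    """
    order = np.argsort(X, axis=0)                 # sort order within each array
    ranks = np.argsort(order, axis=0)             # rank of each observation
    mean_dist = np.sort(X, axis=0).mean(axis=1)   # shared reference distribution
    return mean_dist[ranks]                       # map ranks back to quantiles
```

After normalization, every array carries exactly the same set of intensity values; only their assignment to probes differs, which is what makes between-array comparisons meaningful under the shared-distribution assumption.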
Differences in probe characteristics cause systematic differences in probe performance. The use of several probes for each target leads to more robust estimates of transcript activity, but it is clear that probe quality may significantly affect the results of a microarray study (Irizarry et al., 2003b). Widely used preprocessing algorithms utilize probe-specific parameters to model probe-specific effects in the probe summarization step. Some of the first and most well-known probe-level preprocessing algorithms include dChip/MBEI (Li and Wong, 2001), RMA (Irizarry et al., 2003a), and gMOS (Milo et al., 2003). Taking probe-level effects into account can considerably improve the quality of a microarray study (Reimers, 2010). Publications 1 and 2 incorporate side information about the probes into preprocessing, and introduce improved probe-level analysis methods for differential gene expression studies.
In order to introduce probe-level preprocessing methods in more detail, let us consider the probe summarization step of the RMA algorithm (Irizarry et al., 2003a). RMA has a Gaussian model for probe effects with probe-specific mean parameters and a shared variance parameter for the probes. The mean parameters characterize probe-specific binding affinities that cause systematic differences in the signal levels captured by each probe. Estimating the probe-specific effects helps to remove this effect in the final probeset-level summary of the probe-level observations. To briefly outline the algorithm, let us consider a collection of probes $p = 1, \ldots, P$ (a probeset) that measure the expression level of the same target transcript in condition $t$. The probe-level observations are modeled as a sum of the true, underlying expression signal $g_t$, which is common to all probes, a probe-specific binding affinity $\mu_p$, and Gaussian noise $\varepsilon_{tp} \sim N(0, \sigma^2)$. A probe-level observation $s_{tp}$ for probe $p$ in condition $t$ is then modeled in RMA as

$$s_{tp} = g_t + \mu_p + \varepsilon_{tp}. \qquad (4.1)$$
Measurements from multiple conditions are needed to estimate the probe-specific effects $\mu_p$. RMA and other models that measure absolute gene expression have an important drawback: the probe affinity effects are unidentifiable. In order to obtain an identifiable model, the RMA algorithm includes an additional constraint that the probe affinity effects are zero on average: $\sum_p \mu_p = 0$. This yields a well-defined algorithm that has been shown to produce accurate measurements of gene expression in practical settings. Further extensions of the RMA algorithm include gcRMA, which has a more detailed chemical model for the probe effects (Wu and Irizarry, 2004), refRMA (Katz et al., 2006), which utilizes probe-specific effects derived from background data collections, and fRMA (McCall et al., 2010), which also models batch-specific effects in microarray studies. The estimation of unidentifiable probe affinities is a main challenge for most probe-level preprocessing models.
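The summarization step can be illustrated with a small sketch. RMA fits the additive model of Equation 4.1 robustly with Tukey's median polish; the minimal version below follows that idea, with the caveat that median polish constrains the probe affinities to have zero median rather than the zero mean stated above:

```python
import numpy as np

def median_polish(S, n_iter=10):
    """Robustly fit s_tp ~ g_t + mu_p (Tukey's median polish, as in RMA).

    S: (n_conditions, n_probes) matrix of log-scale probe intensities.
    Returns (g, mu): expression summaries per condition and probe affinities
    (with median(mu) ~ 0, rather than RMA's mean-zero constraint).
    """
    R = S.copy()
    g = np.zeros(S.shape[0])
    mu = np.zeros(S.shape[1])
    for _ in range(n_iter):
        row_med = np.median(R, axis=1)   # update condition effects
        g += row_med
        R -= row_med[:, None]
        col_med = np.median(R, axis=0)   # update probe affinities
        mu += col_med
        R -= col_med[None, :]
    return g, mu
```

On noise-free data generated from the additive model, the procedure recovers the condition signals and probe affinities exactly; with noise, the medians make the fit robust to individual outlying probes.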
RMA and other probe-level models for short oligonucleotide arrays have been designed to estimate absolute expression levels of the genes. However, gene expression studies are often ultimately targeted at investigating differential expression, that is, differences in gene expression between experimental conditions. Measurements of differential expression are obtained, for instance, by comparing the expression levels, obtained through the RMA algorithm or other methods, between different conditions. However, the summarization of the probe-level values is then performed prior to the actual comparison. Due to the unidentifiability of the probe affinity parameters in RMA and other probe-level models, this is potentially suboptimal. Publication 1 demonstrates that reversing the order, i.e., calculating differential gene expression already at the probe level before probeset-level summarization, leads to improved estimates of differential gene expression. The explanation is that the procedure circumvents the need to estimate the unidentifiable probe affinity parameters. This is formally described in Publication 2, which provides a probabilistic extension of the Probe-level Expression Change Averaging (PECA) procedure of Publication 1. In PECA, a standard weighted-average statistic summarizes the probe-level observations of differential gene expression. PECA does not model probe-specific effects, but it is shown to outperform widely used probe-level preprocessing methods, such as RMA, in estimating differential expression. Publication 2, considered in more detail in Section 4.3, provides an extended probabilistic framework that also models probe-specific effects.
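The difference in ordering can be sketched as follows: differential expression is computed for each probe first, and the probeset-level summary is then taken over the probe-level changes. The median stands in here as the robust summary; the published PECA procedure uses its own weighted-average statistic, so this is an illustration of the ordering, not the exact algorithm. Note how any additive per-probe affinity cancels:

```python
import numpy as np

def probe_level_differential_expression(A, B):
    """Probeset-level differential expression, computed probe-level first.

    A, B: (n_probes, n_replicates) log2 intensities for two conditions,
          rows matched so that row p is the same probe in both matrices.
    """
    probe_changes = A.mean(axis=1) - B.mean(axis=1)  # per-probe log-ratio
    return np.median(probe_changes)                  # robust probeset summary
```

Because each probe's change is a difference of its own observations, an unidentifiable additive affinity $\mu_p$ contributes equally to both terms and drops out, which is the formal reason this ordering helps.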
Probe-level preprocessing models and microarray analysis can be further improved by utilizing external information about the probes (Eisenstein, 2006; Hwang et al., 2004; Katz et al., 2006). Although any given microarray is designed based on the most up-to-date sequence information available at design time, rapidly evolving genomic sequence data can reveal inaccuracies in probe annotations as the body of knowledge grows. In recent studies, including Publication 1, a considerable number of probes on various oligonucleotide arrays have been found not to uniquely match their intended target (Hwang et al., 2004; Mecham et al., 2004a). In Publication 1, a considerable portion of the probes on several popular human and mouse microarray platforms did not match their intended mRNA target, or were found to target unintended mRNA transcripts, when compared against the Entrez Nucleotide sequence database (Wheeler et al., 2005) (Table 4.1). These observations are in general concordant with other studies, although the exact figures vary according to the utilized database and comparison details (Gautier et al., 2004; Mecham et al., 2004b). In this thesis, strategies are developed to improve microarray analysis with background information from genomic sequence databases, and with model-based analysis of microarray collections.
Probe verification is increasingly used in standard preprocessing, and to confirm the results of a microarray study. Matching the probe sequences of a given array to updated genomic sequence databases and constructing an alternative interpretation of the array data based on the most up-to-date genomic annotations has been shown to increase the accuracy and cross-platform consistency of microarray analyses in Publication 1 and elsewhere (Dai et al., 2005; Gautier et al., 2004).
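A probe verification step of this kind can be sketched with exact substring matching against a transcript collection. Real pipelines use alignment tools such as BLAST against databases like Entrez Nucleotide; the function name and data layout below are illustrative assumptions:

```python
def verify_probes(probes, transcripts):
    """Flag probes that do not match their intended target uniquely.

    probes: dict probe_id -> (probe_sequence, intended_transcript_id)
    transcripts: dict transcript_id -> mRNA sequence
    A probe is 'verified' only if it occurs in its intended target
    transcript and in no other transcript (exact substring matching).
    """
    verified = {}
    for pid, (seq, target) in probes.items():
        hits = [tid for tid, mrna in transcripts.items() if seq in mrna]
        verified[pid] = (hits == [target])
    return verified
```

Probes failing verification (no match, or extra matches indicating potential cross-hybridization) would then be excluded before summarization, which is the mechanism by which updated sequence databases improve cross-platform consistency.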
Publication 1 combines probe verification with a novel probe-level preprocessing method, PECA, to suggest a novel framework for comparing and combining results across different microarray platforms. While huge repositories of microarray data are available, the data for any particular experimental condition is typically scarce, and comes from a number of different microarray platforms. Therefore, reliable approaches for integrating microarray data are valuable. Integration of results across platforms has proven problematic due to various sources of technical variation between array technologies. Matching of probe sequences between microarray platforms has been shown to increase the consistency of microarray measurements (Hwang et al., 2004; Mecham et al., 2004b). However, probe matching between array platforms guarantees only technical comparability (Irizarry et al., 2005). Probe verification against external sequence databases is needed to confirm that the probes are also biologically accurate. This can also improve the comparability across array platforms, as confirmed by the validation studies in Publication 1 (Figure 4.2A).
The PECA method of Publication 1 utilizes genomic sequence databases to reduce probe-level noise by removing erroneous probes based on updated genomic knowledge. The strategy relies on external information in the databases and can therefore only remove known sources of probe-level contamination. Publication 2 introduces a probabilistic framework to measure probe reliability directly based on microarray data collections. The analysis can reveal both well-characterized and unknown sources of probe-level contamination, and leads to improved estimates of gene expression. This model, coined Robust Probabilistic Averaging (RPA), also provides a theoretically justified framework for incorporating prior knowledge of the probes into the analysis.
Table 4.1: Probe verification results from Publication 1, listing for each array type the number of probes and the percentage of verified probes. (Table contents not recovered.)
Standard approaches for investigating probe performance typically rely on external information, such as genomic sequence data (see Mecham et al. 2004b; Zhang et al. 2005 and Publication 1) or physical models (Naef and Magnasco, 2003; Wu et al., 2005). However, such models cannot reveal probes with uncharacterized sources of contamination, such as cross-hybridization with alternatively spliced transcripts or closely related mRNA sequences. Vast collections of microarray data are available in public repositories. These large-scale data sets contain valuable information of both biological and technical aspects of gene expression studies. Publication 2 introduces a data-driven strategy to extract and utilize probe-level information in microarray data collections.
The model, Robust Probabilistic Averaging (RPA), is a probabilistic preprocessing procedure that is based on explicit modeling assumptions to analyze probe reliability and quantify the uncertainty in measurement data based on gene expression data collections, independently of external information of the probes. The model can be viewed as a probabilistic extension of the probe-level preprocessing approach for differential gene expression studies presented in Publication 1. The explicit Bayesian formulation quantifies the uncertainty in the model parameters, and allows the incorporation of prior information concerning probe reliability into the analysis. RPA provides estimates of probe reliability, and a probeset-level estimate of differential gene expression directly from expression data and independently of the noise source. The RPA model is independent of physical models or external and constantly updated information such as genomic sequence data, but provides a framework for incorporating such prior information of the probes in gene expression analysis.
Other probabilistic methods for microarray preprocessing include BGX (Hein et al., 2005), gMOS (Milo et al., 2003) and its extensions (Liu et al., 2005). The key difference from the RPA procedure of Publication 2 is that these methods are designed to provide probeset-level summaries of absolute gene expression levels, and suffer from the same unidentifiability problem of probe affinity parameters as the RMA algorithm (Irizarry et al., 2003a). In contrast, RPA models probe-level estimates of differential gene expression. This removes the unidentifiability issue, which is advantageous when the objective is to compare gene expression levels between experimental conditions. Another important difference is that the other preprocessing methods do not provide explicit estimates of probe-specific parameters, or tools to investigate probe performance. Publication 2 assigns an explicit probabilistic measure of reliability to each probe. This gives tools to analyze probe performance and to guide probe design.
Let us now consider in more detail the probabilistic preprocessing framework, RPA, introduced in Publication 2. Probe performance is ultimately determined by the probe's ability to accurately measure the expression level of its target transcript, which is unknown in practical situations. Although the performance of individual probes varies, the collection of probes designed to measure the same transcript provides an internal reference for assessing probe performance (Figure 4.1A). RPA captures the shared signal of the probes within a probeset, and assumes that the shared signal characterizes the expression of the common target transcript of the probes. The reliability of individual probes is estimated with respect to the strongest shared signal of the probes. RPA assumes normally distributed probe effects, and quantifies probe reliability based on probe variance around the probeset-level signal across a large number of arrays. This extends the formulation of the RMA model in Equation 4.1 by introducing an additional probe-specific Gaussian noise component:

$$s_{tp} = g_t + \mu_p + \varepsilon_{tp}, \qquad \varepsilon_{tp} \sim N(0, \sigma_p^2).$$
In contrast to RMA, the variance is probe-specific in this model, and the noise is distributed as $\varepsilon_{tp} \sim N(0, \sigma_p^2)$. The variance parameters $\sigma_p^2$ are of interest in probe reliability analysis; they reflect the noise level of the probe, in contrast to probe-level preprocessing methods that focus on estimating the unidentifiable mean parameter $\mu_p$ of the Gaussian noise model, corresponding to probe affinity (see e.g. Irizarry et al., 2003a; Li and Wong, 2001). In Publication 2, probe-level calculation of differential expression avoids the need to model unidentifiable probe affinities, the key probe-specific parameter in other probe-level preprocessing methods. More formally, the unidentifiable probe affinity parameters cancel out in RPA when the signal log-ratio between a user-specified 'reference' array $c$ and the remaining arrays is computed for each probe: the differential expression signal between array $t$ and the reference array for probe $p$ is obtained by $m_{tp} = s_{tp} - s_{cp} = (g_t - g_c) + (\varepsilon_{tp} - \varepsilon_{cp})$. In vector notation, the differential expression profile of probe $p$ across the $T$ arrays is then written as $\mathbf{m}_p = \mathbf{d} + \boldsymbol{\varepsilon}_p$, i.e., a noisy observation of the true underlying differential expression signal $\mathbf{d} = (d_1, \ldots, d_T)$ and probe-specific noise $\boldsymbol{\varepsilon}_p$.
The unidentifiable probe affinity parameters cancel out in the RPA model of Publication 2. This can partly explain the previous empirical observations that calculating differential expression already at probe-level improves the analysis of differential gene expression (Zhang et al., 2002; Elo et al., 2005). However, the previous models are non-probabilistic preprocessing methods that do not aim at quantifying the uncertainty in the probes. Use of a single parameter for probe effects in RPA also gives more straightforward interpretations of probe reliability.
Posterior estimates of the model parameters are derived to estimate probe reliability and differential gene expression. The differential expression vector $\mathbf{d}$ and the probe-specific variances $\boldsymbol{\sigma}^2 = \{\sigma_p^2\}$ are estimated simultaneously. The posterior density of the model parameters is obtained from the likelihood of the data $\mathbf{M} = \{m_{tp}\}$ and the prior according to Bayes' rule (Equation 3.3) as

$$p(\mathbf{d}, \boldsymbol{\sigma}^2 \mid \mathbf{M}) \propto p(\mathbf{M} \mid \mathbf{d}, \boldsymbol{\sigma}^2)\, p(\mathbf{d}, \boldsymbol{\sigma}^2). \qquad (4.3)$$
To obtain this posterior, let us consider the likelihood of the data and the prior of the model parameters. The noise $\varepsilon_{cp}$ on the selected control array is a latent variable, and is marginalized out in the model to obtain the likelihood:

$$p(\mathbf{M} \mid \mathbf{d}, \boldsymbol{\sigma}^2) = \prod_p \int \prod_t N(m_{tp} \mid d_t - \varepsilon, \sigma_p^2)\, N(\varepsilon \mid 0, \sigma_p^2)\, d\varepsilon.$$
Let us assume independent priors, $p(\mathbf{d}, \boldsymbol{\sigma}^2) = p(\mathbf{d})\, p(\boldsymbol{\sigma}^2)$, a flat non-informative prior $p(\mathbf{d}) \propto 1$ for the differential expression vector, and conjugate priors for the variance parameters, $\sigma_p^2 \sim \text{Inv-Gamma}(\alpha_p, \beta_p)$ (inverse Gamma distribution, see Gelman et al. 2003). With these standard assumptions, the prior takes the form

$$p(\mathbf{d}, \boldsymbol{\sigma}^2) \propto \prod_p (\sigma_p^2)^{-(\alpha_p + 1)} \exp\left(-\frac{\beta_p}{\sigma_p^2}\right),$$
where $\alpha_p$ and $\beta_p$ are the shape and scale parameters of the inverse Gamma distribution. Prior information about the probes can be incorporated in the analysis through these parameters. Probe-level differential expression is then described by two sets of parameters: the differential gene expression vector $\mathbf{d}$, and the probe-specific variances $\{\sigma_p^2\}$. A high variance $\sigma_p^2$ indicates that the observations of probe $p$ deviate strongly from the estimated true signal $\mathbf{d}$. Denoting $\boldsymbol{\alpha} = \{\alpha_p\}$ and $\boldsymbol{\beta} = \{\beta_p\}$, the posterior of the model parameters in Equation 4.3 takes the form

$$p(\mathbf{d}, \boldsymbol{\sigma}^2 \mid \mathbf{M}, \boldsymbol{\alpha}, \boldsymbol{\beta}) \propto \prod_p (\sigma_p^2)^{-(\alpha_p + 1)} \exp\left(-\frac{\beta_p}{\sigma_p^2}\right) \int \prod_t N(m_{tp} \mid d_t - \varepsilon, \sigma_p^2)\, N(\varepsilon \mid 0, \sigma_p^2)\, d\varepsilon.$$
The formulation allows estimating the uncertainty in the expression estimates and probe-level parameters. In practice, a MAP point estimate of the parameters, obtained by maximizing the posterior, is often sufficient. In the limit of a large sample size ($T \rightarrow \infty$), the model converges to estimating ordinary mean and variance parameters. With the limited sample sizes that are typical in microarray studies, the prior parameters provide regularization that makes the probabilistic formulation more robust to overfitting and local optima, compared to direct estimation of the mean and variance parameters. Moreover, the probabilistic analysis takes the uncertainty in the data and model parameters into account in an explicit manner.
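A MAP point estimate for a simplified version of this model can be obtained by alternating two closed-form updates: a precision-weighted mean for the differential expression signal, and the inverse-Gamma posterior mode for each probe variance. The sketch below omits the marginalization of the reference-array noise, so it is an approximation of the published procedure rather than the RPA implementation itself:

```python
import numpy as np

def rpa_map(M, alpha=2.0, beta=1.0, n_iter=50):
    """MAP sketch for the simplified model m_tp = d_t + eps_tp,
    eps_tp ~ N(0, sigma_p^2), with a flat prior on d and an
    Inv-Gamma(alpha, beta) prior on each sigma_p^2.

    M: (n_arrays, n_probes) probe-level log-ratios vs. a reference array.
    Returns (d, sigma2): differential expression and probe variances.
    """
    T, P = M.shape
    sigma2 = np.ones(P)
    for _ in range(n_iter):
        w = 1.0 / sigma2                               # probe precisions
        d = (M * w).sum(axis=1) / w.sum()              # precision-weighted mean
        resid = M - d[:, None]
        # Inv-Gamma posterior mode for each probe variance
        sigma2 = (beta + 0.5 * (resid ** 2).sum(axis=0)) / (T / 2 + alpha + 1)
    return d, sigma2
```

Unreliable probes receive large variance estimates and therefore small weights in the signal update, which is exactly the mechanism by which RPA down-weights contaminated probes without discarding them outright.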
The model also provides a principled framework for incorporating prior knowledge of probe reliability in microarray preprocessing through the probe-specific hyperparameters $\alpha_p$ and $\beta_p$. Estimation and use of probe-specific effects from external microarray data collections has been previously suggested in the context of the refRMA method by Katz et al. (2006), where such side information was shown to improve gene expression estimates. The RPA method of Publication 2 provides an alternative, probabilistic treatment.
The probabilistic RPA model introduced in Publication 2 was validated by comparing its preprocessing performance to other preprocessing methods, and additionally by comparing the estimates of probe-level noise to known sources of probe-level contamination. The comparison methods include the FARMS (Hochreiter, 2006), MAS5 (Hubbell et al., 2002), PECA (Publication 1), and RMA (Irizarry et al., 2003a) preprocessing algorithms. FARMS has a more detailed model for probe effects than the other methods, and it implicitly contains a probe-specific variance parameter similar to that of the RPA model. FARMS is based on a factor analysis model, defined as $s_{tp} = \mu_p + \lambda_p z_t + \varepsilon_{tp}$, where the factor $z_t$ captures the underlying gene expression. In contrast to RMA and RPA, which have a single probe-specific parameter, FARMS has three probe-specific parameters, $\mu_p$, $\lambda_p$, and $\sigma_p^2$. MAS5 is a standard preprocessing algorithm provided by the array manufacturer. The algorithm performs local background correction, utilizes so-called mismatch probes to control for non-specific hybridization, and scales the data from each array to the same average intensity level to improve comparability across arrays. MAS5 summarizes probe-level observations of absolute gene expression levels using a robust summary statistic, the Tukey biweight estimate, but unlike FARMS, RMA and RPA, MAS5 does not model probe-specific effects.
The preprocessing performance of these methods was investigated in spike-in experiments where certain target transcripts measured by the array have been spiked in at known concentrations, as well as on real data sets. The results from the spike-in experiments were compared in terms of receiver operating characteristics (ROC). The standard RMA, PECA (Publication 1) and RPA (Publication 2) had comparable performance in spike-in data, and they outperformed the MAS5 (Hubbell et al., 2002) and FARMS (Hochreiter, 2006) preprocessing algorithms in estimating differential gene expression. On real data sets, PECA and RPA outperformed the other methods, providing higher reproducibility between technical replicates measured on different microarray platforms (Figure 4.2B).
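ROC comparisons of this kind reduce to computing, for each method, the probability that a randomly chosen spike-in transcript receives a higher differential expression score than a randomly chosen unchanged transcript. A minimal AUC computation via this rank-sum identity (a sketch of the evaluation metric, not of any particular spike-in benchmark):

```python
import numpy as np

def roc_auc(scores, labels):
    """Area under the ROC curve via the pairwise-comparison identity:
    AUC = P(score of a random spike-in > score of a random null gene).

    scores: detection statistic per gene (e.g. |estimated log-ratio|).
    labels: 1 for spiked-in genes, 0 for unchanged genes.
    """
    scores, labels = np.asarray(scores, float), np.asarray(labels)
    pos, neg = scores[labels == 1], scores[labels == 0]
    # count pairwise wins; ties count as half a win
    wins = (pos[:, None] > neg[None, :]).sum() \
        + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))
```

An AUC of 1.0 means the method ranks every spiked-in transcript above every unchanged one; 0.5 corresponds to random ordering, which makes the metric a natural common scale for comparing preprocessing algorithms.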
In contrast to standard preprocessing algorithms, RPA provides explicit quantitative estimates of probe performance. The model has been validated on widely used human whole-genome arrays by comparing the estimates of probe reliability with known probe-level error sources: errors in probe-genome alignment, interrogation position of a probe on the target sequence, GC-content, and the presence of SNPs in the probe target sequences; a good model for assessing probe reliability should detect probes contaminated by the known error sources. The results from our analysis can be used to characterize the relative contribution of different sources of probe-level noise (Figure 4.1B). In general, the probes with known sources of contamination were noisier than the other probes, with a 7-39% increase in the average variance, as detected by RPA. Any single source of error seems to explain only a fraction of the most highly contaminated probes. A large portion (35-60%) of the least reliable probes detected were not associated with the investigated known noise sources. This suggests that previous methods that remove probe-level noise based on external information, such as genomic alignments, will fail to detect a significant portion of poorly performing probes. The RPA model of Publication 2 provides rigorous algorithmic tools to investigate the various probe-level error sources. Better understanding of the factors affecting probe performance can advance probe design and contribute to reducing probe-related noise in future generations of gene expression arrays.
The contributions presented in this chapter provide improved preprocessing strategies for differential gene expression studies. The introduced techniques utilize probe-level analysis, as well as side information in sequence and microarray databases. Probe-level studies have led to the establishment of probe verification and alternative microarray interpretations as a standard step in microarray preprocessing and analysis. The alternative interpretations of microarray data based on updated genomic sequence data (Gautier et al., 2004; Dai et al., 2005) are now implemented as routine tools in popular preprocessing algorithms such as RMA, as well as in the RPA method of Publication 2. The probe-level analysis strategy has recently been extended to the exon array context, where expression levels of alternative splice variants of the same genes are compared under particular experimental conditions. The probe-level approach has shown superior preprocessing performance also with exon arrays (Laajala et al., 2009). Convenient access to the algorithmic tools developed in Publications 1 and 2 for microarray preprocessing and probe-level analysis is provided by the accompanying open source implementation in BioConductor (http://www.bioconductor.org/packages/release/bioc/html/RPA.html).
Global observations of transcriptional activity reflect known and previously uncharacterized cell-biological processes. Exploratory analysis of the transcriptome can provide research hypotheses and material for more detailed investigations. Widely-used standard approaches for global transcriptome analysis include various clustering, dimensionality reduction and visualization techniques (see e.g. Huttenhower and Hofmann, 2010; Polanski and Kimmel, 2007; Quackenbush, 2001). The large data collections open up new possibilities to investigate functional relatedness between physiological conditions, disease states, as well as cellular processes, and to discover previously uncharacterized connections and functional mechanisms (Bergmann et al., 2004; Kilpinen et al., 2008; Lukk et al., 2010).
Gene expression studies have traditionally focused on the analysis of relatively small and targeted data sets, such as particular diseases or cell types. A typical objective is to detect genes, or gene groups, that are differentially expressed between particular conditions, for instance to predict disease outcomes, or to identify potentially unknown disease subtypes. The increasing availability of large and versatile transcriptome collections that may cover thousands of experimental conditions allows global, data-driven analysis, and the formulation of novel research questions where the traditional analysis methods are often insufficient (Huttenhower and Hofmann, 2010).
A variety of approaches have been proposed and investigated in recent years in the global transcriptome analysis context. An actively studied modeling problem in transcriptome analysis is the discovery of transcriptional modules, i.e., identification of coherent gene groups that show coordinated transcriptional responses under particular conditions (Segal et al., 2003a, 2004; Stuart et al., 2003). Models have also been proposed to predict gene regulators (Segal et al., 2003b), and to infer cellular processes and networks based on transcriptional activation patterns (Friedman, 2004; Segal et al., 2003c). An increasing number of models are being developed to integrate transcriptome measurements with other sources of genomic information, such as regulation and interactions between the genes, to detect and characterize cellular processes and disease mechanisms (Barash and Friedman, 2002; Chari et al., 2010; Vaske et al., 2010). Findings from transcriptome analysis have potential biomedical implications, as in Lamb et al. (2006), where chemically perturbed cancer cell lines were screened to enhance the detection of drug targets based on shared functional mechanisms between disparate conditions, or in Sørlie et al. (2001), where cluster analysis of cancer patients based on genome-wide transcriptional profiling experiments led to the discovery of a novel breast cancer subtype. In the remainder of this section, the modeling approaches that are particularly closely related to the contributions of this thesis are considered in more detail.
A popular strategy for genome-wide gene expression analysis is to consider known biological processes and their activation patterns across diverse collections of measurement data from various experimental conditions. Biomedical databases contain a variety of information concerning genes and their interactions. For instance, the Gene Ontology database (Ashburner et al., 2000) provides functional and molecular classifications for the genes of human and a number of other organisms. Other categories are based on micro-RNA regulation, chromosomal locations, chemical perturbations and other features (Subramanian et al., 2005). Joint analysis of functionally related genes can increase the statistical power of the analysis. So-called gene set-based approaches are typically designed to test differential expression between two particular conditions (Goeman and Buhlmann, 2007; Nam and Kim, 2008), but they can also be used to build global maps of transcriptional activity of the known processes (Levine et al., 2006). However, gene set-based approaches typically ignore more detailed information of the interactions between individual genes. Pathway and interaction databases contain more detailed information concerning molecular interactions and cell-biological processes (Kanehisa et al., 2008; Vastrik et al., 2007). Network-based methods utilize relational information of the genes to guide expression analysis. For instance, Draghici et al. (2007) demonstrated that taking into account aspects of pathway topology, such as gene and interaction types, can improve the estimation of pathway activity between two predefined conditions. Another recent approach which utilizes pathway topology in inferring pathway activity is PARADIGM (Vaske et al., 2010), which also integrates other sources of genomic information in pathway analysis. However, these methods have been designed for the analysis of particular experimental conditions, rather than comprehensive expression atlases.
MATISSE (Ulitsky and Shamir, 2007) is a network-based approach that searches for functionally related genes that are connected in the network, and have correlated expression profiles across many conditions. The potential shortcoming of this approach is that it assumes global correlation across all conditions between the interacting genes, while many genes can have multiple, context-sensitive functional roles. Different conditions induce different responses in the same genes, and the definition of 'gene set' is vague (Montaner et al., 2009; Nacu et al., 2007). Therefore methods have been suggested to identify 'key condition-responsive genes' of predefined gene sets (Lee et al., 2008), or to decompose predefined pathways into smaller and more specific functional modules (Chang et al., 2009). These approaches rely on predefined functional classifications for the genes. The data-driven analysis in Publication 3 provides a complementary approach where the gene sets are learned directly from the data, guided by prior knowledge of genetic interactions. This avoids the need to refine suboptimal annotations, and enables the discovery of new processes. The findings demonstrate that simply measuring whether a gene set, or a network, is differentially expressed between particular conditions is often not sufficient for measuring the activity of cell-biological processes. Since gene function and interactions are regulated in a context-specific manner, it is important to additionally characterize how, and in which conditions, the expression changes. Global analysis of transcriptional activation patterns in interaction networks, introduced in Publication 3, can address such questions.
Approaches that are based on previously characterized genes and processes are biased towards well-characterized phenomena. This limits their value in de novo discovery of functional patterns. Unsupervised methods provide tools for such analysis, but often with an increased computational cost and a higher proportion of false positive findings.
Cluster analysis is widely used for unsupervised analysis of gene expression data, providing tools for class discovery, gene function prediction and for visualization purposes. Examples of widely used clustering approaches include hierarchical clustering and K-means (see e.g. Polanski and Kimmel, 2007). Clustering of patient samples with similar expression profiles has led to the discovery of novel cancer subtypes with biomedical implications (Sørlie et al., 2001); clustering of genes with coordinated activation patterns can be used, for instance, to predict novel functional associations for poorly characterized genes (Allocco et al., 2004). The self-organizing map (Kohonen, 1982, 2001) is a related approach that provides efficient tools to visualize high-dimensional data on lower-dimensional displays, with particular applications in transcriptional profiling studies (Tamayo et al., 1999; Törönen et al., 1999). The standard clustering methods are based on comparison of global expression patterns, and therefore are relatively coarse tools for analyzing large transcriptome collections. Different genes respond in different ways, as well as in different conditions. Therefore it is problematic to find clusters in high-dimensional data spaces, such as in whole-genome expression profiling studies; different gene groups can reveal different relationships between the samples. Detection of smaller, coherent subspaces with a particular structure can be useful in biomedical applications, where the objective is to identify sets of interesting genes for further analysis. Both genes and the associated conditions may be unknown, and the learning task is to detect them from the data. This can help, for instance, in identifying responses to drug treatments in particular genes (Ihmels et al., 2002; Tanay et al., 2002), or in identifying functionally coherent transcriptional modules in gene expression databases (Segal et al., 2004; Tanay et al., 2005).
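As a minimal illustration of the clustering approach, a bare-bones K-means for grouping genes with similar expression profiles is sketched below. The naive deterministic initialization is chosen for brevity; real analyses would use better initialization (e.g. multiple restarts) and model selection for the number of clusters:

```python
import numpy as np

def kmeans(X, k, n_iter=100):
    """Minimal K-means for grouping genes with similar expression profiles.

    X: (n_genes, n_conditions) expression matrix; returns cluster labels.
    Naive initialization: the first k genes serve as initial centers.
    """
    centers = X[:k].astype(float).copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # assign each gene to its nearest center (squared Euclidean distance)
        dists = ((X[:, None, :] - centers[None]) ** 2).sum(axis=-1)
        labels = np.argmin(dists, axis=1)
        # move each center to the mean of its assigned genes
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels
```

The sketch also makes the limitation discussed above concrete: the distance is computed over all conditions at once, so gene groups that co-vary only in a subset of conditions are invisible to it, which is the gap subspace clustering and biclustering methods address.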
Subspace clustering methods (Parsons et al., 2004) provide a family of algorithms that can be used to identify subsets of dependent features revealing coherent clustering for the samples; this defines a subspace in the original feature space. Subspace clustering models are a special case of a more general family of biclustering algorithms (Madeira and Oliveira, 2004). Closely related models are also called co-clustering (Cho et al., 2004), two-way clustering (Gad et al., 2000), and plaid models (Lazzeroni and Owen, 2002). Biclustering methods provide general tools to detect co-regulated gene groups and associated conditions from the data, to provide compact summaries and to aid interpretation of transcriptome data collections. Biclustering models enable the discovery of gene expression signatures (Hu et al., 2006) that have emerged as a central concept in the global expression analysis context. A signature describes a co-expression state of the genes, associated with particular conditions. Established signatures have been found to be reliable indicators of the physiological state of a cell, and commercial signatures have become available for routine clinical practice (Nuyten and van de Vijver, 2008). However, the established signatures are typically designed to provide optimal classification performance between two particular conditions. The problem with the classification-based signatures is that their associations to the underlying physiological processes are not well understood (Lucas et al., 2009). In Publication 3 the understanding is enhanced by deriving transcriptional signatures that are explicitly connected to well-characterized processes through the network.
Standard clustering models ignore prior information about the data, which could be used to supervise the analysis, to connect the findings to known processes, and to improve scalability. For instance, standard model-based feature selection or subspace clustering techniques would consider all potential connections between the genes or features (Law et al., 2004; Roth and Lange, 2004). Without additional constraints on the solution space they can typically handle at most tens or hundreds of features, which is often insufficient in high-throughput genomics applications. The use of side information in clustering can help to guide unsupervised analysis, for instance based on known or potential interactions between the genes. This has been shown to improve the detection of functionally coherent gene groups (Hanisch et al., 2002; Shiga et al., 2007; Ulitsky and Shamir, 2007; Zhu et al., 2005). However, while these methods provide tools to cluster the genes, they do not model differences between conditions. Extensions of biclustering models that can utilize relational information of the genes include cMonkey (Reiss et al., 2006) and a modified version of SAMBA biclustering (Tanay et al., 2004). However, cMonkey and SAMBA are application-oriented tools that rely on additional, organism-specific information, and their implementations are currently not available for most organisms, including human. Further application-oriented models for utilizing side information in the discovery of transcriptional modules have recently been proposed, for instance, by Savage et al. (2010) and Suthram et al. (2010). Publication 3 introduces a complementary method where the exhaustively large search space is limited with side information concerning known relations between the genes, derived from genomic interaction databases. This is a general algorithmic approach whose applicability is not limited to particular organisms.
Prior information on cellular networks, regulatory mechanisms, and gene function is often available, and can help to construct more detailed models of gene function and network analysis, as well as to summarize functional aspects of genomic data collections (Huttenhower et al., 2009; Segal et al., 2003b; Troyanskaya, 2005). Versatile transcriptome collections also enable network reconstruction, i.e., de novo discovery (Lezon et al., 2006; Myers et al., 2005) and augmentation (Novak and Jain, 2006) of genetic interaction networks. Other methodological approaches for global transcriptome analysis are provided by probabilistic latent variable models (Rogers et al., 2005; Segal et al., 2003a), hierarchical Dirichlet process algorithms (Gerber et al., 2007), as well as matrix and tensor computations (Alter and Golub, 2005). These methods provide further model-based tools to identify and characterize transcriptional programs by decomposing gene expression data sets into smaller, functionally coherent components.
Molecular interaction networks cover thousands of genes, proteins and small molecules. Coordinated regulation of gene function through molecular interactions determines cell function, and is reflected in the transcriptional activity of the genes. Since individual processes and their transcriptional responses are in general unknown (Lee et al., 2008; Montaner et al., 2009), data-driven detection of condition-specific responses can provide an efficient proxy for identifying distinct transcriptional states of the network with potentially distinct functional roles. While a number of methods have been proposed to compare network activation patterns between particular conditions (Draghici et al., 2007; Ideker et al., 2002; Cabusora et al., 2005; Noirel et al., 2008), or to use network information to detect functionally related gene groups (Segal et al., 2003d; Shiga et al., 2007; Ulitsky and Shamir, 2007), general-purpose algorithms for a global analysis of context-specific network activation patterns on a genome- and organism-wide scale have been missing.
Publication 3 introduces and validates two general-purpose algorithms that provide tools for global modeling of transcriptional responses in interaction networks. The motivation is similar to biclustering approaches that detect functionally coherent gene groups that show coordinated response in a subset of conditions (Madeira and Oliveira, 2004). The network ties the findings more tightly to cell-biological processes, focusing the analysis and improving interpretability. In contrast to previous network-based biclustering models for global transcriptome analysis, such as cMonkey (Reiss et al., 2006) or SAMBA (Tanay et al., 2004), the algorithms introduced in Publication 3 are general-purpose tools, and do not depend on organism-specific annotations.
The first approach in Publication 3 is a straightforward extension of network-based gene clustering methods. In this two-step approach, the functionally coherent subnetworks and their condition-specific responses are detected in separate steps. In the first step, a network-based clustering method is used to detect functionally coherent subnetworks. In Publication 3, MATISSE, a state-of-the-art algorithm described in Ulitsky and Shamir (2007), is used to detect the subnetworks. MATISSE finds connected subgraphs in the network that have high internal correlations between the genes. In the second step, condition-specific responses of each identified subnetwork are searched for with a nonparametric Gaussian mixture model, which allows data-driven detection of the responses. However, the two-step approach, coined MATISSE+, can be suboptimal for detecting subnetworks with particular condition-specific responses. The main contribution of Publication 3 is to introduce a second general-purpose algorithm, coined NetResponse, where the detection of condition-specific responses is used as the explicit key criterion for subnetwork search.
The network-based search procedure introduced in Publication 3 searches for local subnetworks, i.e., functionally coherent network modules where the interacting genes show coordinated responses in a subset of conditions (Figure 5.1). Side information of the gene interactions is used to guide modeling, but the algorithm is independent of predefined classifications for genes or measurement conditions. Transcriptional responses of the network are described in terms of subnetwork activation. Regulation of the subnetwork genes can involve simultaneous activation and repression of the genes: sufficient amounts of mRNA for key proteins have to be available while interfering genes may need to be silenced. The model assumes that a given subnetwork can have multiple transcriptional states, associated with different physiological contexts. A transcriptional state is reflected in a unique expression signature μ_r, a vector that describes the expression levels of the subnetwork genes in the particular transcriptional state. Expression of some genes is regulated at precise levels, whereas other genes fluctuate more freely. Given the state, expression of the subnetwork genes is modeled as a noisy observation of the transcriptional state. With a Gaussian noise model with covariance Σ_r, the observation is described by x ~ N(μ_r, Σ_r). A given subnetwork can have R latent transcriptional states, indexed by r. In practice, the states, including their number R, are unknown, and they have to be estimated from the data. In a specific measurement condition, the subnetwork can be in any one of the R latent physiological states. Associations between the observations and the underlying transcriptional states are unknown and they are treated as latent variables. Gene expression x in a subnetwork is then modeled with a Gaussian mixture model:

p(x) = Σ_{r=1}^{R} w_r N(x | μ_r, Σ_r),    (5.1)
where each component distribution is assumed to be Gaussian with parameters (μ_r, Σ_r). In practice, we assume a diagonal covariance matrix Σ_r, leaving the dependencies between the genes unmodeled within each transcriptional state. The use of diagonal covariances is justified by considerable gains in computational efficiency when the detection of distinct responses is of primary interest. It is possible, however, that such a simplified model will fail to detect certain subnetworks where the transcriptional levels of the genes have strong linear dependencies within the individual transcriptional states; signaling cascades, for instance, could be expected to manifest such activation patterns. More detailed models of transcriptional activity could help to distinguish the individual states, in particular when the transcriptional states are partially overlapping, but with increased computational cost. A particular transcriptional response is then characterized by the triple (μ_r, Σ_r, w_r). This defines the shape, fluctuations and frequency of the associated transcriptional state of the subnetwork. A posterior probability of each latent state can be calculated for each measurement sample from Bayes' rule (Equation 3.3). The posterior probabilities can be interpreted as soft component memberships for the samples. A hard, deterministic assignment is obtained by selecting for each sample the component with the highest posterior probability.
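As a concrete illustration, the posterior computation can be sketched as follows. The state parameters below are hypothetical, and the diagonal-covariance mixture is a minimal stand-in for the fitted model of Publication 3:

```python
import numpy as np

def posterior_responsibilities(X, weights, means, variances):
    """Posterior P(state r | sample x) for a diagonal-covariance
    Gaussian mixture, computed via Bayes' rule in log-space."""
    log_lik = np.stack([
        -0.5 * np.sum(np.log(2 * np.pi * var) + (X - mu) ** 2 / var, axis=1)
        for mu, var in zip(means, variances)
    ], axis=1)                                 # shape (n_samples, n_states)
    log_post = np.log(weights) + log_lik       # unnormalized log-posterior
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)

# two hypothetical transcriptional states of a 3-gene subnetwork
means = [np.array([0.0, 0.0, 0.0]), np.array([2.0, -2.0, 1.0])]
variances = [np.ones(3), np.ones(3)]
weights = np.array([0.5, 0.5])

X = np.array([[0.1, -0.2, 0.0],    # close to the first state
              [2.1, -1.8, 1.2]])   # close to the second state
P = posterior_responsibilities(X, weights, means, variances)
hard = P.argmax(axis=1)            # hard, deterministic state assignment
```

The rows of `P` are the soft component memberships; `hard` gives the deterministic assignment described above.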
The remaining task is to identify the subnetworks having such distinct transcriptional states. Detection of the distinct states is now used as a search criterion for the subnetworks. In order to achieve fast computation, an agglomerative procedure is used where interacting genes are gradually merged into larger subnetworks. Initially, each gene is assigned to its own singleton subnetwork. Agglomeration proceeds by merging at each step the two neighboring subnetworks where joint modeling of the genes leads to the highest improvement in the objective function value. Joint modeling of dependent genes reveals coordinated responses and improves the likelihood of the data in comparison with independent models, giving the first criterion for merging the subnetworks. However, increasing subnetwork size tends to increase model complexity and the possibility of overfitting, since the number of samples remains constant while the dimensionality (subnetwork size) increases. To compensate for this effect, the Bayesian information criterion (see Gelman et al., 2003) is used to penalize increasing model complexity and to determine the optimal subnetwork size. The final cost function for a subnetwork is C = −log p(X) + (k/2) log N, where log p(X) is the (marginal) log-likelihood of the data given the mixture model in Equation 5.1, k is the number of parameters, and N denotes the sample size. The algorithm then compares independent and joint models for each subnetwork pair that has a direct link in the network, and merges at each step the subnetwork pair that minimizes the cost.
The iteration continues until no improvement is obtained by merging the subnetworks. The combination of modeling techniques yields a scalable algorithm for genome- and organism-wide investigations. First, the analysis focuses on those parts of the data that are supported by known interactions, which increases modeling power and considerably limits the search space. Second, the agglomerative scheme finds a fast approximate solution where at each step the subnetwork pair that leads to the highest improvement in the cost function is merged. Third, an efficient variational approximation is used to learn the mixture models (Kurihara et al., 2007b). Note that the algorithm does not necessarily identify a globally optimal solution. However, detection of physiologically coherent and reproducible responses is often sufficient for practical applications.
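The agglomerative search can be sketched on toy data. For simplicity, this sketch scores each candidate subnetwork with a single full-covariance Gaussian instead of the variational mixture used in Publication 3, but the merge criterion, comparing independent versus joint models under a BIC-style penalty, follows the same logic; the gene names and data are synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)

def cost(X):
    """BIC-style cost  C = -log L + (k/2) log n  for a full-covariance
    Gaussian fit (a simplified stand-in for the mixture model)."""
    n, d = X.shape
    cov = np.cov(X, rowvar=False).reshape(d, d) + 1e-9 * np.eye(d)
    log_lik = -0.5 * n * (d * np.log(2 * np.pi * np.e)
                          + np.linalg.slogdet(cov)[1])
    k = d + d * (d + 1) // 2          # mean and covariance parameters
    return -log_lik + 0.5 * k * np.log(n)

def agglomerate(data, edges):
    """Greedily merge linked subnetworks while the joint model beats
    the two independent models under the BIC-style cost."""
    groups = {g: (g,) for g in data}              # singleton subnetworks
    links = {frozenset(e) for e in edges}
    def X(genes):
        return np.column_stack([data[g] for g in genes])
    while links:
        # improvement = independent cost - joint cost, per linked pair
        scored = [(cost(X(groups[a])) + cost(X(groups[b]))
                   - cost(X(groups[a] + groups[b])), a, b)
                  for a, b in map(tuple, links)]
        gain, a, b = max(scored)
        if gain <= 0:                             # no improving merge left
            break
        groups[a] = groups[a] + groups[b]
        del groups[b]
        links = {frozenset(a if g == b else g for g in l)
                 for l in links if l != frozenset((a, b))}
        links = {l for l in links if len(l) == 2}
    return sorted(tuple(sorted(v)) for v in groups.values())

# toy data: g1 and g2 respond together, g3 is independent noise
n = 200
shared = rng.normal(size=n)
data = {"g1": shared + 0.3 * rng.normal(size=n),
        "g2": shared + 0.3 * rng.normal(size=n),
        "g3": rng.normal(size=n)}
edges = [("g1", "g2"), ("g2", "g3")]
result = agglomerate(data, edges)
```

The co-regulated pair g1, g2 is merged because the joint likelihood gain outweighs the complexity penalty, while g3 remains a singleton despite its network link.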
The NetResponse algorithm introduced in Publication 3 was applied to investigate transcriptional activation patterns of a pathway interaction network of 1800 genes, based on the KEGG database of metabolic pathways (Kanehisa et al., 2008) as provided by the SPIA package (Tarca et al., 2009), across 353 gene expression samples from 65 tissues. The two algorithms proposed in Publication 3, MATISSE+ and NetResponse, were shown to outperform an unsupervised biclustering approach in terms of reproducibility of the findings. The introduced NetResponse algorithm, where the detection of transcriptional response patterns is used as a search criterion for subnetwork identification, was the best-performing method. The algorithm identified 106 subnetworks of 3-20 genes with distinct transcriptional responses across the conditions. One of the subnetworks is illustrated in Figure 5.1; the other findings are provided in the supplementary material of Publication 3. The detected transcriptional responses were physiologically coherent, suggesting a potential functional role. The reproducibility of the responses was confirmed in an independent validation data set, where 80% of the predicted responses were detected. The findings highlight context-specific regulation of the genes. Some responses are shared by many conditions, while others are more specific to particular contexts such as the immune system, muscles, or the brain; related physiological conditions often exhibit similar network activation patterns. Tissue relatedness can be measured in terms of shared transcriptional responses of the subnetworks, giving an alternative formulation of the tissue connectome map suggested by Greco et al. (2008), which highlights functional connectivity between tissues based on the number of shared differentially expressed genes. In Publication 3, shared network responses are used instead of shared gene counts.
The use of co-regulated gene groups is expected to be more robust to noise than the use of individual genes. The analysis provides a global view on network activation across the normal human body, and can be used to formulate novel hypotheses of gene function in previously unexplored contexts.
Gene function and interactions are often subject to condition-specific regulation (Liang et al., 2006; Rachlin et al., 2006), but they have typically been studied only in particular experimental conditions. Organism-wide analysis can potentially reveal new functional connections, help to formulate novel hypotheses of gene function in previously unexplored contexts, and detect highly specialized functions that are specific to few conditions. Changes in cell-biological conditions induce changes in the expression levels of co-regulated genes, in order to produce specific physiological responses, typically affecting only a small part of the network. Since individual processes and their transcriptional responses are in general unknown (Lee et al., 2008; Montaner et al., 2009), data-driven detection of condition-specific responses can provide an efficient proxy for identifying distinct transcriptional states of the network, with potentially distinct functional roles.
Publication 3 provides efficient model-based tools for global, organism-wide discovery and characterization of context-specific transcriptional activity in genome-scale interaction networks, independently of predefined classifications for genes and conditions. The network is used to bring in prior information about gene function, which would be missing in unsupervised models, and allows data-driven detection of coordinately regulated gene sets and their context-specific responses. The algorithm is readily applicable in any organism where gene expression and pairwise interaction data, including pathways, protein interactions and regulatory networks, are available. It therefore has a considerably larger scope than previous network-based models for global transcriptome analysis, which rely on organism-specific annotations but lack implementations for most organisms (Reiss et al., 2006; Tanay et al., 2004).
While biomedical implications of the findings require further investigation, the results highlight shared and reproducible responses between physiological conditions, and provide a global view of transcriptional activation patterns across the normal human body. Other potential applications for the method include large-scale screening of drug responses and disease subtype discovery. An implementation of the algorithm is freely available through BioConductor (http://bioconductor.org/packages/devel/bioc/html/netresponse.html).
The integrative approaches can be roughly classified into three categories: methods that (i) combine statistical evidence across related studies in order to obtain more accurate inferences of target variables, (ii) utilize side information in order to guide the analysis of a single, primary data source, and (iii) detect and characterize dependencies between the measurement sources in order to discover new functional connections between the different layers of genomic information. The contributions in Chapters 4 and 5 are associated with the first two categories; the contributions presented in this chapter, the regularized dependency detection framework of Publication 4 and the associative clustering of Publications 5 and 6, belong to the third category.
The first general category of methods for genomic data integration consists of approaches where evidence across similar studies is combined to increase statistical power, for instance by comparing and integrating data from independent microarray experiments targeted at studying the same disease. In Publications 2 and 3, joint analysis of a large number of commensurable microarray experiments, where the observed data is directly comparable between the arrays, helps to increase statistical power and to reveal weak, shared signals in the data that cannot be detected in more restricted experimental setups and smaller datasets.
However, the related observations are often not directly comparable, and further methodological tools are needed for integration. Meta-analysis provides tools for such settings (Ramasamy et al., 2008), and forms part of the microarray analysis procedure introduced in Publication 1, where methods to integrate related microarray measurements across different array platforms are developed. Meta-analysis emphasizes effects shared between the studies over statistical significance in individual experiments. In its standard form, meta-analysis assumes that each individual study measures the same target variable with varying levels of noise. The analysis starts by identifying a measure of effect size based on differences, means, or other summary statistics of the observations, such as Hedges' g, used in Publication 1. Weighted averaging of the effect sizes provides the final, combined result. The weighting accounts for differences in the reliability of the individual studies, for instance by emphasizing studies with a large sample size or low measurement variance. Averaging is expected to yield more accurate estimates of the target variable than the individual studies. This can be particularly useful when several studies with small sample sizes are available, for instance from different laboratories; this is a common setting in microarray analysis, where the data sets produced by individual laboratories are routinely deposited in shared community databases. Ultimately, the quality of meta-analysis results rests on the quality of the individual studies. Modeling choices, such as the choice of the effect size measure and the set of included studies, will also affect the analysis outcome.
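A fixed-effect meta-analysis along these lines can be sketched as follows; the studies are simulated, and the large-sample variance formula for g is a standard approximation rather than necessarily the exact weighting used in Publication 1:

```python
import numpy as np

def hedges_g(x, y):
    """Hedges' g: standardized mean difference with small-sample
    bias correction."""
    nx, ny = len(x), len(y)
    sp = np.sqrt(((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1))
                 / (nx + ny - 2))                  # pooled standard deviation
    d = (np.mean(x) - np.mean(y)) / sp
    J = 1 - 3 / (4 * (nx + ny) - 9)                # bias-correction factor
    return J * d

def g_variance(g, nx, ny):
    """Approximate large-sample sampling variance of g."""
    return (nx + ny) / (nx * ny) + g ** 2 / (2 * (nx + ny))

def combine(studies):
    """Fixed-effect meta-analysis: inverse-variance weighted mean of g."""
    gs = np.array([hedges_g(x, y) for x, y in studies])
    ws = np.array([1 / g_variance(g, len(x), len(y))
                   for g, (x, y) in zip(gs, studies)])
    return np.sum(ws * gs) / np.sum(ws)

# four simulated small studies with a true standardized effect of 1.0
rng = np.random.default_rng(0)
studies = [(rng.normal(1.0, 1.0, 12), rng.normal(0.0, 1.0, 12))
           for _ in range(4)]
pooled = combine(studies)
```

The inverse-variance weights favor the more reliable studies, so the pooled estimate is more stable than any single small-sample g.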
Kernel methods (see e.g. Schölkopf and Smola, 2002) provide another widely used approach for integrating statistical evidence across multiple, potentially heterogeneous measurement sources. Kernel methods operate on similarity matrices, and provide a natural framework for combining statistical evidence to detect similarity and patterns that are supported by multiple observations. The modeling framework also allows for efficient modeling of nonlinear feature spaces.
Multi-task learning refers to a class of approaches where multiple, related modeling tasks are solved simultaneously by combining statistical power across the related tasks. A typical task is to improve the accuracy of individual classifiers by taking advantage of the potential dependencies between them (see e.g. Caruana, 1997).
The second category of approaches for genomic data integration consists of methods that are asymmetric by nature: integration is used to support the analysis of one primary data source. Side information can be used, for instance, to limit the search space and focus the analysis, to avoid overfitting, to speed up computation, and to obtain potentially more sensitive and accurate findings (see e.g. Eisenstein, 2006). One strategy is to impose hard constraints on the model, or model family, based on side information, in order to target specific research questions. In the gene expression context, functional classifications or known interactions between the genes can be used to constrain the analysis (Goeman and Buhlmann, 2007; Ulitsky and Shamir, 2009). In factor analysis and mixed effect models, clinical annotations of the samples help to focus the modeling on particular conditions (see e.g. Carvalho et al., 2008). Hard constraints rely heavily on the accuracy of the side information. Soft, or probabilistic, approaches can take the uncertainty in side information into account, but they are computationally more demanding. Examples of such methods in the context of transcriptome analysis include the supervised biclustering models, such as cMonkey and modified SAMBA, as well as other methods that guide the analysis with additional information on genes and regulatory mechanisms, such as transcription factor binding (Reiss et al., 2006; Savage et al., 2010; Tanay et al., 2004). Publication 3 uses the gene interaction network as a hard constraint for modeling transcriptional co-regulation of the genes, but the condition-specific responses of the detected gene groups are identified in an unsupervised manner.
A complementary approach to utilizing side information about the experiments is provided by multi-way learning. A classical example is the analysis of variance (ANOVA), where a single data set is modeled by decomposing it into a set of basic, underlying effects that characterize the data optimally. The effects are associated with multiple, potentially overlapping attributes of the measurement samples, such as disease state, gender and age, which are known prior to the analysis. Taking such prior knowledge of systematic variation between the samples into account helps to increase modeling power, and can reveal the attribute-specific effects. An interesting subtask is to model the interactions between the attributes, the so-called interaction effects. These are manifested only with particular combinations of attributes, and indicate dependency between the attributes. For instance, simultaneous cigarette smoking and asbestos exposure increases the risk of lung cancer considerably more than either risk factor alone (see e.g. Nymark et al., 2007). Factor analysis is a closely related approach where the attributes, also called factors, are not given but instead estimated from the data. Mixed effect models combine the supervised and unsupervised approaches by incorporating both fixed and random effects in the model, corresponding to the known and latent attributes, respectively (see e.g. Carvalho et al., 2008). The standard factorization approaches for individual data sets are related to the dependency-seeking approaches of Publications 4-6, where co-occurring data sources are decomposed in an unsupervised manner into components that are maximally informative of the components in the other data set.
Symmetric models for dependency detection form the third main category of methods for genomic data integration, as well as the main topic of this chapter. Dependency modeling is used to distinguish the shared signal from dataset-specific variation. The shared effects are informative of the commonalities and interactions between the observations, and are often the main focus of interest in integrative analysis. This motivates the development of methods that can allocate computational resources efficiently to modeling of the shared features and interactions.
Multi-view learning is a general category of approaches for symmetric dependency modeling tasks. In multi-view learning, multiple measurement sources are available, and each source is considered a different view of the same objects. The task is to enhance modeling performance by combining the complementary views. A classical example of such a model is canonical correlation analysis (Hotelling, 1936). Related approaches that have recently been applied in functional genomics include, for instance, probabilistic variants of meta-analysis (Choi et al., 2007; Conlon et al., 2007), generalized singular value decomposition (see e.g. Alter et al., 2003; Berger et al., 2006) and simultaneous non-negative matrix factorization (Badea, 2008).
The dependency modeling approaches in this thesis make an explicit distinction between the statistical representation of the data and the modeling task. Let us denote the representations of two co-occurring multivariate observations, x and y, by f(x) and g(y), respectively. The selected representations depend on the application task. The representation can, for instance, be used to perform feature selection as in canonical correlation analysis (CCA; Hotelling, 1936), to capture nonlinear features in the data as in kernelized versions of CCA (see e.g. Yamanishi et al., 2003), or to partition the data as in the information bottleneck (Friedman et al., 2001) and associative clustering (Publications 5-6). Statistical independence of the representations implies that their joint probability density decomposes as p(f(x), g(y)) = p(f(x))p(g(y)). Deviations from this assumption indicate statistical dependency. The representations can have a flexible parametric form, which can be optimized by the dependency modeling algorithms to identify the dependency structure in the data.
Recent examples of such dependency-maximizing methods include probabilistic canonical correlation analysis (Bach and Jordan, 2005), which has a close theoretical connection to the regularized models introduced in Publication 4, and the associative clustering principle introduced in Publications 5-6. Canonical correlations and contingency table analysis form the methodological background for the contributions in Publications 4-6. In the remainder of this section these two standard approaches for dependency detection are considered more closely.
Canonical correlation analysis (CCA) is a classical method for detecting linear dependencies between two multivariate random variables (Hotelling, 1936). While ordinary correlation characterizes the association strength between two vectors of paired scalar observations, CCA assumes paired vectorial observations, and generalizes correlation to multidimensional sources by searching for maximally correlating low-dimensional representations of the two sources, defined by linear projections. Multiple projection components can be obtained iteratively, by finding the most correlating projection first, and then consecutively the next ones after removing the dependencies explained by the previous CCA components; the lower-dimensional representations are defined by projections onto linear hyperplanes. The model can be formulated as a generalized eigenvalue problem that has an analytical solution with two useful properties: the result is invariant to linear transformations of the data, and for Gaussian data the solution for any fixed number of components maximizes the mutual information between the projections (Kullback, 1959; Bach and Jordan, 2002). Extensions of the classical CCA include generalizations to multiple data sources (Kettenring, 1971; Bach and Jordan, 2002), regularized solutions with non-negative and sparse projections (Sigg et al., 2007; Archambeau and Bach, 2008; Witten et al., 2009), and non-linear extensions, for instance with kernel methods (Bach and Jordan, 2002; Yamanishi et al., 2003). Direct optimization of correlations in the classical CCA provides an efficient way to detect dependencies between data sources, but it lacks an explicit model to deal with the uncertainty in the data and model parameters.
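The classical solution can be sketched directly from sample covariances, via whitening and a singular value decomposition of the cross-covariance; the small ridge term is a numerical safeguard added here, not part of the classical formulation:

```python
import numpy as np

def cca(X, Y, n_components=1, reg=1e-6):
    """Classical CCA: maximally correlated linear projections of two
    co-occurring data sets, via whitening + SVD of the cross-covariance."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])   # ridge for stability
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n
    Wx = np.linalg.inv(np.linalg.cholesky(Cxx))    # whitening transforms
    Wy = np.linalg.inv(np.linalg.cholesky(Cyy))
    U, s, Vt = np.linalg.svd(Wx @ Cxy @ Wy.T)
    A = Wx.T @ U[:, :n_components]                 # projection weights for X
    B = Wy.T @ Vt[:n_components].T                 # projection weights for Y
    return A, B, s[:n_components]                  # s: canonical correlations

# toy data: both views are noisy linear images of a shared signal z
rng = np.random.default_rng(0)
n = 500
z = rng.normal(size=n)
X = np.column_stack([z + 0.5 * rng.normal(size=n) for _ in range(3)])
Y = np.column_stack([z + 0.5 * rng.normal(size=n) for _ in range(2)])
A, B, corr = cca(X, Y)
```

Because both views share the latent signal z, the first canonical correlation is close to one despite the per-feature noise.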
Recently, the classical CCA was shown to correspond to the ML solution of a particular generative model where the two data sets are assumed to stem from a shared Gaussian latent variable and normally distributed data-set-specific noise (Bach and Jordan, 2005). Under linear assumptions, the model is formally defined as

x = W_x z + ε_x,
y = W_y z + ε_y.
The manifestation of the shared signal z in each data set can be different; this is parameterized by W_x and W_y. Assuming a standard Gaussian model z ~ N(0, I) for the shared latent variable and Gaussian data-set-specific effects ε_x ~ N(0, Ψ_x) (and respectively for ε_y), the correlation-maximizing projections of the traditional CCA introduced in Section 6.1 can be retrieved from the ML solution of the model (Archambeau et al., 2006; Bach and Jordan, 2005). The model decomposes the observed co-occurring data sets into shared and data-set-specific components based on explicit modeling assumptions (Figure 6.1). The data-set-specific effects can also be described in terms of latent variables, allowing the construction of more detailed models for the data-set-specific effects (Klami and Kaski, 2008). The shared signal z is treated as a latent variable and marginalized out in the model, providing the marginal likelihood for the observations:

p(x, y) = N([x; y] | 0, W W^T + Ψ),

where Ψ denotes the block-diagonal matrix of Ψ_x and Ψ_y, and W is obtained by stacking W_x and W_y. The probabilistic formulation of CCA has opened up a way to new probabilistic extensions that can treat the modeling assumptions and uncertainties in the data in a more explicit and robust manner (Archambeau et al., 2006; Klami and Kaski, 2008; Klami et al., 2010).
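The cross-covariance structure implied by marginalizing out the shared latent variable can be checked by simulating from the generative model; the loadings and noise level below are illustrative values, not estimates from any data set:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 20000

# illustrative loadings and noise level (hypothetical values)
Wx = np.array([[1.0], [0.5], [-0.5]])   # x-side manifestation of z
Wy = np.array([[0.8], [-0.4]])          # y-side manifestation of z
noise = 0.2                             # diagonal Psi_x = Psi_y = 0.2 I

# generative model:  z ~ N(0, I);  x = Wx z + eps_x;  y = Wy z + eps_y
z = rng.normal(size=(n, 1))
x = z @ Wx.T + np.sqrt(noise) * rng.normal(size=(n, 3))
y = z @ Wy.T + np.sqrt(noise) * rng.normal(size=(n, 2))

# marginalizing z out leaves Cov[x, y] = Wx Wy^T in the joint covariance
C_emp = (x - x.mean(0)).T @ (y - y.mean(0)) / n
```

The empirical cross-covariance `C_emp` matches the analytic block Wx Wy^T, illustrating that all dependency between the two views is carried by the shared component.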
The general formulation provides a flexible modeling framework, where different modeling assumptions can be used to adapt the models in different applications. The connection to classical CCA assumes full covariances for the dataset-specific effects. Simpler models for the dataset-specific effects will not distinguish between the shared and marginal effects as effectively, but they have fewer model parameters that can potentially reduce overlearning and speed up computation. It is also possible to tune the dimensionality of the shared latent signal. Learning of lower-dimensional models can be faster and potentially less prone to overfitting. Interpretation of simpler models is also more straightforward in many applications. The probabilistic formulation allows rigorous treatment of uncertainties in the data and model parameters also with small sample sizes that are common in biomedical studies, and allows the incorporation of prior information through Bayesian priors, as in the regularized dependency detection framework introduced in Publication 4.
Contingency table analysis is a classical approach for studying associations between co-occurring categorical observations. The co-occurrences are represented by cross-tabulating them on a contingency table, the rows and columns of which correspond to the first and second set of features, respectively. Various tests are available for measuring dependency between the rows and columns of the table (Yates, 1934; Agresti, 1992), including the classical Fisher test (Fisher, 1934), a standard tool for measuring statistical enrichment of functional categories in gene cluster analysis (Hosack et al., 2003). While classical contingency table analysis is used to measure dependency between co-occurring variables, more recent approaches use contingency tables to derive objective functions for dependency exploration tasks. The associative clustering principle introduced in Publications 5-6 is an example of such an approach.
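As an illustration of the enrichment use of the Fisher test, the one-sided p-value is a hypergeometric tail probability that can be computed directly; the gene counts below are hypothetical:

```python
from math import comb

def fisher_enrichment_p(k, K, n, N):
    """One-sided Fisher exact test: probability of observing >= k
    annotated genes in a cluster of size n, drawn from N genes of
    which K carry the annotation (hypergeometric tail)."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / comb(N, n)

# hypothetical example: a 10-gene cluster contains 6 of the genome's
# 50 annotated cell-cycle genes, out of 1000 genes in total
p = fisher_enrichment_p(6, 50, 10, 1000)
```

Under the null hypothesis of independent margins such an overlap is extremely unlikely, so the cluster would be called enriched for the annotation.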
Other approaches that use contingency table dependencies as objective functions include the information bottleneck (IB) principle (Tishby et al., 1999) and discriminative clustering (DC) (Sinkkonen et al., 2002; Kaski et al., 2005). These are asymmetric, dependency-seeking approaches that can be used to discover cluster structure in a primary data set such that it is maximally informative of another, discrete auxiliary variable. The dependency is represented on a contingency table, and maximization of contingency table dependencies provides the objective function for clustering. While the standard IB operates on discrete data, DC is used to discover cluster structure in continuous-valued data. The two approaches also employ different objective functions. In classical IB, a discrete variable X is clustered in such a way that the cluster assignments C become maximally informative of another discrete variable Y. The complexity of the cluster assignments is controlled by minimizing the mutual information between the cluster indices and the original variables. The task is to find a partitioning that minimizes the cost I(C; X) - beta * I(C; Y), where beta controls clustering resolution. In DC, mutual information is replaced by a Bayes factor between the two hypotheses of dependent and independent margins. The Bayes factor is asymptotically consistent with mutual information, but provides a more reliable estimate with limited sample sizes (see e.g. Sinkkonen et al., 2005). The standard information bottleneck and discriminative clustering are asymmetric methods that treat one of the data sources as the primary target of analysis.
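The mutual-information ingredient of the IB cost can be computed directly from a contingency table; a minimal sketch:

```python
import numpy as np

def mutual_information(counts):
    """Empirical mutual information (in nats) of a contingency table."""
    P = counts / counts.sum()
    px = P.sum(axis=1, keepdims=True)   # row margins
    py = P.sum(axis=0, keepdims=True)   # column margins
    nz = P > 0                          # skip empty cells (0 log 0 = 0)
    return float((P[nz] * np.log(P[nz] / (px @ py)[nz])).sum())

# Independent margins carry zero information; a diagonal table is maximally
# dependent for its size (log 2 nats for a 2x2 table).
print(mutual_information(np.array([[10., 10.], [10., 10.]])))  # 0.0
print(mutual_information(np.array([[20., 0.], [0., 20.]])))    # log 2 ~ 0.693
```

This also illustrates the bias issue noted later in the text: with small counts the plug-in estimate above systematically overestimates the true mutual information, which motivates the Bayes factor used in DC and AC.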
In contrast, the dependency maximization approaches considered in this thesis, the associative clustering (AC) and regularized versions of canonical correlation analysis, are symmetric and they operate exclusively on continuous-valued data. CCA is not based on contingency table analysis, but it has close connections to the Gaussian IB (Chechik et al., 2005) that seeks maximal dependency between two sets of normally distributed variables. The Gaussian IB retrieves the same subspace as CCA for one of the data sets. However, in contrast to the symmetric CCA model, Gaussian IB is a directed method that finds dependency-maximizing projections for only one of the two data sets. The second dependency detection approach considered in this thesis, the associative clustering, is particularly related to the symmetric IB that finds two sets of clusters, one for each variable, which are optimally compressed representations of the original data, and at the same time maximally informative of each other (Friedman et al., 2001). While the objective function in IB is derived from mutual information, AC uses the Bayes factor as an objective function in a similar manner as it is used in the asymmetric discriminative clustering. Another key difference is that while the symmetric IB operates on discrete data, AC employs contingency table analysis in order to discover cluster structure in continuous-valued data spaces.
Standard unsupervised methods for dependency detection, such as canonical correlation analysis or the symmetric information bottleneck, seek maximal dependency between two data sets with minimal assumptions about the dependencies. The unconstrained models involve high degrees of freedom when applied to high-dimensional genomic observations. Such flexibility can easily lead to overfitting, a problem that is even more severe for flexible nonparametric or nonlinear, kernel-based dependency discovery methods. Several ways to regularize the solution have been suggested to overcome the associated problems, for instance by imposing sparsity constraints on the solution space (Bie and Moor, 2003; Vinod, 1976).
In many applications prior information of the dependencies is available, or particular types of dependency are relevant for the analysis task. Such prior information can be used to reduce the degrees of freedom in the model, and to regularize dependency detection. In the cancer gene discovery application of Publication 4, DNA mutations are systematically correlated with transcriptional activity of the genes within the affected region, and identification of such regions is a biomedically relevant research task. Prior knowledge of chromosomal distances between the observations can improve the detection of the relevant spatial dependencies. However, principled approaches to incorporate such prior information in dependency modeling have been missing. Publication 4 introduces regularized models for dependency detection based on classical canonical correlation analysis (Hotelling, 1936) and its probabilistic formulation (Bach and Jordan, 2005). The models are extended by incorporating appropriate prior terms, which are then used to reduce the degrees of freedom based on prior biological knowledge.
In order to introduce the regularized dependency detection framework of Publication 4, let us start by considering regularization of the classical correlation-based CCA. This searches for arbitrary linear projection vectors w_x, w_y that maximize the correlation between the projections X w_x and Y w_y of the two data sets. Multiple projection components can be obtained iteratively, by finding the most correlated projections first, and then consecutively the next ones after removing the dependencies explained by the previous CCA components. The procedure identifies maximally dependent linear subspaces of the investigated data sets. To regularize the solution, Publication 4 couples the projections through a transformation matrix T in such a way that w_y = T w_x. With a completely unconstrained T the model reduces to the classical unconstrained CCA; suitable constraints on T can be used to regularize dependency detection.
To enforce regularization one could, for instance, prefer solutions for T that are close to a given transformation matrix, or impose more general constraints on the structure of the transformation matrix that prefer particular rotational or other linear relationships. Suitable constraints depend on the particular application; the solutions can be made to prefer particular types of dependency in a soft manner through appropriate penalty terms. In Publication 4 the completely unconstrained CCA model has been compared with a fully regularized model with T = I; this encodes the biological assumption that probes with small chromosomal distances tend to capture more similar signal between gene expression and copy number measurements than probes with a larger chromosomal distance; the projection vectors characterize this relationship, and are therefore expected to have similar form, w_x ~ w_y. Utilization of other, more general constraints in related data integration tasks provides a promising topic for future studies.
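The coupled-projection idea can be sketched numerically. The toy example below (synthetic data, hypothetical variable names, not the model of Publication 4) optimizes a single shared projection w, corresponding to the fully regularized case T = I where both data sets use the same projection vector:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Synthetic stand-in for matched expression / copy-number profiles:
# both data sets carry a shared one-dimensional signal z plus noise.
n, d = 200, 4
z = rng.normal(size=(n, 1))
X = z @ rng.normal(size=(1, d)) + 0.3 * rng.normal(size=(n, d))
Y = z @ rng.normal(size=(1, d)) + 0.3 * rng.normal(size=(n, d))

def neg_corr(w):
    """Fully regularized case T = I: both data sets share the projection w."""
    a, b = X @ w, Y @ w
    a, b = a - a.mean(), b - b.mean()
    return -abs(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

res = minimize(neg_corr, rng.normal(size=d))
print(-res.fun)  # correlation achieved with the shared projection
```

Relaxing T away from the identity would add parameters that let the two projections differ, recovering ordinary CCA in the fully unconstrained limit.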
The correlation-based treatment provides an intuitive and easily implementable formulation for regularized dependency detection. However, it lacks an explicit model for the shared and data-specific effects, and it is likely that some of the dataset-specific effects are captured by the correlation-maximizing projections. This is suboptimal for characterizing the shared effects, and motivates the probabilistic treatment.
The probabilistic approach for regularized dependency detection in Publication 4 is based on an explicit model of the data-generating process, formulated in Equation (6.1). In this model, the transformation matrices W_x, W_y specify how the shared latent variable z is manifested in the two data sets X, Y, respectively. In the standard model, the relationship between the transformation matrices is not constrained, and the algorithm searches for arbitrary linear transformations that maximize the likelihood of the observations in Equation (6.2). The probabilistic formulation opens up possibilities to guide dependency search through Bayesian priors.
In Publication 4, the standard probabilistic CCA model is extended by incorporating additional prior terms that regularize the relationship between the transformation matrices by reparameterizing them as W_y = T W_x, and setting a prior on T. The treatment is analogous to the correlation-based variant, but now the transformation matrices operate on the latent components, rather than the observations. This allows the shared and dataset-specific effects to be distinguished more explicitly in the model. The task is then to learn the optimal parameter matrix T, given the constraint W_y = T W_x. Bayes' rule gives the model posterior p(W, T | X, Y) proportional to p(X, Y | W, T) p(W, T), where W denotes the transformation matrices.
The likelihood term p(X, Y | W, T) can be calculated based on the model in Equation (6.1). This defines the objective function for standard probabilistic CCA, which implicitly assumes a flat prior for the model parameters. The formulation in Equation (6.3) makes the choice of the prior explicit, allowing modifications of the prior term. To obtain a tractable prior, let us assume that the prior factorizes over the model parameters. The joint prior of the transformation matrices can be further decomposed as p(W) p(T), assuming independent priors for W and T. A convenient and tractable prior for T is provided by the matrix normal distribution p(T) proportional to exp( -tr[ S_r^{-1} (T - M) S_c^{-1} (T - M)^T ] / 2 ), where M is the mean matrix, and S_r and S_c denote the row and column covariances, respectively.
For computational simplicity, let us assume independent rows and columns, with S_r = sigma^2 I and S_c = I. The mean matrix M can be used to emphasize certain types of dependency between X and Y. Assuming uninformative, flat priors for the remaining model parameters, as in the standard probabilistic CCA model, and denoting the negative log-likelihood of the standard model by L, the negative log-posterior of the model is, up to an additive constant, L + ||T - M||^2 / (2 sigma^2), where ||.|| denotes the Frobenius norm.
This is the objective function to minimize. Note that it has the same form as the objective function of standard probabilistic CCA, except for the additional penalty term arising from the prior on T. This yields the cost function employed in Publication 4. In our cancer gene discovery application the choice M = I for the prior mean is used to encode the biological prior constraint that observations with a small chromosomal distance should on average show similar responses in the integrated data sets, i.e., W_x ~ W_y. The regularization strength can be tuned with the prior scale sigma. A fully regularized model is obtained in the limit sigma -> 0, which forces T to the prior mean. When sigma -> infinity, the prior on T becomes flat, W_x and W_y become a priori independent, and the ordinary probabilistic CCA is recovered. Intermediate values of sigma regularize the solution between these two extremes. Note that it is also possible to incorporate other types of prior information concerning the dependencies into the model through M.
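As a sketch of how the prior penalizes deviations from the mean matrix, the following illustrates the Frobenius-norm penalty implied by the matrix normal prior under the simplifying covariance assumptions above; the matrices are toy values, not thesis results:

```python
import numpy as np

def matrix_normal_penalty(T, M, sigma):
    """Negative log prior (up to an additive constant) for T under a matrix
    normal prior with mean M, row covariance sigma**2 * I and identity
    column covariance: ||T - M||_F^2 / (2 * sigma**2)."""
    return float(np.sum((T - M) ** 2) / (2.0 * sigma ** 2))

M = np.eye(2)                                # prior mean, here M = I
T_close = np.eye(2) + 0.01                   # nearly satisfies W_x ~ W_y
T_far = np.array([[2.0, -1.0], [0.5, 3.0]])  # far from the prior mean

# Small sigma pulls T strongly towards M; large sigma gives a nearly flat
# prior, recovering ordinary probabilistic CCA in the limit.
print(matrix_normal_penalty(T_close, M, 0.1))
print(matrix_normal_penalty(T_far, M, 0.1))
print(matrix_normal_penalty(T_far, M, 100.0))
```

In the full model this penalty is simply added to the standard probabilistic CCA negative log-likelihood during optimization.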
The model parameters W and T are estimated with the EM algorithm. The regularized version is not analytically tractable with respect to T in the general case, but can be optimized with standard gradient-based optimization techniques. Special cases of the model have analytical solutions, which can speed up the model fitting procedure. In particular, the fully regularized and unconstrained models, obtained in the limits sigma -> 0 and sigma -> infinity, respectively, have closed-form solutions. Note that the current formulation assumes that the regularization parameters are defined prior to the analysis. Alternatively, these parameters could be optimized based on external criteria, such as cancer gene detection performance in our application, or learned from the data; in a fully Bayesian treatment these parameters would be treated as latent variables. Incorporation of additional prior information about the dataset-specific effects, through priors on the corresponding covariance matrices, provides a promising line for further work.
The regularized models provide a principled framework for studying associations between transcriptional activity and other regulatory layers of the genome. In Publication 4, the models are used to investigate cancer mechanisms. DNA copy number changes are a key mechanism for cancer, and integration of copy number information with mRNA expression measurements can reveal functional effects of the mutations. While causation may be difficult to grasp, study of the dependencies can help to identify functionally active mutations, and to provide candidate biomarkers with potential diagnostic, prognostic and clinical impact in cancer studies.
The modeling task in the cancer gene discovery application of Publication 4 is to identify chromosomal regions that show exceptionally high levels of dependency between gene copy number and transcriptional levels. The model is used to detect dependency within local chromosomal regions, which are then compared in order to identify the exceptional regions. The dependency is quantified within a given region by comparing the strength of the shared and dataset-specific signal. High scores indicate regions where the shared signal is particularly high relative to the dataset-specific effects. A sliding-window approach is used to screen the genome for dependencies. The regions are defined by the closest probes around each gene, so that the dimensionality of the models stays constant; this allows direct comparison of the dependency measures between the regions without additional adjustment terms that would otherwise be needed to compensate for differences in model complexity.
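The sliding-window screen can be sketched roughly as follows. This toy example uses synthetic data and the largest singular value of the window-wise cross-correlation as a stand-in dependency score; the actual score in Publication 4 is model-based, so treat everything here as an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(1)

n_samples, n_probes, k = 50, 60, 5   # k = window width in probes (illustrative)
expr = rng.normal(size=(n_samples, n_probes))   # mock expression matrix
cn = rng.normal(size=(n_samples, n_probes))     # mock copy-number matrix
# Plant a shared signal (an "amplified region") in probes 30..34.
shared = rng.normal(size=(n_samples, 1))
expr[:, 30:35] += shared
cn[:, 30:35] += shared

def window_score(i):
    """Dependency score for the window centred at probe i: the largest
    singular value of the cross-correlation between the two windows."""
    sl = slice(max(0, i - k // 2), i + k // 2 + 1)
    A = (expr[:, sl] - expr[:, sl].mean(0)) / expr[:, sl].std(0)
    B = (cn[:, sl] - cn[:, sl].mean(0)) / cn[:, sl].std(0)
    return np.linalg.svd(A.T @ B / n_samples, compute_uv=False)[0]

scores = np.array([window_score(i) for i in range(n_probes)])
print(int(scores.argmax()))  # expected to fall inside the planted region
```

Because every window has the same dimensionality, the scores are directly comparable across the genome, mirroring the design choice described above.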
Prior information about the dependencies is used to regularize cancer gene detection. Chromosomal gains and losses are likely to be positively correlated with the expression levels of the affected genes within the same chromosomal region or its close proximity; a copy number gain is likely to increase the expression of the associated genes, whereas a deletion will block their expression. This prior information is encoded in the model by setting M = I in the prior term p(T). This accounts for the expected positive correlations between gene expression and copy number within the investigated chromosomal region. Regularization based on such prior information is shown to improve cancer gene detection performance in Publication 4, where the regularized variants outperformed the unconstrained models.
A genome-wide screen of 51 gastric cancer patients (Myllykangas et al., 2008) reveals clear associations between DNA copy number changes and transcriptional activity. Figure 6.2 illustrates dependency detection on chromosome arm 17q, where the regularized model reveals high dependency between the two data sources in a known cancer-associated region. The regularized and unconstrained models were compared in terms of receiver operating characteristics, calculated by comparing the ordered gene list from the dependency screen to an expert-curated list of known genes associated with gastric cancer (Myllykangas et al., 2008). A large proportion of the most significant findings in the whole-genome analysis were known cancer genes; the remaining findings with no known associations to gastric cancer are promising candidates for further study.
Biomedical interpretation of the model parameters is also straightforward. An ML estimate of the latent variable z characterizes the strength of the shared signal between DNA mutations and transcriptional activity for each patient. This allows robust identification of small, potentially unknown patient subgroups with shared amplification effects, which would remain potentially undetected when comparing patient groups defined based on existing clinical annotations. The parameters in W can down-weight signal from poorly performing probes in each data set, or from probes that measure genes whose transcriptional levels are not functionally affected by the copy number change. This provides tools to distinguish so-called driver mutations, which have functional effects, from less active passenger mutations, an important task in cancer studies. On the other hand, the model can combine statistical power across adjacent measurement probes, and it captures the strongest shared signal in the two sets of observations. This is useful since gene expression and copy number data are typically characterized by high levels of biological and measurement variation and small sample sizes.
Integration of chromosomal aberrations and transcriptional activity is an actively studied data integration task in functional genomics. The first studies with standard statistical tests were carried out by Hyman et al. (2002) and Phillips et al. (2001) when simultaneous genome-wide observations of the two data sources had become available. The modeling approaches utilized in this context can be roughly classified into regression-based, correlation-based and latent variable approaches. The regression-based models (Adler et al., 2006; Bicciato et al., 2009; van Wieringen and van de Wiel, 2009) characterize alterations in gene expression levels based on copy number observations with multivariate regression or closely related models. The correlation-based approaches (González et al., 2009; Schäfer et al., 2009; Soneson et al., 2010) provide symmetric models for dependency detection, based on correlation and related statistical measures. Many of these methods also regularize the solutions, typically through sparsity constraints and non-negativity of the projections (Lê Cao et al., 2009; Waaijenborg et al., 2008; Witten et al., 2009; Parkhomenko et al., 2009). The correlation-based model in Publication 4 introduces a complementary approach to regularization that constrains the relationship between the subspaces where the correlations are estimated. The latent variable models by Berger et al. (2006); Shen et al. (2009); Vaske et al. (2010), and Publication 4 are based on explicit modeling assumptions concerning the data-generating processes. The iCluster algorithm (Shen et al., 2009) is closely related to the latent variable model considered in Publication 4. While our model detects continuous dependencies, iCluster uses a discrete latent variable to partition the samples into distinct subgroups. The iCluster model is regularized by sparsity constraints on the transformation matrices, while we tune the relationship between W_x and W_y.
Moreover, the model in Publication 4 utilizes full covariance matrices to model the dataset-specific effects, whereas iCluster uses diagonal covariances. The more detailed model for dataset-specific effects should help to distinguish the shared signal more accurately. Other latent variable approaches include the iterative method based on generalized singular-value decomposition (Berger et al., 2006), and the probabilistic factor graph model PARADIGM (Vaske et al., 2010), which additionally utilizes pathway topology information in the modeling.
Experimental comparison between the related integrative approaches can be problematic since they target related, but different research questions where the biological ground truth is often unknown. For instance, some methods utilize patient class information in order to detect class-specific alterations (Schäfer et al., 2009), other methods perform de novo class discovery (Shen et al., 2009), provide tools for gene prioritization (Salari et al., 2010), or guide the analysis with additional functional information of the genes (Vaske et al., 2010). The algorithms introduced in Publication 4 are particularly useful for gene prioritization and class discovery purposes, where the target is to identify the most promising cancer gene candidates for further validation, or to detect potentially novel cancer subtypes. However, while an increasing number of methods are released as conveniently accessible algorithmic tools (Salari et al., 2010; Shen et al., 2009; Schäfer et al., 2009; Witten et al., 2009), implementations of most models are not available for comparison purposes. Open source implementations of the dependency detection algorithms developed in this thesis have been released to enhance transparency and reproducibility of the computational experiments and to encourage further use of these models (Huovilainen and Lahti, 2010).
Functions of human genes are often studied indirectly, by studying model organisms such as the mouse (Davis, 2004; Joyce and Palsson, 2006). Orthologs are genes in different species that originate from a single gene in the last common ancestor of these species. Such genes have often retained identical biological roles in the present-day organisms, and are likely to share the function (Fitch, 1970). Mutations in the genomic DNA sequence are a key mechanism in evolution. Consequently, DNA sequence similarity can provide hypotheses of gene function in poorly annotated species. An exceptional level of conservation may highlight critical physiological similarities between species, whereas divergence can indicate significant evolutionary changes (Jordan et al., 2005). Investigating evolutionary conservation and divergence will potentially lead to a deeper understanding of what makes each species unique. Evolutionary changes primarily target the structure and sequence of genomic DNA. However, not all changes will lead to phenotypic differences. On the other hand, sequence similarity is not a guarantee of functional similarity because small changes in DNA can potentially have remarkable functional implications.
Therefore, in addition to investigating structural conservation of the genes at the sequence level, another level of investigation is needed to study functional conservation of the genes and their regulation, which is reflected in the transcriptome (Jiménez et al., 2002; Jordan et al., 2005). Transcriptional regulation of the genes is a key regulatory mechanism that can have remarkable phenotypic consequences in highly modular cell-biological systems (Hartwell et al., 1999), even when the function of the regulated genes themselves remains intact.
Systematic comparison of transcriptional activity between different species would provide a straightforward strategy for investigating conservation of gene regulation (Bergmann et al., 2004; Enard et al., 2002; Zhou and Gibson, 2004). However, direct comparison of individual genes between species may not be optimal for discovering subtle and complex dependency structures. The associative clustering (AC) principle, introduced in Publications 5-6, provides a framework for detecting groups of orthologous genes with exceptional levels of conservation and divergence in transcriptional activity between two species. While standard dependency detection methods for continuous data, such as the generalized singular value decomposition (see e.g. Alter et al., 2003) or canonical correlation analysis (Hotelling, 1936), detect global linear dependencies between observations, AC searches for dependent, local groupings in the two data sets. The model is free of particular distributional assumptions about the data, which helps to allocate modeling resources to detecting dependent subgroups when variation within each group is less relevant for the analysis. The remainder of this section provides an overview of the associative clustering principle and its application to studying evolutionary divergence between species.
The principle of associative clustering (AC) is illustrated in Figure 6.3. AC performs simultaneous clustering of two data sets to reveal maximally dependent cluster structure between two sets of observations. The clusters are defined in each data space by a Voronoi parameterization, in which each cluster is defined by its centroid; this produces connected, internally homogeneous clusters. A given data point x is assigned to the cluster corresponding to the nearest centroid in the feature space with respect to a given distance measure d, i.e., to cluster i if d(x, m_i) <= d(x, m_j) for all j, where the m denote the centroids. This divides the space into non-overlapping Voronoi regions. The regions define a clustering for all points of the data space. The association between the clusters of the two data sets can be represented on a contingency table, where the rows and columns correspond to clusters in the first and second data set, respectively. The clusters in each data set are called margin clusters. Each pair of co-occurring observations maps to one margin cluster in each data set, and each contingency table cell corresponds to a pair of margin clusters. These are called cross clusters.
AC searches for a maximally dependent cluster structure by optimizing the Voronoi centroids in the two data spaces in such a way that the dependency between the contingency table margins is maximized. Let us denote the number of samples in cross cluster (i, j) by n_ij. The corresponding margin cluster counts are the row and column sums n_i and n_j. The observed sample frequencies over the contingency table margins and cross clusters are assumed to follow multinomial distributions with latent frequency parameters. Assuming the model of independent margin clusters, the expected sample frequency in each cross cluster is given by the outer product of the margin cluster frequencies. The model of dependent margin clusters deviates from this assumption. The Bayes factor (BF) is used to compare the two hypotheses of dependent and independent margins. This is a rigorously justified approach for model comparison, which indicates whether the observations provide superior evidence for either model. The evidence is calculated by marginalizing over all potential values of the latent frequencies. In a standard setting, the Bayes factor would be used to compare evidence between the dependent and independent margin cluster models for a given clustering solution. AC uses the Bayes factor in a non-standard manner, as an objective function: the cluster centroids in each data space are optimized to maximize it, since the centroids define the margin clusters and consequently the margin cluster dependencies.
The centroids are optimized with a conjugate-gradient algorithm after smoothing the cluster borders with a continuous parameterization. The hyperparameters of the objective function arise from the Dirichlet priors of the two multinomial models of independent and dependent margins, respectively. Setting the hyperparameters to unity yields the classical hypergeometric measure of contingency table dependency (Fisher, 1934; Yates, 1934). With large sample sizes, the logarithmic Bayes factor approaches mutual information (Sinkkonen et al., 2005). The Bayes factor is a desirable choice especially with a limited sample size, since marginalization over the latent variables makes it robust against uncertainty in the parameter values, and because finite contingency table counts would give a biased estimate of mutual information. The number of clusters in each data space is specified in advance, typically based on the desired level of resolution. Nonparametric extensions, where the number of margin clusters would be inferred automatically from the data, form one potential topic for further studies; a closely related approach was recently proposed in Rogers et al. (2010).
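The contingency-table Bayes factor can be illustrated with Dirichlet-multinomial marginal likelihoods. The sketch below uses a single symmetric hyperparameter alpha for all priors, which is a simplification of the published objective:

```python
import numpy as np
from scipy.special import gammaln

def log_dirichlet_multinomial(counts, alpha):
    """Log marginal likelihood of multinomial counts under a symmetric
    Dirichlet(alpha) prior, with the latent frequencies integrated out."""
    counts = np.asarray(counts, dtype=float).ravel()
    a = np.full_like(counts, alpha)
    return (gammaln(a.sum()) - gammaln((a + counts).sum())
            + gammaln(a + counts).sum() - gammaln(a).sum())

def log_bayes_factor(table, alpha=1.0):
    """Log Bayes factor of dependent vs. independent contingency-table
    margins (multinomial coefficients cancel between the two models)."""
    table = np.asarray(table, dtype=float)
    dep = log_dirichlet_multinomial(table, alpha)
    indep = (log_dirichlet_multinomial(table.sum(axis=1), alpha)
             + log_dirichlet_multinomial(table.sum(axis=0), alpha))
    return dep - indep

print(log_bayes_factor([[20, 0], [0, 20]]))   # strong dependency: positive
print(log_bayes_factor([[10, 10], [10, 10]])) # independent margins: negative
```

In AC this quantity is not merely evaluated for a fixed table: the Voronoi centroids are moved so that the resulting table maximizes it.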
Publication 6 introduces an additional, bootstrap-based procedure to assess the reliability of the findings (Figure 6.3). The analysis is repeated with similar, but not identical training data sets obtained by sampling the original data with replacement. The most frequently detected dependencies are then investigated more closely. The analysis will emphasize findings that are not sensitive to small variations in the observed data.
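The bootstrap idea can be sketched as follows; for simplicity this toy example uses a plain correlation as the dependency statistic instead of the full AC pipeline, and all data and thresholds are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

n = 100
x = rng.normal(size=n)
y = 0.7 * x + rng.normal(size=n)   # genuinely dependent pair
z = rng.normal(size=n)             # independent of x

def detection_rate(a, b, n_boot=500, threshold=0.3):
    """Fraction of bootstrap resamples in which the dependency statistic
    (here |correlation|) exceeds the detection threshold."""
    hits = 0
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample pairs with replacement
        r = np.corrcoef(a[idx], b[idx])[0, 1]
        hits += abs(r) > threshold
    return hits / n_boot

rate_dep = detection_rate(x, y)
rate_indep = detection_rate(x, z)
print(rate_dep, rate_indep)
```

A dependency that survives most resamples is robust to small perturbations of the data, whereas a finding that appears only in a few resamples is likely an artifact of the particular sample.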
Associative clustering was compared with two alternative methods: standard K-means on each of the two data sets, and a combination of K-means and the information bottleneck (K-IB). K-means (see e.g. Bishop, 2006) is a classical clustering algorithm that provides homogeneous, connected clusters based on Voronoi parameterization. Homogeneity is desirable for interpretation, since the data points within a given cluster can then be conveniently summarized by the cluster centroid. On the other hand, K-means considers each data set independently, which is suboptimal for the dependency modeling task. The two sets of clusters obtained by K-means, one for each data space, can then be presented on a contingency table as in associative clustering. The second comparison method is K-IB, introduced in Publication 5. K-IB uses K-means to partition the two co-occurring, continuous-valued data sets into discrete atomic regions, where each data point is assigned to its own singleton cluster. This gives two sets of atomic clusters that are mapped on a large contingency table, filled with the frequencies of the co-occurring data pairs. The table is then compressed to the desired size by aggregating the margin clusters with the symmetric IB algorithm in order to maximize the dependency between the contingency table margins (Friedman et al., 2001). Aggregating the atomic clusters provides a flexible clustering approach, but the resulting clusters are not necessarily homogeneous and are therefore difficult to interpret.
AC compared favorably to the other methods. While AC outperformed the standard K-means in dependency modeling, the cluster homogeneity was not significantly reduced in AC. The cross clusters from K-IB (Sinkkonen et al., 2003) were more dependent than in AC. On the other hand, AC produced more easily interpretable localized clusters, as measured by the sum of intra-cluster variances in Publication 6. Homogeneity makes it possible to summarize clusters conveniently, for instance by using the mean expression profiles of the cluster samples, as in Figure 6.4B. While K-means searches for maximally homogeneous clusters and K-IB searches for maximally dependent clusters, AC finds a successful compromise between the goals of dependency and homogeneity.
Associative clustering is used in Publications 5 and 6 to investigate conservation and divergence of transcriptional activity of 2818 orthologous human-mouse gene pairs across an organism-wide collection of transcriptional profiling data covering 46 and 45 tissue types in human and mouse, respectively (Su et al., 2002). AC takes as input two gene expression matrices with orthologous genes, one for each species, and returns a dependency-maximizing clustering for the orthologous gene pairs. Interpretation of the results focuses on unexpectedly large or small cross clusters revealed by the contingency table analysis of associative clustering. Compared to plain correlation-based comparisons between the gene expression profiles, AC can reveal additional cluster structure, where genes with similar expression profiles are clustered together, and associations between the two species are investigated at the level of such detected gene groups. The dependency between each pair of margin clusters can be characterized by comparing the respective margin cluster centroids that provide a compact summary of the samples within each cluster.
Biological interpretation of the findings, based on enrichment of Gene Ontology (GO) categories (Ashburner et al., 2000), revealed genes with strongly conserved and potentially diverged transcriptional activity. The most highly enriched categories were associated with ribosomal functions, the high conservation of which has also been suggested in earlier studies (Jiménez et al., 2002); ribosomal genes often require coordinated effort of a large group of genes, and they function in cell maintenance tasks that are critical for species survival. An exceptional level of conservation was also observed in a group of testis-specific genes, yielding novel functional hypotheses for certain poorly annotated genes within the same cross-cluster (Figure 6.4). Transcriptional divergence, on the other hand, was detected for instance in genes related to embryonic development.
While general-purpose dependency exploration tools may not be optimal for studying the specific issue of transcriptional conservation, such tools can reveal dependency with minimal prior knowledge about the data. This is useful in functional genomics experiments where little prior knowledge is available. In Publications 5 and 6, associative clustering has been additionally applied in investigating dependencies between transcriptional activity and transcription factor binding, another key regulatory mechanism of the genes.
The models introduced in Publications 4-6 provide general exploratory tools for the discovery and analysis of statistical dependencies between co-occurring data sources, and tools to guide modeling through Bayesian priors. In particular, the models consider linear dependencies (Publication 4) and cluster-based dependency structures (Publications 5-6) between the data sources. The models are readily applicable to data integration tasks in functional genomics. In particular, the models have been applied to investigate dependencies between chromosomal mutations and transcriptional activity in cancer, and evolutionary divergence of transcriptional activity between human and mouse. Biomedical studies provide a number of other potential applications for such general-purpose methods. An increasing number of co-occurring observations across the various regulatory layers of the genome are available concerning epigenetic mechanisms, micro-RNAs, polymorphisms and other genomic features (The Cancer Genome Atlas Research Network, 2008). Simultaneous observations provide a valuable resource for investigating the functional properties that emerge from the interactions between the different layers of genomic information. An open source implementation in BioConductor (http://www.bioconductor.org/packages/release/bioc/html/pint.html) provides accessible computational tools for related data integration tasks, helping to guarantee the utility of the developed models for the computational biology community.
Francis R. Bach and Michael I. Jordan. Kernel independent component analysis. Journal of Machine Learning Research, 3:1–48, 2002.
Nir Friedman, Ori Mosenzon, Noam Slonim, and Naftali Tishby. Multivariate information bottleneck. In Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence (UAI), pages 152–161. Morgan Kaufmann Publishers, San Francisco, CA, 2001.
Automated discovery of functional generality of human gene expression programs. PLoS Computational Biology, 3:e148, 2007.