Machine Learning for Genomic Data

11/15/2021
by   Akankshita Dash, et al.

Acknowledgements

First and foremost, I would like to express my utmost gratitude to my supervisors; Assoc. Professor Vincent Tan, for introducing me to machine learning in MA4270, and Dr. Huili Guo for breaking down complex bioinformatics concepts into simple ones for me to understand. I am extremely grateful for their guidance during the entire period of my honours project. Throughout the year, they have been understanding of my strengths and weaknesses and tolerant of my mistakes, guiding me along when I hit roadblocks, all the while suggesting new ideas and patiently helping me clarify my doubts. I have learnt so much from both of them, both about academia and life, and without their constant guidance, this thesis would not have been possible. I couldn’t have asked for better mentors. Thank you.

I would also like to thank the people in Dr. Huili Guo’s lab, especially Indrik and Shawn, for being so helpful when I was struggling with certain concepts. I also express my appreciation to Zhao Yue for being a helpful companion on this journey, and wish him well for his FYP.

To my friends - Spatika, Srishti, Megha, Bushra and Kai Ting - thank you for your support and encouragement. You have listened to my never-ending rants, put up with my idiosyncrasies, seen me at my lowest points and yet still stuck by me.

Last but not least, I would like to thank my family. Despite living so far away, you have always managed to be there for me. I’m so grateful for your support and encouragement throughout my undergraduate years. To my sister and brother-in-law; despite your busy lives and the personal tragedy you faced last year, you continued lending a listening ear to my insignificant complaints. To my Mom and Dad - you have always been my pillars of support. You have never pressured me and always encouraged me to do my best. From sending me home-cooked meals, visiting me in Singapore when I couldn’t make it home, and offering me life advice when I was at crossroads, you have helped me up when I felt like I was on the edge of failure. Thank you so much.

Akankshita Dash

Abstract

This report explores the application of machine learning techniques on short time-series gene expression data. Although standard machine learning algorithms work well on longer time series, they often fail to find meaningful insights when there are few timepoints.

In this report, we explore model-based clustering techniques. We combine popular unsupervised learning techniques - K-Means, Gaussian Mixture Models, Bayesian Networks and Hidden Markov Models - with the well-known Expectation-Maximisation algorithm. K-Means and Gaussian Mixture Models are fairly standard, while Hidden Markov Model and Bayesian Network clustering are more novel ideas that suit time-series gene expression data.


1 Introduction

Machine learning is a field of artificial intelligence that uses statistical techniques to give a computer the ability to learn from data, without explicitly programming it to do so.

Genomics is a branch of molecular biology concerned with the structure, function, evolution, and mapping of an organism’s complete set of DNA.

In this project, we apply machine learning techniques to viral genomic data recovered from human cells, and attempt to derive useful insights about the EV-A71 virus, which causes hand, foot, and mouth disease (HFMD).

1.1 Machine Learning

With the large volume of data available nowadays, it is impossible to apply statistical techniques and derive useful insights without making use of a computer.

Considered one of the biggest innovations since the microchip, machine learning enables humans to build models that capture patterns within data, and to define simple rules to improve the performance of these models. The applications of machine learning are manifold: from spam filters in e-mail to online banking, it is omnipresent, and its use cases continue to grow.

A machine learning model can be defined by its training data and its set of hypothesis functions.

Machine learning is broadly divided into the following 3 categories:-

In supervised learning, the training data are labelled, i.e. each data point is attached to its true value, or ground truth. Typical problems include classification (binary/multi-class) and regression. E.g. given a training dataset consisting of pictures of cats and dogs, our model should accurately predict whether a new image is that of a cat or a dog; this is a classification problem. Another simple example is a dataset of people’s heights and weights: given a new person’s height, we should be able to accurately predict their weight; this is a regression problem.

In unsupervised learning, unlike supervised learning, the training data are unlabelled. The number of labels may or may not be specified beforehand. Unsupervised learning is commonly used as an initial exploratory technique to draw useful patterns from the dataset. E.g. given many different books, grouping them into different categories, or clusters, can help us identify the genres they belong to.

Reinforcement learning is different from both supervised and unsupervised learning. It focuses on maximising the reward obtainable from the data by balancing the trade-off between exploration and exploitation. E.g. in recommendation engines, the items shown to a particular user may be ordered based on how much they liked previous products or how different the items are from previous products. This is done to maximise the number of clicks, i.e. the reward.

Since the data that we have in this project are unlabelled, our focus is on unsupervised learning.

1.2 Genomics

Genomes contain the complete set of genes present in an organism, and genomics describes how this genetic information is stored and interpreted in the cell.

1.2.1 Key Definitions and Terms

  • Nucleobases: Components of nucleotides, which are the building blocks of DNA - Adenine (A), Thymine (T), Cytosine (C), Guanine (G). Additionally, Uracil (U) is used instead of Thymine (T) in RNA.

  • Codons: A sequence of three DNA or RNA nucleotides that corresponds with a specific amino acid or stop signal during protein synthesis. For eukaryotes, there are a total of 64 such codons, which map to 20 amino acids (and stop signals) through the genetic code.

  • Amino acid: Organic molecules that make up the building blocks of proteins. There are 20 canonical amino acids.

  • DNA: Deoxyribonucleic acid, a double-helix molecule that contains the genetic material of most organisms on Earth (some viruses, including EV-A71, instead carry their genetic material as RNA).

  • RNA: Ribonucleic acid, a single-stranded polymeric molecule that is essential in various biological roles in the coding, decoding, regulation and expression of genes. Biologically active RNAs, including transfer, ribosomal and small nuclear RNA (tRNA, rRNA, snRNA), fold into unique structures guided by complementary pairing between nucleotide bases. Cellular organisms use messenger RNA (mRNA) to direct the synthesis of specific proteins.

  • Protein: Sequences of amino acids that carry out the majority of cellular functions such as motility, DNA regulation, and replication.

  • Transcription: The process in which a DNA sequence of a gene is rewritten, or transcribed, to make an RNA molecule. (In eukaryotes, the RNA polymerase makes a strand of mRNA from DNA).

  • Translation: The sequence of the mRNA is decoded, or translated, to specify an amino acid sequence (i.e. protein).

  • Gene: A region of DNA (deoxyribonucleic acid) coding either for the messenger RNA encoding the amino acid sequence in a polypeptide chain or for a functional RNA molecule.

  • Chromosome: Cellular structures that contain genes. A chromosome comprises a single DNA molecule that may be either circular or linear.

  • Virus: A small infectious agent that replicates only inside the living cells of other organisms.

  • Gene Expression: The generation of a functional gene product from the information encoded by a gene, through the processes of transcription and translation. Gene products are often proteins; non-protein-coding genes can instead encode functional RNA, including ribosomal RNA (rRNA), transfer RNA (tRNA) or small nuclear RNA (snRNA).

1.2.2 Central Dogma

Figure 1: Central dogma of Molecular Biology

The central dogma of biology states that: DNA is transcribed by RNA polymerases into mRNA (messenger RNA), which is read by ribosomes to generate protein through translation. [crick1970central]

In this project, our focus is on translation. More formally, we will be studying the translational interaction between the EV-A71 virus and its host cell, by observing gene expression values at different timepoints.

1.2.3 Virus-Host Interaction

When a host gets infected with a virus, the virus hijacks the host cell’s translation machinery, which affects protein production.

In the context of EV-A71, the viral replication process begins when a virus infects its host by attaching to the host cell and penetrating the cell wall or membrane. The viral genome then hijacks the host cell’s translation mechanism, forcing it to replicate the viral genome by producing viral proteins to make new capsids (the protein shell of a virus that protects it). The viral particles are then assembled into virions. These virions burst out of the host cell during a process called lysis, killing the host cell. Some viruses take a portion of the host’s membrane during the lysis process to form an envelope around the capsid. [ScitableVirus]

After infection with any virus, overall translation of all genes decreases.

Figure 2: EV-A71 (chrE) infection over time*

*where Chromosome E (chrE) specifies the viral genes

In Figure 2, the space occupied by the virus (chrE) increases over the course of infection. Following viral replication, the new viruses may go on to infect new hosts. Many viruses, such as those causing influenza, chicken pox, AIDS, the common cold, and rabies, cause diseases in humans.

1.2.4 Hand, Foot, and Mouth Disease

Hand, foot, and mouth disease (HFMD) is a common infection caused by a group of viruses. The one we will be examining in this report is the enterovirus A71, or the EV-A71 virus. HFMD mostly affects small children, occasionally causing symptoms to develop in adults. It typically begins with a fever, which is followed by rashes and bumps on other parts of the body. Currently, there is no treatment that specifically targets the disease, nor is there a vaccine that is approved for use in Singapore [ang2009epidemiology].

Figure 3: EV-A71 life cycle
1.2.4.1 GTPases

Viruses depend on host cells for propagation and dissemination. Host cell mechanisms that allow viral entry, facilitate viral replication, and enable viral egress, are targeted for exploitation by viral pathogens [amorim2018comprehensive].

Guanosine-5’-triphosphate (GTP) is a purine nucleoside triphosphate. It is one of the building blocks needed for the synthesis of RNA during the transcription process, and is involved in energy transfer within the cell. GTPases are a class of proteins that bind and hydrolyze GTP. GTPases include Rab, Ras, Rac, Ran, Rho, Arf, and Sar proteins.

Our focus is on the proteins Rab and Rho. Emerging evidence demonstrates that viruses have evolved numerous strategies to modulate Rab proteins’ functions [amorim2018comprehensive].

In polioviruses, the 2C protein possesses weak GTPase activity. The 2C protein in EV-A71 viruses plays an important role in viral replication [wang2018divergent]. Our hypothesis is that a certain member protein of the GTPase family is important for establishing infection by EV-A71 viruses, and could be linked to Rho GTPases. We focus on translation-level changes in an attempt to find information supporting our conjecture.

1.2.5 Objective

When a virus infects a cell, most of the processes in the cell get shut down and some proteins do not get generated. There are groups of:-

  1. Genes that escape this shut down by the virus

  2. Genes that are affected by the shutdown

  3. Genes that are not influenced by the virus

Group 1 is the most important, as these are the genes that the virus needs for its own production.

Thus, our objective is to identify these groups from the set of genes that we have available. We do this by using unsupervised learning techniques on gene expression data to generate clusters.

1.3 Data Generation through Genome Sequencing Techniques

RNA sequencing refers to techniques used to determine the sequence of RNA molecules. It includes high-throughput shotgun sequencing of cDNA molecules obtained by reverse transcription from RNA, and next-generation sequencing technologies to sequence the RNA molecules within a biological sample in an effort to determine the primary sequence and relative abundance of each RNA molecule.

For this project, we primarily use gene expression data that has been generated using the following techniques:-

  • RNASeq: RNASeq uses next-generation sequencing (NGS) to reveal the presence and total quantity of RNA in a biological sample at a given moment.

  • Ribosome Profiling (RPF): RPF uses specialised messenger RNA (mRNA) sequencing to determine which mRNAs are being actively translated. Unlike RNASeq, which sequences all of the mRNA present in a sample, RPF only targets mRNA sequences that are being actively translated.

1.3.1 Methodology

We posit that host factors regulated in viral translation may be important for establishing viral infection. The infection and analysis were carried out as follows -

  1. Infect cells at various timepoints

  2. Collect lysates post infection (cold synced) at various timepoints - 0 hours post infection (h.p.i), 2 h.p.i, 4 h.p.i, 6 h.p.i, 8 h.p.i

  3. Perform clustering analysis to identify relevant genes

  4. Validate changes in genes

  5. Test for functions of genes in viral infection

2 Literature Review

2.1 Clustering Time Series Gene Expression Values

Clustering techniques are essential in the data mining process, as they reveal natural structures and identify interesting patterns in the underlying data.

For gene expression data observed at different time-points, clustering can help identify biologically relevant groups of genes and samples.

Through the principle of guilt-by-association [quackenbush2003microarrays], if we find a known cluster that contains some uncharacterised genes, we can hypothesise that these genes are significant.

As mentioned in [d2005does], there is no one-size-fits-all solution to clustering, or even a consensus on what an ideal clustering should look like. Clusters may have arbitrary shapes and sizes, and each clustering criterion imposes a certain structure on the data. If the data happen to conform to the requirements of a particular criterion, the true clusters are recovered; however, this is very rarely the case. Each algorithm imposes its own set of biases on the clusters it determines. Most (reasonable) clustering algorithms yield similar results on synthetically constructed datasets; however, in practice, they can give widely differing results on real-world, noisy gene expression data.

For temporal gene expression data, the most commonly used clustering algorithms [friedman2000using, jiang2004cluster, schliep2003using] are

  1. Model Based Clustering

    • K-Means

    • Gaussian Mixture Models

    • Hidden Markov Models

    • Bayesian Networks

  2. Self Organising Maps

For all our model-based clustering techniques, we determine partitions by using the expectation-maximisation (EM) algorithm (described in Section 2.2.1) to obtain maximum likelihood estimates of the model parameters.

In this report, we will use hierarchical clustering for exploratory analysis, and apply all the model based clustering techniques to the different datasets we are provided with.

2.2 Algorithms and Models

2.2.1 Expectation Maximisation (EM) Algorithm

The expectation maximisation (EM) algorithm is a general technique for finding maximum likelihood solutions for probabilistic models having latent variables [bishop2006pattern]. EM is used for all the model based techniques covered in this report.

2.2.1.1 Procedure

Consider a probabilistic model in which we collectively denote all the observed variables by $\mathbf{X}$ and the hidden (latent) variables by $\mathbf{Z}$. The joint distribution $p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})$ is governed by a set of parameters, denoted $\boldsymbol{\theta}$. We try to maximise the likelihood function, which is given by

$p(\mathbf{X} \mid \boldsymbol{\theta}) = \sum_{\mathbf{Z}} p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})$   (1)

If $\mathbf{Z}$ is continuous, we replace the summation with integration as appropriate. The maximum likelihood estimator is

$\hat{\boldsymbol{\theta}} = \arg\max_{\boldsymbol{\theta}} \; p(\mathbf{X} \mid \boldsymbol{\theta})$

Direct optimisation of $p(\mathbf{X} \mid \boldsymbol{\theta})$ is difficult due to the presence of latent variables, so we instead optimise the expected value of the complete-data log-likelihood under the posterior distribution of the latent variables. We are now ready to define the general EM algorithm in Algorithm 1, followed by a small numerical illustration.

Initialisation: For $l = 0$, make an initial estimate of the parameters $\boldsymbol{\theta}^{(0)}$.
E Step ($l$-th iteration): Evaluate $p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{(l)})$, i.e. find the conditional probability distribution of the latent variables given our current model $\boldsymbol{\theta}^{(l)}$ and the data $\mathbf{X}$.
M Step ($l$-th iteration): Update $\boldsymbol{\theta}$ as
$\boldsymbol{\theta}^{(l+1)} = \arg\max_{\boldsymbol{\theta}} \; Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{(l)})$   (2)
where
$Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{(l)}) = \sum_{\mathbf{Z}} p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{(l)}) \ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})$   (3)
End: Terminate when one of the following criteria is met -
  1. The maximum number of iterations specified is reached

  2. The parameter estimates stop changing

  3. The likelihood function stops changing

Algorithm 1 Expectation Maximisation
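
To make the E and M steps concrete, here is a minimal NumPy sketch of EM for a two-component, one-dimensional Gaussian mixture. The synthetic data, initialisation and tolerance are illustrative assumptions rather than anything used in this thesis.

```python
import numpy as np

def em_two_gaussians(x, n_iter=100, tol=1e-6, seed=0):
    """Minimal EM for a two-component 1-D Gaussian mixture (illustration only)."""
    rng = np.random.default_rng(seed)
    pi = np.array([0.5, 0.5])                  # initial mixing weights
    mu = rng.choice(x, size=2, replace=False)  # initial means: two random points
    var = np.array([x.var(), x.var()])         # initial variances
    prev_ll = -np.inf
    for _ in range(n_iter):
        # Component densities pi_k * N(x_n | mu_k, var_k) under current parameters
        dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        weighted = pi * dens
        ll = np.log(weighted.sum(axis=1)).sum()    # log-likelihood of the data
        if ll - prev_ll < tol:                     # stop once it stops changing
            break
        prev_ll = ll
        # E step: responsibilities gamma(z_nk)
        resp = weighted / weighted.sum(axis=1, keepdims=True)
        # M step: re-estimate weights, means and variances
        Nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / Nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
        pi = Nk / x.size
    return pi, mu, var

x = np.concatenate([np.random.normal(-2, 1, 300), np.random.normal(3, 1, 200)])
print(em_two_gaussians(x))
```

Each pass evaluates the responsibilities (E step), re-estimates the parameters from them (M step), and terminates once the log-likelihood stops improving, mirroring the criteria listed in Algorithm 1.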
2.2.1.2 Limitations

Convergence of the EM algorithm is guaranteed by the majorisation-minimisation argument [TanVYF]. However, the EM algorithm typically converges only to a local optimum. Thus, it needs to be run multiple times in order to find the best-fitting model and its parameters.

2.2.2 K-Means

K-Means is an algorithm to identify groups, or clusters, of data points in a multidimensional space [bishop2006pattern].

We group data points based on their feature similarities according to a given distance metric (Euclidean Distance in our case), by minimising a given cost function.

Given a set of data points $\{x_1, \ldots, x_N\}$ and a fixed number of clusters $K$, the algorithm works as follows [TanVYF] -

Step 0, Initialisation: For $k = 1, \ldots, K$, initialise the centres $\mu_k$ randomly.
Step $l$, Assignment (E) Step: Assign each point $x_n$ to its closest mean $\mu_k$.
Step $l$, Update (M) Step: Recompute each $\mu_k$ as the mean of the points assigned to it.
End: Terminate when the cluster assignments do not change.
Algorithm 2 K-Means Clustering

For a given partition $C = \{C_1, \ldots, C_K\}$ of the data, where $C_k$ is the set of points assigned to cluster $k$, this algorithm minimises the overall squared error

$J = \sum_{k=1}^{K} \sum_{x_n \in C_k} \lVert x_n - \mu_k \rVert^2$   (4)

where the $C_k$ are the clusters and the $\mu_k$ are the cluster means. The two steps of updating the cluster assignments and re-estimating the centres correspond to the E step and M step of the EM algorithm respectively.

Equation (4) is also known as the cost function.

The main problem of k-means is its dependency on the initially chosen centers. If the centers are chosen arbitrarily, then the algorithm converges very slowly or results in poor clustering. However, this can be solved by using the K-Means++ [arthur2007k] algorithm, which selects centers that are far apart from each other in a probabilistic manner.

Through K-Means++, we choose the initial centres as follows [TanVYF] (a short scikit-learn sketch follows these steps) -

  1. Take the first centre $\mu_1$ to be a data point chosen uniformly at random from the dataset $X$.

  2. Choose the next centre $\mu_k$ by selecting a data point $x$ with probability proportional to $D(x)^2$, where $D(x)$ denotes the shortest distance from the data point $x$ to its closest already-chosen centre.

  3. Repeat the previous step until we have $K$ centres.
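
As a rough illustration, scikit-learn's KMeans estimator implements both Lloyd's iterations and the K-Means++ initialisation described above; the matrix shape, cluster count and random data below are placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(500, 5)   # hypothetical matrix: one row per gene, one column per timepoint

# init="k-means++" picks initial centres that are far apart (probabilistically);
# n_init restarts the algorithm several times to guard against poor local optima.
km = KMeans(n_clusters=5, init="k-means++", n_init=10, random_state=0).fit(X)

labels = km.labels_      # hard cluster assignment for each gene
cost = km.inertia_       # value of the squared-error cost function J in (4)
print(labels[:10], cost)
```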

2.2.2.1 Limitations

Due to its ease of implementation and termination in a finite number of steps, K-Means is one of the most popular (and usually one of the first) algorithms applied for unsupervised learning problems.

However, due to the nature of the distance function chosen (Euclidean), K-Means doesn't capture temporal dependencies in the data. Thus, it is not entirely suitable for our time-series dataset. Moreover, K-Means does hard clustering, i.e. a point can belong to only one cluster, whereas in real data a point may plausibly belong to more than one cluster.

2.2.3 Gaussian Mixture Models

Gaussian mixture models (GMMs) are mixtures, or linear combinations, of models with Gaussian distributions. They are widely used in data mining, statistical analysis and pattern recognition problems

[bishop2006pattern].

Figure 4: Mixture of 3 Gaussians in one dimension

A Gaussian mixture model is represented by

$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$   (5)

which is a mixture of $K$ Gaussians. Each $\mathcal{N}(x \mid \mu_k, \Sigma_k)$ is a component of the mixture model, having its own mean $\mu_k$ and covariance $\Sigma_k$, and the $\pi_k$ are the component weights, such that $\sum_{k=1}^{K} \pi_k = 1$.

To use GMMs for clustering, we assume that our data are generated i.i.d. from a mixture of Gaussians, and we apply EM to estimate the means, covariances and weights, and thus determine the clusterings. We achieve this by introducing a latent variable $z$ in (5). This latent variable has a 1-of-$K$ representation, so that $z_k \in \{0, 1\}$, $\sum_k z_k = 1$ and $p(z_k = 1) = \pi_k$; writing down $p(z)$ and $p(x \mid z)$, we get

$p(x) = \sum_{z} p(z)\, p(x \mid z) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$   (6)

where $p(z) = \prod_{k=1}^{K} \pi_k^{z_k}$ and $p(x \mid z) = \prod_{k=1}^{K} \mathcal{N}(x \mid \mu_k, \Sigma_k)^{z_k}$.

Another important quantity is the conditional probability of $z$ given $x$, also known as the responsibility that component $k$ takes for producing the observation $x$, denoted by

$\gamma(z_k) \equiv p(z_k = 1 \mid x) = \dfrac{\pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x \mid \mu_j, \Sigma_j)}$   (7)

These quantities involving the latent variable $z$ help us formulate EM for GMMs, given in Algorithm 3.

Init: Initialise the means $\mu_k$, covariances $\Sigma_k$ and mixing coefficients $\pi_k$, and evaluate the initial value of the log-likelihood.
Step $l$, E Step: Evaluate the responsibilities $\gamma(z_{nk})$ from (7) using the current parameter values.
Step $l$, M Step: Re-estimate the parameters using the current responsibilities:
$\mu_k^{\mathrm{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, x_n$, $\quad \Sigma_k^{\mathrm{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) (x_n - \mu_k^{\mathrm{new}})(x_n - \mu_k^{\mathrm{new}})^{\mathsf{T}}$, $\quad \pi_k^{\mathrm{new}} = \frac{N_k}{N}$
where $N_k = \sum_{n=1}^{N} \gamma(z_{nk})$.
Evaluate the log-likelihood $\ln p(X \mid \mu, \Sigma, \pi) = \sum_{n=1}^{N} \ln \big\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \big\}$
and terminate if one of the following criteria is met -
  1. Parameter estimates have converged (stopped changing).

  2. The log-likelihood has converged.

  3. The maximum number of iterations has been reached.

Otherwise, return to the E step.
Algorithm 3 EM for Gaussian Mixture Models

The clusterings are indicated by the latent variable $z$.

2.2.3.1 Covariance Matrix

Using different settings of the covariance matrix for each component helps restrict the number of parameters that need to be estimated (thus reducing the runtime of the algorithm). However, some of the settings (diagonal, tied and spherical) restrict the contours of the clusters obtained, as seen in Figure 5. In the diagonal setting, the elliptical clusters obtained are aligned with the coordinate axes; in the spherical setting, the clusters are circular; and in the tied setting, all clusters have the same shape. Assuming we have $K$ components and $D$-dimensional data, the four different covariance settings are summarised in Table 1.

Covariance   Description                                                 Covariance parameters
Full         Each component has its own general covariance matrix        $KD(D+1)/2$
Diagonal     Each component has its own diagonal covariance matrix       $KD$
Spherical    Each component has its own single variance                  $K$
Tied         All components share the same general covariance matrix     $D(D+1)/2$
Table 1: Covariance Settings for GMM
Figure 5: Different covariance settings for the 2D Iris dataset
2.2.3.2 Limitations

As discussed for previous models, GMM training converges to a local optimum, so we need to run the model multiple times to determine the best fit. Moreover, GMMs take a long time to run; however, we can indicate different parameter settings for the covariance matrix based on what we know (or assume) about the data. This helps keep the number of parameters that we are estimating under control and reduces the computation time.
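
As a rough sketch of how this looks in practice, scikit-learn's GaussianMixture exposes exactly these covariance settings, and its n_init argument re-runs EM from several initialisations; note that scikit-learn's bic() is defined so that lower values are better, the opposite sign convention to the one used later in this report. The data and ranges below are placeholders.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.rand(500, 5)        # hypothetical genes x timepoints matrix

best = None
for cov in ["spherical", "diag", "tied", "full"]:      # the four settings in Table 1
    for k in range(2, 11):
        gmm = GaussianMixture(n_components=k, covariance_type=cov,
                              n_init=5, random_state=0).fit(X)
        score = gmm.bic(X)                              # lower is better in sklearn
        if best is None or score < best[0]:
            best = (score, cov, k, gmm)

print("selected covariance/K:", best[1], best[2])
labels = best[3].predict(X)           # hard assignments
resp = best[3].predict_proba(X)       # responsibilities gamma(z_nk)
```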

2.2.4 Hidden Markov Models

In a first-order Markov chain describing a sequence of possible events, the probability of each event depends only on the state attained in the previous event.

Figure 6: First-order Markov Chain

A hidden Markov model can be viewed as a Markov chain of latent variables, with each observation conditioned on the state of the corresponding latent variable. The states of the Markov chain are not observed directly, i.e. the Markov chain itself remains hidden [bishop2006pattern].

Figure 7: Hidden Markov Model

A single time slice of an HMM corresponds to a mixture distribution, with component densities given by $p(x \mid z)$. Therefore HMMs can also be interpreted as an extension of a mixture model in which the choice of mixture component for each observation is not selected independently but depends on the choice of component for the previous observation [bishop2006pattern].

HMMs are widely used in speech recognition, natural language modelling, on-line handwriting recognition and for the analysis of biological sequences such as proteins and DNA [bishop2006pattern]. Thus, we can use a Hidden Markov Model (HMM) to model temporal dependencies in the data.

An HMM contains these essential components: an observed sequence of data $X = (x_1, \ldots, x_T)$, a corresponding sequence of hidden states $Z = (z_1, \ldots, z_T)$, transition probabilities, and emission probabilities.

The hidden states in an HMM are analogous to the latent variables in GMMs, responsible for generating the given observations. The probability distribution of a hidden state $z_t$ depends upon the state of the previous hidden state $z_{t-1}$. This is encoded by a transition probability matrix $A$, with $A_{jk} = p(z_t = k \mid z_{t-1} = j)$. For an HMM with $K$ hidden states, the transition probability matrix has $K(K-1)$ independent parameters.

The emission probabilities are the conditional distributions of the observed variables, $p(x_t \mid z_t, \phi)$, where $\phi$ is a set of parameters governing the distribution. These may be given by Gaussians if the elements of $x_t$ are continuous variables, or by conditional probability tables if $x_t$ is discrete [bishop2006pattern, glass_zue_2003].

There are 3 basic problems tackled by hidden Markov models [TanVYF, glass_zue_2003] -

  1. Given an observation sequence $X$ and a model $\lambda$, compute the probability $p(X \mid \lambda)$. This is achieved using the Forward-Backward algorithm.

  2. Given an observation sequence $X$, compute the most likely hidden state sequence. This is done using the Viterbi algorithm.

  3. Adjust the model parameters of $\lambda$ to maximise $p(X \mid \lambda)$, using the Baum-Welch algorithm.

2.2.4.1 HMM-based Clustering

We can apply the EM algorithm defined in Algorithm 1 to do clustering using HMMs.

The EM procedure for HMM-based clustering is fairly complex, as we need to run a sub-EM procedure (the Baum-Welch algorithm) in the M step, once for each of the $K$ models.

Starting from an initial collection of HMMs, EM iteratively finds cluster assignments that maximise the joint likelihood of the clustering.

Initialisation: Given $N$ sequences $s_n$, where $n \in [1, N]$, we randomly generate $K$ HMMs $\lambda_1, \ldots, \lambda_K$ and a partitioning $P = \{P_1, \ldots, P_K\}$, where the assignment of sequences to clusters is done to maximise the following objective function:

$\mathcal{L}(P, \lambda_1, \ldots, \lambda_K) = \sum_{k=1}^{K} \sum_{s_n \in P_k} \log p(s_n \mid \lambda_k)$   (8)

where $p(s_n \mid \lambda_k)$ defines the likelihood function, i.e. the probability density for generating sequence $s_n$ by model $\lambda_k$.

Assuming that we have a collection of $K$ HMMs, our algorithm is as follows (a sketch using the hmmlearn library follows these steps):

  1. Iteration $l$ ($l \in \{1, 2, \ldots\}$):

    1. E Step: Generate a new partitioning of the sequences by assigning each sequence $s_n$ to the model $\lambda_k$ for which the likelihood $p(s_n \mid \lambda_k)$ is maximal.

    2. M Step: Re-estimate the parameters of each model using the Baum-Welch algorithm [glass_zue_2003], starting from that model's parameters from the previous iteration and using the sequences assigned to it.

  2. Stop, if one of the following takes place -

    1. The improvement of the objective function is below a given threshold

    2. There are no more re-assignments

    3. The maximum number of iterations is reached
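
A simplified sketch of this procedure using the hmmlearn library is given below; the number of clusters, the number of hidden states, the synthetic sequences and the empty-cluster re-seeding are all illustrative assumptions rather than the exact implementation used in this report.

```python
import numpy as np
from hmmlearn import hmm

def hmm_cluster(seqs, n_clusters=4, n_states=3, n_rounds=10, seed=0):
    """EM-style HMM clustering sketch; seqs has shape (n_genes, T, 1)."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(n_clusters, size=len(seqs))   # random initial partition
    for _ in range(n_rounds):
        models = []
        for k in range(n_clusters):                     # M step: Baum-Welch per cluster
            member = seqs[labels == k]
            if len(member) == 0:                        # pragmatic guard: re-seed empty clusters
                member = seqs[rng.choice(len(seqs), size=5, replace=False)]
            m = hmm.GaussianHMM(n_components=n_states, covariance_type="diag",
                                n_iter=20, random_state=seed)
            m.fit(member.reshape(-1, 1), lengths=[member.shape[1]] * len(member))
            models.append(m)
        # E step: reassign each sequence to the model with maximal log-likelihood
        new = np.array([np.argmax([m.score(s) for m in models]) for s in seqs])
        if np.array_equal(new, labels):                 # stop when assignments settle
            break
        labels = new
    return labels

seqs = np.random.rand(200, 5, 1)    # hypothetical: 200 genes, 5 timepoints each
print(np.bincount(hmm_cluster(seqs)))
```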

2.2.4.2 Limitations

The biggest limitation of HMM-based clustering is convergence to a local optimum, so the algorithm needs to be run multiple times with different random initialisations to find the best fit. Moreover, like GMM training, HMM-based clustering takes significant computation time, especially due to the Baum-Welch EM procedure in the M step of the clustering.

2.2.5 Bayesian Networks

Bayesian networks are probabilistic graphical models that represent a set of variables and their conditional dependencies via directed acyclic graphs.

In [TanVYF], we studied the Chow-Liu algorithm to learn tree-structured graphical models. In [pham2009unsupervised], the authors combined Chow-Liu Trees and Expectation Maximisation to derive a model-based clustering method, called the CL Multinet Classifier.

If there are $K$ classes, then $K$ CL trees are learned. Each tree distribution approximates the joint probability distribution of the attributes, given a specific class. The root attribute has no parent, while every other attribute has exactly one parent, i.e. if the attributes are expressed as random variables $X_1, \ldots, X_d$, then the root has no parent and each remaining $X_i$ has a single parent attribute.

An inherent advantage of applying Bayesian networks is that we can also infer the structure behind each cluster. For each cluster, we can see which timepoints affect the others.

Using Bayes' theorem, the posterior probability of a class $c_k$ (modelled by its Chow-Liu tree) given an observation $\mathbf{x}$ can be expressed as

$p(c_k \mid \mathbf{x}) = \dfrac{p(\mathbf{x} \mid c_k)\, p(c_k)}{p(\mathbf{x})}$   (9)

We observe that the denominator is constant with respect to the class, so we express it as a normalising constant $\alpha$.

Then, the previous formula can be re-written as

$p(c_k \mid \mathbf{x}) = \alpha \, p(\mathbf{x} \mid c_k)\, p(c_k)$   (10)

The edge weights of a Chow-Liu tree are given by the mutual information between pairs of attributes, computed from the empirical probability distributions observed in the data:

$I(X_i; X_j) = \sum_{x_i, x_j} \hat{p}(x_i, x_j) \log \dfrac{\hat{p}(x_i, x_j)}{\hat{p}(x_i)\, \hat{p}(x_j)}$   (11)

A maximum weighted spanning tree (either Prim’s/Kruskal’s) algorithm is then used to find the Chow-Liu tree for each class.
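
For intuition, the sketch below estimates pairwise mutual information from discretised data and extracts a maximum-weight spanning tree by negating the weights and calling SciPy's minimum-spanning-tree routine; the data are placeholders, and the class-conditional weighting of the full CL Multinet procedure is omitted.

```python
import numpy as np
from sklearn.metrics import mutual_info_score
from scipy.sparse.csgraph import minimum_spanning_tree

def chow_liu_edges(X):
    """X: discretised data of shape (n_samples, n_attributes). Returns tree edges."""
    d = X.shape[1]
    mi = np.zeros((d, d))
    for i in range(d):
        for j in range(i + 1, d):
            # Empirical mutual information between attributes i and j, as in (11)
            mi[i, j] = mutual_info_score(X[:, i], X[:, j])
    # A maximum-weight spanning tree is a minimum spanning tree on negated weights
    mst = minimum_spanning_tree(-mi).toarray()
    return [(int(i), int(j)) for i, j in zip(*np.nonzero(mst))]

X = np.random.randint(0, 10, size=(300, 5))   # hypothetical discretised timepoints
print(chow_liu_edges(X))
```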

Unsupervised training of the CL trees (henceforth referred to as CL Multinets) is carried out assuming that the data have been generated from a mixture of Bayesian networks. This mixture can be described as

$p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \, p_k(\mathbf{x})$   (12)

where

$p_k(\mathbf{x}) = \prod_{i=1}^{d} p_k\big(x_i \mid x_{\mathrm{pa}(i)}\big)$   (13)

The $\pi_k$ are the mixing coefficients and the $p_k$ are the joint probability distributions encoded by the CL Multinets (with $\mathrm{pa}(i)$ denoting the parent of attribute $i$ in the tree). Given $N$ observations, $K$ clusters (CL Multinets), and initial partitions $P = \{P_1, P_2, \ldots, P_K\}$, the EM procedure for the $l$-th iteration is detailed in Algorithm 4.

E-Step: For $k = 1, \ldots, K$ and $n = 1, \ldots, N$, compute the posterior probability of observation $\mathbf{x}_n$ belonging to partition $P_k$,
$p(c_k \mid \mathbf{x}_n) = \dfrac{\pi_k \, p_k(\mathbf{x}_n)}{\sum_{j=1}^{K} \pi_j \, p_j(\mathbf{x}_n)}$   (14)
C-Step: Update the partition $P = \{P_1, P_2, \ldots, P_K\}$ by assigning each $\mathbf{x}_n$ to the cluster that provides the maximum posterior probability.
M-Step: For $k = 1, \ldots, K$, maximise the Classification Maximum Likelihood (CML) criterion and re-estimate the parameters using the new partitions $P_k$,
$\mathrm{CML} = \sum_{k=1}^{K} \sum_{\mathbf{x}_n \in P_k} \log p_k(\mathbf{x}_n) + \sum_{k=1}^{K} N_k \log \pi_k$   (15)
where $N_k$ is the number of samples belonging to the $k$-th cluster, and the $\pi_k$ are the cluster weights. The parameters for the second term are re-estimated (maximised) using the following formula:
$\pi_k = \dfrac{N_k}{N}$   (16)
The first term (after some manipulation) contains the mutual information computed from the observed data (empirical distributions). Using the maximum weighted spanning tree algorithm, we maximise (15), resulting in a new tree distribution for each cluster.
Terminate: If any of the following criteria are reached:
  1. Maximum number of iterations reached.

  2. CML converges.

  3. Parameters converge.

Algorithm 4 Expectation Maximisation for Bayesian Networks
2.2.5.1 Limitations

Although Bayesian networks are directed acyclic graphs and thus use fewer parameters than HMMs, unlike HMMs we need to discretise the data in order to use empirical frequencies to estimate the joint probability distributions. Discretisation also results in a loss of information, as the resultant dataset depends on the number of bins chosen. Moreover, just like HMM-based clustering, there exist only general information-theoretic criteria to determine the number of clusters $K$, and these are not always applicable [pham2009unsupervised].

2.2.6 Hierarchical Clustering

All the previous models that we covered can be classified under partitional clustering. The other category, hierarchical clustering, seeks to build a hierarchy of clusters using a given distance metric for the data and a similarity criterion for the clusters obtained. However, unlike partitional clustering, there exist no definitive statistical techniques to determine the optimal number of clusters; instead, we need to examine the resulting hierarchical structure (dendrogram) and visually determine the optimal number of clusters [maimon2005data].

There are two approaches to hierarchical clustering -

  • Agglomerative: Each datum initially represents a cluster of its own. Clusters are then successively merged until the desired cluster structure is obtained.

  • Divisive: All objects initially belong to one cluster. This main cluster is then divided into sub-clusters, which are successively sub-divided into their own sub-clusters, and this process continues until the desired cluster structure is obtained.

We set the distance metric (for within-cluster distances) as Euclidean. The linkage criterion in hierarchical clustering specifies the dissimilarity between any two sets (clusters) obtained. It determines which clusters to merge by expressing the distance between two clusters as a function of the pairwise distances between their observations. The different types of linkage are -

  • Single: the distance between two clusters is equal to the shortest distance from any member of one cluster to any member of the other cluster.

  • Average: the distance between two clusters is equal to the average distance from any member of one cluster to any member of the other cluster.

  • Complete: the distance between two clusters is equal to the longest distance from any member of one cluster to any member of the other cluster.

  • Ward: the distance is defined as the error function of the unified cluster minus the error functions of the individual clusters, where the error function is the average distance of each data point in a cluster to the cluster centroid.

Figure 8: Hierarchical Clustering Dendrogram for Single-Linkage

For this report, we use agglomerative hierarchical clustering not in model building, but in Section 4.1 for feature engineering, so as to obtain a rough estimate of the number of clusters we may expect from the data.
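
A brief SciPy sketch of this step, on a hypothetical foldChange matrix; the complete linkage, Euclidean metric and the cut at 5 clusters mirror the choices discussed in Section 4.1, but the data are placeholders.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(200, 4)    # hypothetical foldChange features (4 timepoint intervals)

# Agglomerative clustering with Euclidean distances and complete linkage
Z = linkage(X, method="complete", metric="euclidean")

# Cut the dendrogram at the level that yields 5 clusters (within the 4-6 suggested by the heatmaps)
labels = fcluster(Z, t=5, criterion="maxclust")
print(np.bincount(labels))

# scipy.cluster.hierarchy.dendrogram(Z) can be used to draw the tree itself
```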

2.3 Evaluation Metrics for Model Selection

Given data, model selection is the task of selecting the best statistical model from a set of candidate models.

In our clustering analysis, the main conundrum is determining the number of clusters.

The estimation of the true number of classes has been recognised as "one of the most difficult problems in cluster analysis" [bock1996probabilistic]. The correct choice of $K$ is often ambiguous; sometimes, choosing the best-fitting model is not the wisest choice. Consider increasing $K$ without any penalty: this will keep reducing the error, but the model is less than ideal (e.g. in K-Means, if we have $K = N$ clusters for $N$ points, each point is its own centre, reducing the cost function to 0).

Moreover, given multiple candidate models having similar accuracy, the simplest model is most likely to be the best choice (Occam’s razor).

Thus, the optimal choice of $K$ balances accuracy and the number of parameters required for the model. This is tackled in Sections 2.3.1 and 2.3.2.

A secondary conundrum is comparing two different clustering models, given that we obtain "ground truth" values for some of the clusters. This is handled in 2.3.3.

2.3.1 Information Criteria

When a statistical model is used to represent the process that generated the data, the representation will almost never be exact. Some information is thus lost by the model. Bayesian Information Criteria (BIC) and Akaike Information Criteria (AIC) [burnham2004multimodel] deal with estimating this lost information, and calculate the trade-off between the goodness of fit of the model and the complexity of the model.

2.3.1.1 BIC and AIC

BIC and AIC penalise the total number of parameters. Both are functions of the maximised log-likelihood of the model and the estimated number of parameters; however, BIC penalises model complexity more heavily. In the (maximisation) form used in this report, we define the AIC and BIC of a model $m$ as

$\mathrm{AIC}(m) = \ln \hat{L}_m - p_m, \qquad \mathrm{BIC}(m) = \ln \hat{L}_m - \frac{p_m}{2} \ln N$

where $\hat{L}_m$ is the maximised likelihood of the model $m$, $p_m$ is the number of free parameters and $N$ is the number of data points. For a range of values of $K$, we calculate the AIC/BIC and choose the $K$ for which this criterion is maximum. Thus, higher AIC or BIC values indicate better-fitting models.

2.3.2 Elbow Method

The elbow method [kodinariya2013review] is a popular method to determine the true number of clusters by plotting the cost function for a given clustering algorithm against different values of $K$, and observing the point where the graph makes a sharp elbow shape, as seen in Figure 9. More formally, this method looks at the percentage of variance explained as a function of the number of clusters. In essence, the elbow method observes the point of sharpest decrease in the cost function, i.e. the point of greatest improvement to the cost function, and estimates that to be the ideal number of clusters.

Figure 9: Elbow method for k-means clustering

However, the limitation of this method is that the elbow point cannot always be unambiguously identified - sometimes there is no elbow, or sometimes there are several elbows.
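
A minimal sketch of the heuristic with scikit-learn, on placeholder data: compute the K-Means cost (inertia) over a range of K and look for the sharpest bend in the resulting curve.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.rand(500, 5)                  # hypothetical expression matrix
ks = range(2, 21)
costs = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), costs, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("Within-cluster squared error (cost)")
plt.show()                                  # the 'elbow' is the sharpest bend in this curve
```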

2.3.3 Misclassification Error (ME) Distance

ME distance is used to calculate the similarity between two different clusterings of the same data. Given the "ground truth" (i.e. true classes of each point), we can compare the clustering obtained by our technique with the ground truth to see how our model compares. ME distance thus represents the well known cost of classification, minimised over all permutations of the labels [meila2005comparing].

It is calculated as follows -

Consider two clusterings $C = \{C_1, \ldots, C_K\}$ and $C' = \{C'_1, \ldots, C'_K\}$ of the same $n$ points.

We define the confusion matrix of $C$ and $C'$ as the $K \times K$ matrix $M$, with entries $m_{kk'} = |C_k \cap C'_{k'}|$. We consider the case where both clusterings use the same number of clusters $K$.

Then the ME distance is defined as

$d_{\mathrm{ME}}(C, C') = 1 - \dfrac{1}{n} \max_{\pi \in \Pi_K} \sum_{k=1}^{K} m_{k, \pi(k)}$

where $\Pi_K$ contains all permutations of the labels $\{1, \ldots, K\}$.

We want the ME Distance between our models and the ground truth to be close to 0.
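
Since the maximisation over label permutations is an assignment problem, it can be solved with the Hungarian algorithm; the sketch below applies SciPy's linear_sum_assignment to the confusion matrix and is an illustration rather than the exact implementation used here.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import confusion_matrix

def me_distance(labels_a, labels_b):
    """Misclassification-error distance between two clusterings of the same points."""
    m = confusion_matrix(labels_a, labels_b)
    rows, cols = linear_sum_assignment(-m)     # permutation maximising the matched counts
    return 1.0 - m[rows, cols].sum() / m.sum()

a = np.array([0, 0, 1, 1, 2, 2])
b = np.array([1, 1, 0, 0, 2, 2])               # same clustering, labels permuted
print(me_distance(a, b))                       # 0.0
print(100 * (1 - me_distance(a, b)), "% accuracy")
```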

2.3.4 Gene Ontology

Gene Ontology (GO) is a framework for representing biological knowledge, unifying the description of gene and gene product attributes across all species. It defines the concepts and classes used to describe gene functions, and the relationships between them.

Our gene ontology analysis covers four annotation categories -

  1. Biological Process: operations or sets of molecular events with a defined beginning and end, pertinent to the functioning of integrated living units: cells, tissues, organs, and organisms.

  2. Cellular Component: the parts of a cell or its extracellular environment.

  3. Molecular Function: the elemental activities of a gene product at the molecular level.

  4. KEGG Pathway: pathways representing molecular interaction, reaction and relation networks for various biological processes.

We use DAVID [huang2007david] to perform gene ontology analysis on the clusters we generate.

Each cluster obtained from our model(s) contains genes that share similar biological properties. We perform GO analysis on all these different clusters, and attempt to identify terms that are related to infection - like viral process, translation, DNA repair, etc. If we do find terms related to viral process, we use guilt-by-association to find other significant terms.

Figure 10: Sample GO for a cluster

The p-value is the probability of seeing at least $x$ genes out of the total $n$ genes in the list annotated to a particular GO term, given the proportion of genes in the whole genome that are annotated to that GO term. In essence, the GO terms shared by the genes in the user's list are compared against the background distribution of the annotation. Thus, the smaller the p-value, the more significant the particular GO term associated with the group of genes is (i.e. the less likely it is that the observed annotation of the particular GO term to a group of genes occurred by chance) [huang2007david].
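
The flavour of this test can be reproduced with a plain hypergeometric upper tail; DAVID itself uses a modified Fisher exact test (the EASE score), and all counts below are made-up numbers for illustration.

```python
from scipy.stats import hypergeom

# Hypothetical numbers: a background of N genes, K of which are annotated with a GO term;
# our cluster contains n genes, x of which carry that term.
N, K, n, x = 20000, 300, 150, 12

# P(observing at least x annotated genes in the cluster by chance)
p_value = hypergeom.sf(x - 1, N, K, n)
print(p_value)
```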

3 Exploratory Analysis

3.1 Data Description

We have 3 datasets consisting of unlabelled human gene expression values, obtained after infection with the EV-A71 virus.

  • RNASeq: 12,340 samples of RNASeq data at 5 timepoints

  • RPF: 11,746 samples of RPF data at 5 timepoints.

  • TE: 11,633 samples of TE data at 5 timepoints.

The attributes are gene expression values at $t$ hours post infection (h.p.i.), where $t \in \{0, 2, 4, 6, 8\}$.

As described in Section 1.3, RNASeq measures the total amount of mRNA for each gene within the cell, while RPF measures the amount of each gene's mRNA that is being actively translated. TE measures the rate at which genes get translated.

3.1.1 Data Preprocessing

Figure 11: RNASeq Data

cdRPKM values are the raw gene expression values; cdReads are the number of times a gene was read (as per the sequencing technique used); and AccNum and GeneName identify the gene.

We first filter the dataset for cdReads to remove noise; this removes genes with very small reads, likely generated due to experimental errors.

Then, we transform the cdRPKM values to $\log_2$ values, which is a standard practice in bioinformatics.

Figure 12: TE Data

We can calculate the TE data in Figure 12 by dividing the raw RPF value by the raw RNASeq value and transforming the ratio to $\log_2$. Equivalently, we can subtract the $\log_2$ RNASeq value from the $\log_2$ RPF value to obtain TE.

3.2 Initial Analysis

For the RNASeq and RPF datasets, we plot individual timepoint values in Figures 13 and 14.

Figure 13: Distribution of RNASeq values at (a) 0 h.p.i, (b) 2 h.p.i, (c) 4 h.p.i, (d) 6 h.p.i and (e) 8 h.p.i

Figure 14: Distribution of RPF values at (a) 0 h.p.i, (b) 2 h.p.i, (c) 4 h.p.i, (d) 6 h.p.i and (e) 8 h.p.i

From the plots, we infer that the values at all timepoints are approximately Gaussian-distributed. We therefore posit that Gaussian Mixture Models and Hidden Markov Models with continuous (Gaussian) emissions will be useful in clustering our time-series gene expression values.

For both datasets, we observe that the median keeps shifting to the left at later timepoints. This is due to the infection spreading: with the passage of time, more viral translation leads to an overall decrease in host gene expression.

To standardise the medians for all timepoints, we again follow a standard practice in bioinformatics - we median-normalise each timepoint (subtract the median from every value) so that each timepoint has the same median (0).
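
A compact pandas sketch of this preprocessing on synthetic data; the column names and the read-count threshold are assumptions, not the values used for the actual datasets.

```python
import numpy as np
import pandas as pd

# Hypothetical RNASeq table with per-timepoint cdRPKM values and a cdReads column
df = pd.DataFrame(np.random.rand(1000, 5) * 100 + 0.1,
                  columns=[f"cdRPKM_{t}h" for t in (0, 2, 4, 6, 8)])
df["cdReads"] = np.random.randint(0, 500, size=len(df))

# 1) Filter out genes with very few reads (likely experimental noise);
#    the threshold of 10 reads is an assumed placeholder.
df = df[df["cdReads"] >= 10]

# 2) log2-transform the cdRPKM expression values
expr = np.log2(df.filter(like="cdRPKM"))

# 3) Median-normalise each timepoint so that every column has median 0
expr = expr - expr.median(axis=0)
print(expr.median(axis=0))
```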

After median-normalisation, the distribution of the maximum and minimum values are generally similar for all time points.

4 Results

We apply the models from Section 2.1 to the datasets described in Section 3, and attempt to identify those groups of genes that are critical for viral translation.

There are two tasks we need to perform - firstly, identify an optimal , and secondly, decide the benchmark algorithm (the most optimal model) that we will compare all our other techniques with.

We evaluate the performance of the models through the methods described in Section 2.3 and interpret the significance of the results obtained.

We perform our analysis in a systematic way -

  1. A priori, we know that the datasets will not contain more than 20-30 clusters. We therefore restrict our analysis to $K \leq 30$.

  2. We apply model-based clustering techniques on the complete datasets, and use the evaluation criteria to determine the optimal $K$.

  3. If the results are inconclusive, we implement feature engineering in Section 4.1 and re-run the model-based clustering techniques on this modified data.

  4. Attempt to determine the optimal $K$ through BIC/AIC.

  5. If no optimal value is found, do clustering for specific values of $K$ (as determined by biologists, as in Figure 15).

  6. Interpret results through GO analysis and/or other analyses.

4.1 Feature Engineering

Since this is time-series data, we also quantify the change between successive timepoints.

From the RNASeq and RPF datasets, we find the amount of change between timepoints and add these as features to the original dataset. Formally known as fold change in bioinformatics, it is calculated as follows -

  • Calculate the $\log_2$ of the gene expression value, cdRPKM, for each timepoint. Let the value at timepoint $t$ be $v_t$.

  • Subtract $v_0$ (the value at timepoint 0) from the value $v_t$ at each successive timepoint.

  • Add this as a new attribute to the RNASeq and RPF data.

For TE data, since the values are already log ratios, we merely subtract the first timepoint's value from each later timepoint's value.

Even with the new features, we might encounter problems due to noise. Fortunately, we have a biological approach of dealing with this.

When a cell is infected with a virus, the RNASeq does not change much (as it measures the total content of all genes), while those genes that help in virus production will get translated more, i.e. the RPF of these genes will increase. Since TE measures translational efficiency (TE = RPF/RNASeq), the TE of such genes will also keep increasing across timepoints.

We filter the genes based on their change at the RNASeq level, keeping only those that do not change much. We retain genes whose RNASeq foldChange values lie between $-0.5$ and $0.5$ at every timepoint; more formally, if $f_t$ is the foldChange at timepoint $t$, then $-0.5 \leq f_t \leq 0.5$ for each $t$ (later on, we reduce this threshold further to $[-0.3, 0.3]$ to check whether the GO terms we obtained are indeed important or not).
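
The foldChange construction and the RNASeq stability filter could be sketched as follows; the data are synthetic and the column names are placeholders, while the [-0.5, 0.5] window matches the threshold described above.

```python
import numpy as np
import pandas as pd

timepoints = [0, 2, 4, 6, 8]
rna = pd.DataFrame(np.random.rand(1000, 5) + 0.1,
                   columns=[f"{t}h" for t in timepoints])   # hypothetical raw cdRPKM values
log_rna = np.log2(rna)

# foldChange: log2 value at each later timepoint minus the value at 0 h.p.i.
fold = log_rna.sub(log_rna["0h"], axis=0).drop(columns="0h")
fold.columns = [f"fc_{c}" for c in fold.columns]

# Keep genes whose RNASeq foldChange stays within [-0.5, 0.5] at every timepoint,
# i.e. genes whose total transcript level barely changes during infection.
stable = fold[(fold.abs() <= 0.5).all(axis=1)]
print(len(stable), "genes retained")
```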

After this filtering, we are left with 5576 genes that are highly translated.

As we are interested in genes that are highly translated between timepoints, we apply hierarchical clustering only on the foldChange features in the TE dataset. We plot heatmaps for different linkages, and check the general trends to find the number of clusters. In a heatmap, the values of each gene are plotted according to the colour key (bottom left).

The values in the heatmaps are sorted according to how they have been clustered.

In our case, we find that complete linkage performs the best. We can identify 4-6 distinct clusters from the heatmap, and cut the dendrogram accordingly (represented by the green line). The result is shown in Figure 15.

Figure 15: Heatmap with Complete Linkage Hierarchical Clustering for Filtered Data

After applying hierarchical clustering and doing heatmap analysis on the filtered data, we infer that $K$ lies between 4 and 6. We use these values of $K$ for the clustering in Step 5 of the procedure in Section 4.

4.2 Model-Based Clustering Results

4.2.1 K-Means

We first apply K-Means on all the datasets to determine the optimal $K$.

We initialise the algorithm using K-Means++ for optimal clusterings, and apply the Elbow method and BIC to evaluate our results. Results for RNASeq are given in Figure 16.

(a) Elbow Method
(b) BIC Curve
Figure 16: Cluster Analysis for RNASeq

The full set of graphs for all datasets are given in Appendix A, Section A.1.1.

The results are summarised in Table 2. "-" means inconclusive.

Data Elbow Method BIC
RNASeq (5 points) - 15*
RPF (5 points) - -
TE (5 points) 5 17*
Table 2: Optimal K for different datasets, K-Means

*For BIC, if there is no clear peak, we take the first local maximum as the optimal $K$

From the plots, we observe that determining the optimal $K$ is difficult.

There is no clear 'elbow' point, and the BIC keeps increasing with $K$; thus the BIC is not penalising K-Means enough. (Using AIC is futile, as the penalisation would be even lower in this case.) The elbow plots and GO analysis also do not mutually agree on a value of $K$.

Despite the mixed results from the information criteria, we went ahead and performed GO analysis as per the $K$ determined in Table 2. However, the GO results were also found to be inconclusive: there were multiple clusters associated with the 'viral process' term, and the p-values were quite high (i.e. the terms were not enriched).

Thus, there is no reliable estimate for $K$. This is most likely due to 2 reasons -

  • K-means cannot capture temporal dependencies in data.

  • Data is too noisy.

Both can be solved through feature engineering - we add the changes between the timepoints as features.

We filter the data as described in Section 4.1 and, using the heatmap results, perform K-Means clustering for 4, 5 and 6 clusters on the foldChange TE data.

We plot heatmaps to observe the general trends of clusters from this heuristic method, and identify those clusters which have an increasing trend (the clusters are plotted top to bottom in the heatmap and demarcated by black lines).

(a) 4 clusters
(b) 5 clusters
(c) 6 clusters
Figure 17: Heatmaps for K-Means clustering on TE data (top to bottom)

Clusters 2, 3 and 5 in Figures 17(a), 17(b) and 17(c) respectively have an increasing trend, so we focus on the GO analyses of those clusters (covered in Section 4.4).

This heuristic method gives us GO results (specifically, terms related to GTPases from Section 1.2.4.1) that are relevant to our hypothesis. This is covered in greater detail in Section 4.4.

4.2.2 Gaussian Mixture Models

As seen in the K-Means approach, it is futile to do GO analysis on complete datasets (especially for the noisy RNASeq data, as we will most likely observe no significant trends by merely tracking overall changes in gene expression).

To test this, we still apply GMMs on the complete TE data (which are the most likely to show certain trends, since they track genes being translated). As expected, we have no definitive results: the GO is unclear, and the BIC keeps increasing with $K$ for all covariance settings, offering no clear peak. The plots are in Appendix A, Section A.2.

We then apply GMMs on the feature-engineered data from Section 4.1. We use the full covariance matrix and carry out BIC analysis over a range of values of $K$.

Figure 18: BIC for full covariance matrix, TE data

For the filtered TE data, we conclude from Figure 18 that the ideal number of clusters is $K = 2$. We plot the heatmap and general foldChange graph (in Section A.2.2) for this clustering to check the general trend as per the heatmap (clusters are demarcated by the green line).

Figure 19: Heatmap for 2 clusters, GMM clustering on TE data

We see in Figure 19 that Cluster 1 has an increasing trend. The GO analysis of this cluster is similar to that obtained through K-Means - we again observe terms related to GTPases.

We still perform GMM-based clustering for $K = 4, 5$ and $6$ in order to evaluate the performance of GMMs against the benchmark algorithm in Section 4.3.

4.2.3 HMM-based Clustering

We speculated that HMM-based clustering might perform better than other algorithms on the complete and unfiltered datasets, as HMMs inherently take the time dependencies into account.

We assume a small, fixed number of hidden states for simplicity, and apply HMM-based clustering on all the datasets. The plots are in Section A.4.1, while the results are summarised in Table 3.

Data BIC
RNASeq (5 points) 13*
RPF (5 points) 8*
TE (5 points) 16*
Table 3: Optimal K for different datasets, HMMs

*For BIC, if there is no clear peak, we take the first local maximum as the optimal $K$

Again, we observe no clear peaks. The GO results for the local peaks are inconclusive, although some of the terms in the GO for RPF and TE do include GTPases and other related terms. However, there were no other clear trends.

We can deduce from BIC's inability to determine an optimal $K$ that we may be estimating too many parameters and unnecessarily complicating the models (Occam's Razor).

(a) Heatmap for 4 clusters
(b) Heatmap for 5 clusters
Figure 20: HMM Heatmap results on feature-engineered data

We proceed to the next steps of the analysis. We apply HMM-based clustering on the feature-engineered data for $K = 4$ and $5$ (we skip $K = 6$ as HMMs take the longest time to train) and generate heatmaps in Figure 20. We then check the heatmaps for up-regulated clusters, and perform GO analysis on Cluster 2 and Cluster 1 from Figures 20(a) and 20(b) respectively.

The trend graphs of each cluster are given in Section A.4.2.

The GO results from HMMs are comparable to those from GMMs and K-Means, and we again observe GTPases in the gene ontology of the up-regulated clusters.

4.2.4 Bayesian Networks

A limitation of Bayesian networks is that the current implementation is only applicable to discrete random variables (or discrete features).

Although there is some literature that deals with hybrid Bayesian Networks that model data containing both discrete and continuous attributes [cobb2007bayesian], data with purely continuous variables are difficult to model using Bayesian Networks.

Thus, as a first step, we discretise our gene expression data [pham2009unsupervised].

According to [pham2009unsupervised], there is no conclusive way of determining $K$ for CL-Multinet-based clustering, so we skip the AIC/BIC analysis here. Instead, we perform clustering with $K = 4, 5$ and $6$ to determine the structure of the underlying Bayesian network for the complete datasets.

The heatmap results for the TE dataset showed some increasing trends (shown in Figure 29); the GO analysis for the complete TE dataset included GTPases in greater abundance than HMM-based clustering.

Moreover, the underlying structure of each tree (determined by the clustering, for different values of $K$) unfailingly resulted in the same line graph (Figure 21), with the timepoints in order (timepoint 1, followed by timepoint 2, and so on). This, along with the GO results, shows that Bayesian networks cluster even noisy data with a certain level of accuracy.

Figure 21: Tree Structure

We then apply Bayesian networks to the feature-engineered TE dataset. The expression values range from $-2$ to $4$, so we choose 100 equally spaced bins between $-2$ and $4$ and then digitise each observation according to the bin it belongs to.
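
A short NumPy version of this discretisation, on placeholder data:

```python
import numpy as np

X = np.random.uniform(-2, 4, size=(500, 4))   # hypothetical foldChange TE features

bins = np.linspace(-2, 4, 100)                # 100 equally spaced bin edges between -2 and 4
X_disc = np.digitize(X, bins)                 # replace each value by the index of its bin
print(X_disc.min(), X_disc.max())
```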

The heatmaps for 4, 5 and 6 clusters are given in Figure 22. The blue lines demarcate the clusters.

(a) Heatmap for 4 clusters
(b) Heatmap for 5 clusters
(c) Heatmap for 6 clusters
Figure 22: CL Multinets Heatmap results on feature-engineered data

Although the heatmaps are not as uniform and do not display trends as concrete as those from K-Means clustering, the GO results are reliable and contain more GTPase-related terms, along with some new terms related to viral translation (e.g. G2/M transition of mitotic cell [davy2007g2]).

4.3 Performance Evaluation

We compare algorithms against a benchmark for the feature engineered TE dataset (with the foldChange features).

Bayesian networks consider the inherent time-dependency between attributes, resulting in a graph with the timepoints in correct sequence each time.

The GO results are also significant as they capture those genes that are up-regulated yet also distinct from each other despite having similar expression values (unlike K-Means, which clusters according to Euclidean distance between absolute attribute values).

We pick Bayesian networks as our benchmark algorithm, and treat the results obtained from it as the ground truth.

We use the ME distance metric to compare the other techniques against Bayesian networks for $K = 4, 5$ and $6$. For the ME distance, the closer it is to 0, the better the clustering. We calculate the accuracy as $100 \times (1 - d_{\mathrm{ME}})\%$.

Accuracy for Clusters
Algorithm             4             5             6
K-Means          56.52%          48.35%          41.61%
GMMs          58.40%          53.11%          50.31%
HMMs          68.72%          79.45%              -
Table 4: Performance evaluation for all algorithms*

*compared to Bayesian networks

4.4 Biological Results

Existing literature on picornaviruses (of which the EV-A71 virus is a type) suggests that the proteins in the virus interact with certain GTPases and disrupt normal cellular functions (like nuclear transport of other viral and nuclear proteins) [porter2006picornavirus].

Looking more closely at the collection of heatmaps obtained by all the algorithms, we identify all those clusters that are up-regulated and contain terms containing/related to GTPases in their GO analysis.

Figure 23: Heatmap from K-Means, RNASeq filter [-0.5, 0.5]

To be more rigorous about the terms we observe, we also reduce the filter to $[-0.3, 0.3]$ and check the GO analysis of the up-regulated clusters. For each cluster, we only consider enriched GO terms (those with low p-values, as mentioned in Section 2.3.4).

Figure 24: Heatmap from K-Means, RNASeq filter [-0.3, 0.3]

Throughout our (repeated) experiments, we find that terms related to GTPases from Biological Processes (BP) and Molecular Functions (MF) keep re-appearing. The terms of interest are -

  • Positive regulation of GTPase activity

  • Activation of GTPase activity

  • Regulation of small GTPase mediated signal transduction

  • Positive regulation of transcription, DNA templated

  • Signal transduction

Through the principle of guilt-by-association mentioned in Section 2.1, we also identify other enriched terms that are of interest -

  • Viral Process

  • DNA repair

  • Negative regulation of transcription from RNA pol II promoter

  • Establishment of cell polarity

  • Smoothened signaling pathway

  • Positive regulation of dendrite development

  • G2/M transition of mitotic cell

Some GO terms (viral process, regulation of small GTPase mediated signal transduction, DNA repair) are observed across multiple clusters in the HMMs and Bayesian Networks analysis. This suggests interactions between the clusters too.

From the GO analysis, we find plenty of evidence from our datasets to support our hypothesis that a certain class/member of GTPase is important for establishing this infection.

5 Conclusion

This thesis explored the application of unsupervised learning techniques on viral genomic data - more specifically, human cells infected with the EV-A71 virus (which causes HFMD). We attempted to determine host factors important to the propagation of this virus, and identified several key biological terms based on our clustering results.

5.1 Summary

As described in Section 1.3.1, we collected RNASeq and RPF gene expression data at different timepoints. In Figure 2, we saw that the space occupied by the virus increases with time. Initial analysis of the complete datasets (RNASeq, RPF, TE) was consistent with this trend. Next, we attempted to determine the groups of genes crucial to viral translation. Initial clustering results contained some relevant GO terms (e.g. viral process, DNA repair) that appeared in multiple clusters; however, the trend graphs and heatmaps proved to be inconclusive, suggesting that our datasets were too noisy, containing many points insignificant to our analysis.

We then focused on those genes which were highly translated at the RPF level but not at the RNASeq level. Collating data from all our GO analyses, we identified the following terms - Positive regulation of GTPase activity, Activation of GTPase activity, Regulation of small GTPase mediated signal transduction, Positive regulation of transcription, DNA templated and Signal transduction.

These terms suggest that some proteins in the GTPase class are important to viral translation, possibly increasing the production of viral genes at key phases.

Following up on these results, we conduct more extensive literature reviews to verify the link between small GTPases and picornavirus infection.

5.1.1 Biological Experiments

We also design experiments to verify whether GTPases are relevant to the EV-A71 virus.

The first experiment tracks changes in GTPase activity throughout infection, focusing on Rho GTPases. For GTPase protein-specific activity, there are also targeted studies to check for the specificity of GAPs/GEFs (GTPase-activating proteins and guanine nucleotide exchange factors), which are related to the activation of GTPases.

The second experiment inhibits specific GTPases to check their effect on viral cells. 3 kinds of drugs targeting different GTPases are used -

  • Specific Rho drugs e.g. Rhosin or C3 transferase against RhoA/B/C, but not cdc42/Rac1

  • Specific cdc42 drugs e.g. Casin, ML141

  • Specific Rho kinase inhibitors e.g. Rhodblock series

Thus, the appearance of terms related to viral process in the GO analysis for all methods suggests that our clustering was useful. The abundance of GTPase-related terms also support our hypothesis that they are indeed vital to the propagation of the virus inside the cell.

6 Future Work

6.1 Algorithms

6.1.1 Tuning

We can improve the performance of our current model-based clustering techniques by incorporating additional data or adjusting the hyperparameters of the prior distributions we use for initialisation. We can consider discretising the data for Hidden Markov Model based clustering [zhao2010hmm]; conversely, we can consider continuous variables for Bayesian networks [hofmann1996discovering].

6.1.2 Different Algorithms

We can also consider self-organizing maps (SOMs) for clustering gene expression data [tamayo1999interpreting]. SOMs impose partial structure on the clusters (in contrast to the rigid structure of hierarchical clustering, the strong prior hypotheses used in Bayesian clustering, and the non-structure of k-means clustering) and facilitate easy visualisation and interpretation. SOMs have good computational properties and are easy to implement, reasonably fast, and scalable to large datasets.

We can also consider soft clustering, or fuzzy clustering; where one gene may belong to more than one cluster. Existing fuzzy-clustering techniques can indeed be modified for short time-series gene expression data [moller2003fuzzy].

Appendix A

A.1 K-Means

A.1.1 BIC Plots

(a) Elbow Method
(b) BIC Curve
Figure 25: Cluster Analysis for RPF
(a) Elbow Method
(b) BIC Curve
Figure 26: Cluster Analysis for TE

A.2 Gaussian Mixture Models

A.2.1 BIC Plots

(a) Spherical
(b) Diagonal
(c) Tied
(d) Full
Figure 27: BIC for different covariance settings in GMMs, TE

A.2.2 Trend Plots

(a) Cluster 1
(b) Cluster 2
Figure 28: Trend plots for foldChange values, 2 clusters, GMM

A.3 Bayesian Networks

A.3.1 Heatmaps for Complete Datasets

The clusters are demarcated by red lines.

(a) 4 clusters
(b) 5 clusters
(c) 6 clusters
Figure 29: Heatmaps for foldChange TE value

A.4 HMM-based Clustering

A.4.1 BIC Plots

(a) RNASeq data
(b) RPF Data
(c) TE Data
Figure 30: BIC results for HMM-based clustering

A.4.2 Trend Plots

(a) Cluster 1
(b) Cluster 2
(c) Cluster 3
(d) Cluster 4
Figure 31: Trend plots for foldChange values, 4 clusters, HMM-based clustering
(a) Cluster 1
(b) Cluster 2
(c) Cluster 3
(d) Cluster 4
(e) Cluster 5
Figure 32: Trend plots for foldChange values, 5 clusters, HMM-based clustering

References