Acknowledgements
First and foremost, I would like to express my utmost gratitude to my supervisors: Assoc. Professor Vincent Tan, for introducing me to machine learning in MA4270, and Dr. Huili Guo, for breaking down complex bioinformatics concepts into simple ones for me to understand. I am extremely grateful for their guidance during the entire period of my honours project. Throughout the year, they have been understanding of my strengths and weaknesses and tolerant of my mistakes, guiding me along when I hit roadblocks, all the while suggesting new ideas and patiently helping me clarify my doubts. I have learnt so much from both of them, both about academia and life, and without their constant guidance, this thesis would not have been possible. I couldn’t have asked for better mentors. Thank you.
I would also like to thank the people in Dr. Huili Guo’s lab, especially Indrik and Shawn, for being so helpful when I was struggling with certain concepts. I also express my appreciation to Zhao Yue for being a helpful companion on this journey, and wish him well for his FYP.
To my friends Spatika, Srishti, Megha, Bushra and Kai Ting: thank you for your support and encouragement. You have listened to my never-ending rants, put up with my idiosyncrasies, seen me at my lowest points, and yet still stuck by me.
Last but not least, I would like to thank my family. Despite living so far away, you have always managed to be there for me. I’m so grateful for your support and encouragement throughout my undergraduate years. To my sister and brother-in-law: despite your busy lives and the personal tragedy you faced last year, you continued lending a listening ear to my insignificant complaints. To my Mom and Dad: you have always been my pillars of support. You have never pressured me and always encouraged me to do my best. From sending me home-cooked meals, to visiting me in Singapore when I couldn’t make it home, to offering me life advice when I was at a crossroads, you have helped me up when I felt like I was on the edge of failure. Thank you so much.
Akankshita Dash
Abstract
This report explores the application of machine learning techniques to short time-series gene expression data. Although standard machine learning algorithms work well on longer time series, they often fail to find meaningful insights from fewer timepoints.
In this report, we explore model-based clustering techniques. We combine popular unsupervised learning techniques such as K-Means, Gaussian Mixture Models, Bayesian Networks and Hidden Markov Models with the well-known Expectation-Maximisation algorithm. K-Means and Gaussian Mixture Models are fairly standard, while Hidden Markov Model and Bayesian Network clustering are more novel ideas suited to time-series gene expression data.
Contents
 1 Introduction
 2 Literature Review
 3 Exploratory Analysis
 4 Results
 5 Conclusion
 6 Future Work
 A Appendix
1 Introduction
Machine learning is a field of artificial intelligence that uses statistical techniques to give a computer the ability to learn from data, without explicitly programming it to do so.
Genomics is a branch of molecular biology concerned with the structure, function, evolution, and mapping of an organism’s complete set of DNA.
In this project, we apply machine learning techniques to viral genomic data recovered from human cells, and attempt to derive useful insights about the EVA71 virus, which causes hand, foot and mouth disease (HFMD).
1.1 Machine Learning
With the large volume of data available nowadays, it is impossible to apply statistical techniques and derive useful insights without making use of a computer.
Considered one of the biggest innovations since the microchip, machine learning enables humans to build models that capture patterns within data, and allows one to define simple rules to improve the performance of these models. The applications of machine learning are manifold: from spam filters in email to online banking, it is omnipresent, and its use cases continue to increase.
A machine learning model can be defined by its training data and its set of hypothesis functions.
Machine learning is broadly divided into the following 3 categories:

Supervised Learning
In supervised learning, the training data are labelled, i.e. each data point is attached to its true value, or ground truth. Typical problems include classification (binary/multi-class) and regression. For example, given a training dataset consisting of pictures of cats and dogs, our model should accurately predict whether a new image is that of a cat or a dog; this is a classification problem. Another simple example is a dataset of people’s heights and weights: given a new person’s height, we should be able to accurately predict their weight; this is a regression problem.

Unsupervised Learning
In unsupervised learning, unlike supervised learning, the training data are unlabelled. The number of labels may or may not be specified beforehand. Unsupervised learning is commonly used as an initial exploratory technique to draw useful patterns from a dataset. For example, given many different books, classifying them into different categories, or clusters, can help us identify the genres they belong to.

Reinforcement Learning
Reinforcement learning is different from both supervised and unsupervised learning. It focuses on maximising reward potential from the data by balancing the trade-off between exploration and exploitation. For example, in recommendation engines, the items shown to a particular user may be ordered based on how much they liked previous products or how different the items are from previous products; this is done to maximise the number of clicks, i.e. the reward.
Since the data that we have in this project are unlabelled, our focus is on unsupervised learning.
1.2 Genomics
Genomes contain the complete set of genes present in an organism, and genomics describes how this genetic information is stored and interpreted in the cell.
1.2.1 Key Definitions and Terms

Nucleobases: Components of nucleotides, the building blocks of DNA: Adenine (A), Thymine (T), Cytosine (C) and Guanine (G). In RNA, Uracil (U) is used instead of Thymine (T).

Codons: Sequences of three DNA or RNA nucleotides that correspond to a specific amino acid or stop signal during protein synthesis. For eukaryotes, there are a total of 64 such codons, which map to 20 amino acids (and stop signals) through the genetic code.

Amino acid: Organic molecules that are the building blocks of proteins. There are 20 canonical amino acids.

DNA: Deoxyribonucleic acid, a double-helix molecule that contains the genetic material for all organisms on Earth (including viruses).

RNA: Ribonucleic acid, a single-stranded polymeric molecule that is essential in various biological roles in the coding, decoding, regulation and expression of genes. Biologically active RNAs, including transfer, ribosomal and small nuclear RNA (tRNA, rRNA, snRNA), fold into unique structures guided by complementary pairing between nucleotide bases. Cellular organisms use messenger RNA (mRNA) to direct the synthesis of specific proteins.

Protein: Sequences of amino acids that carry out the majority of cellular functions such as motility, DNA regulation, and replication.

Transcription: The process in which a DNA sequence of a gene is rewritten, or transcribed, to make an RNA molecule. (In eukaryotes, the RNA polymerase makes a strand of mRNA from DNA).

Translation: The sequence of the mRNA is decoded, or translated, to specify an amino acid sequence (i.e. protein).

Gene: A region of DNA (deoxyribonucleic acid) coding either for the messenger RNA encoding the amino acid sequence in a polypeptide chain or for a functional RNA molecule.

Chromosome: Cellular structures that contain genes. A chromosome comprises a single DNA molecule that may be either circular or linear.

Virus: A small infectious agent that replicates only inside the living cells of other organisms.

Gene Expression: The generation of a functional gene product from the information encoded by a gene, through the processes of transcription and translation. Gene products are often proteins, though non-protein-coding genes can encode functional RNAs, including ribosomal RNA (rRNA), transfer RNA (tRNA) and small nuclear RNA (snRNA).
1.2.2 Central Dogma
The central dogma of biology states that DNA is transcribed by RNA polymerases into mRNA (messenger RNA), which is read by ribosomes to generate protein through translation [crick1970central].
In this project, our focus is on translation. More formally, we will be studying the translational interaction between the EVA71 virus and its host cell, by observing gene expression values at different timepoints.
1.2.3 VirusHost Interaction
When a host gets infected with a virus, the virus hijacks the host cell’s translation machinery, which affects protein production.
In the context of EVA71, the viral replication process begins when a virus infects its host by attaching to the host cell and penetrating the cell wall or membrane. The viral genome then hijacks the host cell’s translation mechanism, forcing it to replicate the viral genome by producing viral proteins to make new capsids (the protein shell of a virus that protects it). The viral particles are then assembled into virions. These virions burst out of the host cell during a process called lysis, killing the host cell. Some viruses take a portion of the host’s membrane during the lysis process to form an envelope around the capsid. [ScitableVirus]
After infection with a virus, overall translation of host genes decreases. This can be seen in Figure 2, where Chromosome E (chrE) denotes the viral genes: the space occupied by the virus increases over time.
Following viral replication, the new viruses may go on to infect new hosts. Many viruses cause diseases in humans, such as influenza, chicken pox, AIDS, the common cold and rabies.
1.2.4 Hand, Foot, and Mouth Disease
Hand, foot, and mouth disease (HFMD) is a common infection caused by a group of viruses. The one we will be examining in this report is the enterovirus A71, or the EVA71 virus. HFMD mostly affects small children, occasionally causing symptoms to develop in adults. It typically begins with a fever, which is followed by rashes and bumps on other parts of the body. Currently, there is no treatment that specifically targets the disease, nor is there a vaccine that is approved for use in Singapore [ang2009epidemiology].
1.2.4.1 GTPases
Viruses depend on host cells for propagation and dissemination. Host cell mechanisms that allow viral entry, facilitate viral replication, and enable viral egress, are targeted for exploitation by viral pathogens [amorim2018comprehensive].
Guanosine-5’-triphosphate (GTP) is a purine nucleoside triphosphate. It is one of the building blocks needed for the synthesis of RNA during transcription, and is involved in energy transfer within the cell. GTPases are a class of proteins that bind and hydrolyse GTP. GTPases include the Rab, Ras, Rac, Ran, Rho, Arf and Sar proteins.
Our focus is on the proteins Rab and Rho. Emerging evidence demonstrates that viruses have evolved numerous strategies to modulate Rab proteins’ functions [amorim2018comprehensive].
In polioviruses, the 2C protein possesses weak GTPase activity, and the 2C protein in EVA71 plays an important role in viral replication [wang2018divergent]. Our hypothesis is that a certain member protein of the GTPase family is important for establishing infection by the EVA71 virus, and could be linked to the Rho GTPases. We focus on translation-level changes in an attempt to find information supporting our conjecture.
1.2.5 Objective
When a virus infects a cell, most of the processes in the cell get shut down and some proteins do not get generated. There are groups of:

1. Genes that escape this shutdown by the virus

2. Genes that are affected by the shutdown

3. Genes that are not influenced by the virus

Group 1 is the most important, as these are the genes that the virus needs for its own production.
Thus, our objective is to identify these groups from the set of genes that we have available. We do this by using unsupervised learning techniques on gene expression data to generate clusters.
1.3 Data Generation through Genome Sequencing Techniques
RNA sequencing refers to techniques used to determine the sequence of RNA molecules. It includes high-throughput shotgun sequencing of cDNA molecules obtained by reverse transcription from RNA, and next-generation sequencing technologies that sequence the RNA molecules within a biological sample, in an effort to determine the primary sequence and relative abundance of each RNA molecule.
For this project, we primarily use gene expression data that has been generated using the following techniques:

RNA-Seq: RNA-Seq uses next-generation sequencing (NGS) to reveal the presence and total quantity of RNA in a biological sample at a given moment.

Ribosome Profiling (RPF): RPF uses specialised messenger RNA (mRNA) sequencing to determine which mRNAs are being actively translated. Unlike RNA-Seq, which sequences all of the mRNA present in a sample, RPF targets only those mRNA sequences that are being actively translated.
1.3.1 Methodology
We posit that host factors regulated in viral translation may be important for establishing viral infection. The infection and analysis were carried out as follows:

Infect cells at various timepoints

Collect lysates post infection (cold-synced) at various timepoints: 0 hours post infection (h.p.i.), 2 h.p.i., 4 h.p.i., 6 h.p.i. and 8 h.p.i.

Perform clustering analysis to identify relevant genes

Validate changes in genes

Test for functions of genes in viral infection
2 Literature Review
2.1 Clustering Time Series Gene Expression Values
Clustering techniques are essential in the data mining process, as they reveal natural structures and identify interesting patterns in the underlying data.
For gene expression data observed at different timepoints, clustering can help identify biologically relevant groups of genes and samples.
Through the principle of guilt-by-association [quackenbush2003microarrays], if we find a known cluster that contains some uncharacterised genes, we can hypothesise that these genes are significant.
As mentioned in [d2005does], there is no one-size-fits-all solution to clustering, or even a consensus on what an ideal clustering should look like. Clusters may have arbitrary shapes and sizes, and each clustering criterion imposes a certain structure on the data. If the data happen to conform to the requirements of a particular criterion, the true clusters are recovered; however, this is rarely the case. Each algorithm imposes its own set of biases on the clusters it determines. Most (reasonable) clustering algorithms yield similar results on synthetically constructed datasets; in practice, however, they can give widely differing results on real-world, noisy gene expression data.
For temporal gene expression data, the most commonly used clustering algorithms [friedman2000using, jiang2004cluster, schliep2003using] are:

Model-Based Clustering:

  K-Means

  Gaussian Mixture Models

  Hidden Markov Models

  Bayesian Networks

Self-Organising Maps
For all our model-based clustering techniques, we determine partitions by using the expectation-maximisation (EM) algorithm (described in Section 2.2.1) to obtain maximum likelihood estimates of model parameters.
In this report, we will use hierarchical clustering for exploratory analysis, and apply all the model-based clustering techniques to the different datasets we are provided with.
2.2 Algorithms and Models
2.2.1 Expectation Maximisation (EM) Algorithm
The expectation-maximisation (EM) algorithm is a general technique for finding maximum likelihood solutions for probabilistic models having latent variables [bishop2006pattern]. EM is used for all the model-based techniques covered in this report.
2.2.1.1 Procedure
Consider a probabilistic model in which we collectively denote all the observed variables by $X$ and the hidden variables by $Z$. The joint distribution $p(X, Z \mid \theta)$ is governed by a set of parameters, denoted $\theta$. We try to maximise the likelihood function, which is given by

$$p(X \mid \theta) = \sum_{Z} p(X, Z \mid \theta) \quad (1)$$

If $Z$ is continuous, we replace the summation with integration as appropriate. The maximum likelihood estimator is $\hat{\theta} = \arg\max_{\theta} p(X \mid \theta)$.
Direct optimisation of $p(X \mid \theta)$ is difficult due to the presence of latent variables, so we instead optimise the expected value of the complete-data log-likelihood given our current model. We are now ready to define the general EM algorithm (Algorithm 1):

E step: Evaluate $p(Z \mid X, \theta^{\text{old}})$, i.e. find the conditional probability distribution of the latent variables given our current model parameters $\theta^{\text{old}}$ and the data $X$, and compute

$$Q(\theta, \theta^{\text{old}}) = \sum_{Z} p(Z \mid X, \theta^{\text{old}}) \ln p(X, Z \mid \theta) \quad (2)$$

M step: Update the parameters via

$$\theta^{\text{new}} = \arg\max_{\theta} Q(\theta, \theta^{\text{old}}) \quad (3)$$

Iterate the E and M steps until one of the following holds:

the maximum number of iterations specified is reached;

the parameter estimates stop changing;

the likelihood function stops changing.
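To make the procedure concrete, the following is a minimal from-scratch sketch of EM for a two-component, one-dimensional Gaussian mixture (the data and all variable names are illustrative, not from the thesis); it shows the E step, the M step, and the stopping criteria listed above.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 200)])

# Initial guesses for the weights, means and standard deviations
pi, mu, sigma = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

def weighted_densities(x, pi, mu, sigma):
    # pi_k * N(x | mu_k, sigma_k) for each point and each component
    return pi * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

prev_ll, max_iter, tol = -np.inf, 200, 1e-6
for _ in range(max_iter):                          # stop: max iterations reached
    # E step: responsibilities p(z | x, theta_old), cf. Equation (2)
    comp = weighted_densities(x, pi, mu, sigma)
    gamma = comp / comp.sum(axis=1, keepdims=True)
    # M step: re-estimate the parameters, cf. Equation (3)
    Nk = gamma.sum(axis=0)
    mu = (gamma * x[:, None]).sum(axis=0) / Nk
    sigma = np.sqrt((gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nk)
    pi = Nk / len(x)
    ll = np.log(weighted_densities(x, pi, mu, sigma).sum(axis=1)).sum()
    if ll - prev_ll < tol:                         # stop: likelihood stopped changing
        break
    prev_ll = ll

print(pi, mu, sigma)   # should roughly recover weights [0.6, 0.4] and means [-2, 3]
```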
2.2.1.2 Limitations
Convergence of the EM algorithm is guaranteed by the Majorisation-Minimisation technique [TanVYF]. However, the EM algorithm typically converges to a local optimum; thus, it needs to be run multiple times in order to find the best-fitting model and its parameters.
2.2.2 K-Means
K-Means is an algorithm to identify groups, or clusters, of data points in a multidimensional space [bishop2006pattern].
We group data points based on their feature similarities according to a given distance metric (Euclidean Distance in our case), by minimising a given cost function.
Given a set of data points $\{x_1, \dots, x_n\}$ and a fixed number of clusters $K$, the algorithm alternates between assigning each point to its nearest center and re-estimating the centers [TanVYF].
For a given partition $C = \{C_1, \dots, C_K\}$, this algorithm minimises the overall squared error

$$J = \sum_{k=1}^{K} \sum_{x \in C_k} \lVert x - \mu_k \rVert^2 \quad (4)$$

where the $C_k$ are the clusters and the $\mu_k$ are the cluster means. The two steps of updating the clusterings and re-estimating the centers correspond to the E step and M step of the EM algorithm respectively.
Equation (4) is also known as the cost function.
The main problem of K-Means is its dependency on the initially chosen centers. If the centers are chosen arbitrarily, the algorithm may converge very slowly or result in a poor clustering. This can be mitigated by the K-Means++ algorithm [arthur2007k], which selects centers that are far apart from each other in a probabilistic manner.
Through K-Means++, we choose the initial centers as follows [TanVYF]:

Take a random data point $c_1$ chosen uniformly from the dataset $X$.

Calculate a new center $c_i$ by choosing $x \in X$ with probability $D(x)^2 / \sum_{x' \in X} D(x')^2$, where $D(x)$ denotes the shortest distance from the data point $x$ to its closest already-chosen center.

Repeat the previous step till we have $K$ centers.
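A short sketch of this seeding scheme, assuming a data matrix X of shape (n, d); the helper name is hypothetical. scikit-learn's KMeans applies the same scheme by default via init="k-means++".

```python
import numpy as np

def kmeans_pp_init(X, K, seed=0):
    """Pick K initial centers from the data matrix X (shape (n, d)) via K-Means++."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]            # first center: uniform at random
    for _ in range(K - 1):
        # D(x)^2: squared distance from each point to its closest chosen center
        d2 = ((X[:, None, :] - np.array(centers)[None]) ** 2).sum(-1).min(axis=1)
        # Choose the next center with probability proportional to D(x)^2
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)

X = np.random.default_rng(1).normal(size=(300, 5))     # placeholder data
print(kmeans_pp_init(X, K=4).shape)                    # (4, 5)
```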
2.2.2.1 Limitations
Due to its ease of implementation and termination in a finite number of steps, K-Means is one of the most popular (and usually one of the first) algorithms applied to unsupervised learning problems.
However, due to the nature of the distance function chosen (Euclidean), K-Means doesn’t capture temporal dependencies in the data, so it is not entirely suitable for our time-series dataset. Moreover, K-Means performs hard clustering, i.e. a point can belong to only one cluster, while in real-life data a point can belong to more than one cluster.
2.2.3 Gaussian Mixture Models
Gaussian mixture models (GMMs) are mixtures, or linear combinations, of models with Gaussian distributions. They are widely used in data mining, statistical analysis and pattern recognition problems [bishop2006pattern]. A Gaussian mixture model is represented by

$$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k) \quad (5)$$

where the equation above represents a mixture of $K$ Gaussians: $\mathcal{N}(x \mid \mu_k, \Sigma_k)$ is a component of the mixture model (each having its own mean $\mu_k$ and covariance $\Sigma_k$), and the $\pi_k$ are the component weights, such that $\sum_{k=1}^{K} \pi_k = 1$.
To use GMMs for clustering, we assume that our data are generated i.i.d. from a mixture of Gaussians, and we apply EM to estimate the means, covariances and weights, and thus determine the clusterings. We achieve this by introducing a latent variable $z$ in (5). This latent variable has a 1-of-$K$ representation, so $z_k \in \{0, 1\}$ and $\sum_{k} z_k = 1$. Defining $p(z_k = 1) = \pi_k$ and $p(x \mid z_k = 1) = \mathcal{N}(x \mid \mu_k, \Sigma_k)$, we get

$$p(x) = \sum_{z} p(z)\, p(x \mid z) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k) \quad (6)$$

Another important quantity is the conditional probability of $z$ given $x$, also known as the responsibility that component $k$ takes for producing the observation $x$, denoted by

$$\gamma(z_k) = p(z_k = 1 \mid x) = \frac{\pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x \mid \mu_j, \Sigma_j)} \quad (7)$$

These formulas containing the latent variable $z$ help us formulate EM for GMMs, given in Algorithm 3, which terminates when one of the following holds:

Parameter estimates have converged (stopped changing).

Log-likelihood has converged.

The maximum number of iterations has been reached.

The clusterings are indicated by the latent variables $z$.
2.2.3.1 Covariance Matrix
Using different settings of the covariance matrix for each component helps restrict the number of parameters that need to be estimated (thus optimising the runtime of the algorithm). However, some of the settings (diagonal, tied and spherical) restrict the contours of the clusters obtained, as seen in Figure 5: in the diagonal setting, the elliptical clusters obtained are aligned with the coordinate axes; spherical clusters are circular; and all clusters have the same shape in the tied setting. Assuming we have $K$ components and $d$-dimensional data, the 4 different covariance settings are summarised in Table 1.
Covariance  Description  Parameters

Full  Each component has its own general covariance matrix  $Kd(d+1)/2$
Diagonal  Each component has a diagonal covariance matrix  $Kd$
Spherical  Each component has a single variance  $K$
Tied  All components share the same general covariance matrix  $d(d+1)/2$

(Parameter counts are for the covariance matrices alone, with $K$ components and $d$-dimensional data.)
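As an illustration, the sketch below fits a GMM with scikit-learn (assumed available); its covariance_type argument ("full", "diag", "spherical", "tied") corresponds directly to the settings in Table 1, and the matrix `expr` is a random placeholder standing in for a genes × timepoints dataset.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

expr = np.random.default_rng(0).normal(size=(500, 5))  # placeholder genes x timepoints

# covariance_type is one of "full", "diag", "spherical", "tied" (cf. Table 1)
gmm = GaussianMixture(n_components=4, covariance_type="full",
                      n_init=10, random_state=0)       # restarts guard against local optima
labels = gmm.fit_predict(expr)    # hard cluster assignments from the latent variable z
resp = gmm.predict_proba(expr)    # responsibilities, cf. Equation (7)
```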
2.2.3.2 Limitations
As with the previous models, GMMs converge to a local optimum, so we need to run the model multiple times to determine the best fit. Moreover, GMMs can take a long time to run; however, we can choose different parameter settings for the covariance matrix based on what we know (or assume) about the data. This helps keep the number of parameters we are estimating under control and reduces the computation time.
2.2.4 Hidden Markov Models
In a firstorder Markov chain describing a sequence of possible events, the probability of each event depends only on the state attained in the previous event.
A hidden Markov model can be viewed as a Markov chain of latent variables, with each observation conditioned on the state of the corresponding latent variable. The states of the Markov chain are not observed directly, i.e. the Markov chain itself remains hidden [bishop2006pattern].
A single time slice of an HMM corresponds to a mixture distribution, with component densities given by $p(x \mid z)$. Therefore HMMs can also be interpreted as an extension of a mixture model in which the choice of mixture component for each observation is not selected independently, but depends on the choice of component for the previous observation [bishop2006pattern].
HMMs are widely used in speech recognition, natural language modelling, online handwriting recognition and for the analysis of biological sequences such as proteins and DNA [bishop2006pattern]. Thus, we can use a Hidden Markov Model (HMM) to model temporal dependencies in the data.
An HMM is described by an observed sequence of data $X = (x_1, \dots, x_T)$, a sequence of hidden states $Z = (z_1, \dots, z_T)$, transition probabilities, and emission probabilities.
The hidden states in an HMM are analogous to the latent variables in GMMs, responsible for generating the given observations. The probability distribution of a hidden state $z_t$ depends upon the state of the previous hidden state $z_{t-1}$; this is encoded by a transition probability matrix. For an HMM with $S$ hidden states, the transition probability matrix has $S(S-1)$ independent parameters. The emission probabilities are the conditional distributions of the observed variables, $p(x_t \mid z_t, \phi)$, where $\phi$ is a set of parameters governing the distribution. They may be given by Gaussians if the elements of $x$ are continuous variables, or by conditional probability tables if $x$ is discrete [bishop2006pattern, glass_zue_2003].
There are 3 basic problems tackled by hidden Markov models [TanVYF, glass_zue_2003]:

Given an observation sequence $X$ and a model $\lambda$, compute the probability $p(X \mid \lambda)$. This is achieved using the Forward-Backward algorithm.

Given an observation sequence $X$, compute the most likely hidden state sequence. This is done using the Viterbi algorithm.

Adjust the model parameters of $\lambda$ to maximise $p(X \mid \lambda)$, using the Baum-Welch algorithm.
2.2.4.1 HMMbased Clustering
We can apply the EM algorithm defined in Algorithm 1 to do clustering using HMMs.
The EM procedure for HMM-based clustering is fairly complex, as we need to run a sub-EM procedure (the Baum-Welch algorithm) in the M step, once for each of the $K$ models.
Starting from an initial collection of HMMs, EM iteratively finds cluster assignments that maximise the joint likelihood of the clustering.
Initialisation: Given $n$ sequences $X_i$, where $i \in [1, n]$, we randomly generate $K$ HMMs $\lambda_1, \dots, \lambda_K$ and a partitioning $P = \{P_1, \dots, P_K\}$, where the assignment to clusters is done to maximise the following objective function:

$$\mathcal{L} = \sum_{k=1}^{K} \sum_{X_i \in P_k} \log L(X_i \mid \lambda_k) \quad (8)$$

where $L(X_i \mid \lambda_k)$ defines the likelihood function, i.e. the probability density for generating sequence $X_i$ by model $\lambda_k$.
Assuming that we have a collection of $K$ HMMs, our algorithm is as follows:

Iteration $t \in \{1, 2, \dots\}$:

E Step: Generate a new partitioning of the sequences by assigning each sequence to the model for which the likelihood is maximal.

M Step: Re-estimate the model parameters using the Baum-Welch algorithm [glass_zue_2003], using each model’s parameters from the previous iteration as its start parameters and the sequences assigned to it as training data.


Stop, if one of the following takes place:

The improvement of the objective function is below a given threshold.

There are no more reassignments.

The maximum number of iterations is reached.

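The loop above can be sketched as follows, assuming the third-party hmmlearn package as one convenient implementation of Baum-Welch (the thesis’s own implementation details are not reproduced here); each row of X is treated as one short one-dimensional sequence.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

def hmm_cluster(X, K, n_states=3, max_iter=20, seed=0):
    """Cluster the rows of X (genes x timepoints) with K Gaussian HMMs."""
    rng = np.random.default_rng(seed)
    T = X.shape[1]
    labels = rng.integers(K, size=len(X))              # random initial partition
    models = [GaussianHMM(n_components=n_states, random_state=seed) for _ in range(K)]
    for _ in range(max_iter):                          # stop: iteration cap
        # M step: re-fit each HMM (Baum-Welch) on the sequences assigned to it
        for k in range(K):
            seqs = X[labels == k]
            if len(seqs):
                models[k].fit(seqs.reshape(-1, 1), lengths=[T] * len(seqs))
        # E step: reassign each sequence to the model with maximal likelihood
        new = np.array([np.argmax([m.score(s.reshape(-1, 1)) for m in models])
                        for s in X])
        if np.array_equal(new, labels):                # stop: no more reassignments
            break
        labels = new
    return labels

labels = hmm_cluster(np.random.default_rng(1).normal(size=(100, 5)), K=4)
```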
2.2.4.2 Limitations
The biggest limitation of HMM-based clustering is convergence to a local optimum, so the algorithm needs to be run multiple times with different random initialisations to find the best fit. Moreover, like GMM training, HMM-based clustering takes significant computation time, especially due to the Baum-Welch EM procedure in the M step of the clustering.
2.2.5 Bayesian Networks
Bayesian networks are probabilistic graphical models that represent a set of variables and their conditional dependencies via directed acyclic graphs.
In [TanVYF], we studied the Chow-Liu algorithm for learning tree-structured graphical models. In [pham2009unsupervised], the authors combined Chow-Liu trees and Expectation-Maximisation to derive a model-based clustering method, called the CL Multinet Classifier.
If there are $K$ classes, then $K$ CL trees are learned. Each tree distribution approximates the joint probability distribution of the attributes, given a specific class. The root attribute has no parents, while each of the other attributes has exactly ONE parent, i.e. if the attributes are expressed as random variables $X_1, \dots, X_d$, then each non-root $X_i$ is conditioned on a single parent.
An inherent advantage of applying Bayesian networks is that we can also infer the structure behind each cluster: for each cluster, we can see which timepoints affect which others.
Using Bayes’ theorem, the posterior probability of a class $c$ under a Chow-Liu tree can be expressed as

$$p(c \mid x) = \frac{p(x \mid c)\, p(c)}{p(x)} \quad (9)$$

We observe that the denominator is constant with respect to the class, so we express it as $\alpha$. Then, the previous formula can be rewritten as

$$p(c \mid x) = \frac{1}{\alpha}\, p(x \mid c)\, p(c) \quad (10)$$
The edge weights of a Chow-Liu tree are given by the mutual information, which is derived from the empirical probability distributions observed in the data:

$$I(X_i; X_j) = \sum_{x_i, x_j} \hat{p}(x_i, x_j) \log \frac{\hat{p}(x_i, x_j)}{\hat{p}(x_i)\, \hat{p}(x_j)} \quad (11)$$
A maximum-weight spanning tree algorithm (either Prim’s or Kruskal’s) is then used to find the Chow-Liu tree for each class.
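As a sketch, the Chow-Liu tree for one class can be built with scikit-learn and networkx (both assumed available, neither confirmed as the thesis’s actual tooling); D is a hypothetical matrix of discretised expression values.

```python
import numpy as np
import networkx as nx
from sklearn.metrics import mutual_info_score

def chow_liu_tree(D):
    """Maximum-weight spanning tree over empirical mutual information, cf. Equation (11)."""
    n_vars = D.shape[1]
    G = nx.Graph()
    for i in range(n_vars):
        for j in range(i + 1, n_vars):
            # Edge weight = empirical mutual information between attributes i and j
            G.add_edge(i, j, weight=mutual_info_score(D[:, i], D[:, j]))
    return nx.maximum_spanning_tree(G)         # Kruskal's algorithm by default

D = np.random.default_rng(0).integers(0, 10, size=(500, 5))   # placeholder discretised data
print(sorted(chow_liu_tree(D).edges()))
```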
Unsupervised training of the CL trees (henceforth referred to as CL Multinets) is carried out assuming that the data have been generated from a mixture of Bayesian networks. This mixture can be described as

$$p(x) = \sum_{k=1}^{K} \pi_k \, p_k(x) \quad (12)$$

where

$$\sum_{k=1}^{K} \pi_k = 1 \quad (13)$$

The $\pi_k$ are the mixing coefficients and the $p_k(x)$ are the joint probability distributions of the CL Multinets. Given $n$ observations, $K$ clusters (CL Multinets), and initial partitions $P = \{P_1, P_2, \dots, P_K\}$, the EM procedure for the $t$-th iteration is detailed in Algorithm 4.
$$\gamma_k(x_i) = \frac{\pi_k \, p_k(x_i)}{\sum_{j=1}^{K} \pi_j \, p_j(x_i)} \quad (14)$$

$$\pi_k = \frac{1}{n} \sum_{i=1}^{n} \gamma_k(x_i) \quad (15)$$

$$\text{CML} = \sum_{k=1}^{K} \sum_{x_i \in P_k} \log \left( \pi_k \, p_k(x_i) \right) \quad (16)$$

In each iteration, (14) gives the E-step responsibilities, (15) re-estimates the mixing coefficients, and each CL Multinet $p_k$ is re-learned from the observations assigned to cluster $k$; the iterations maximise the classification maximum likelihood (CML) criterion (16).

Maximum number of iterations reached.

CML converges.

Parameters converge.
2.2.5.1 Limitations
Although Bayesian networks are directed acyclic graphs and thus use fewer parameters than HMMs, unlike HMMs we need to discretise the data in order to use empirical frequencies to estimate the joint probability distributions. Discretisation results in a loss of information, as the resultant dataset depends on the number of bins chosen. Moreover, just like HMM-based clustering, there exist only general information-theoretic criteria to determine the number of clusters $K$, which are not always applicable [pham2009unsupervised].
2.2.6 Hierarchical Clustering
All the previous models that we covered can be classified under partitional clustering. The other category, hierarchical clustering, seeks to build a hierarchy of clusters using a given distance metric for data points and a similarity criterion for the clusters obtained. However, unlike partitional clustering, there exist no definitive statistical techniques to determine the optimal number of clusters; instead, we need to examine the resulting hierarchical structure (dendrogram) and visually determine the optimal number of clusters [maimon2005data].
There are two approaches to hierarchical clustering:

Agglomerative: Each datum initially represents a cluster of its own. Clusters are then successively merged until the desired cluster structure is obtained.

Divisive: All objects initially belong to one cluster. This main cluster is then divided into subclusters, which are successively subdivided into their own subclusters, and this process continues until the desired cluster structure is obtained.
We set the distance metric (for within-cluster distances) to Euclidean. The linkage criterion in hierarchical clustering specifies the dissimilarity between any two sets (clusters) obtained. It determines which clusters to merge by calculating the difference between two clusters as a function of the pairwise distances between observations. The different types of linkages are:

Single: the distance between two clusters is equal to the shortest distance from any member of one cluster to any member of the other cluster.

Average: the distance between two clusters is equal to the average distance from any member of one cluster to any member of the other cluster.

Complete: the distance between two clusters is equal to the longest distance from any member of one cluster to any member of the other cluster.

Ward: the distance is defined as the error function of the unified cluster minus the error functions of the individual clusters, where the error function is the average distance of each data point in a cluster to the cluster centroid.
For this report, we use agglomerative hierarchical clustering not in model building, but in Section 4.1 for feature engineering, so as to obtain a rough estimate of the number of clusters we may expect from the data.
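A sketch of this exploratory use with scipy, assuming a placeholder matrix of fold-change values; complete linkage is shown since Section 4.1 finds it performs best on our data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.random.default_rng(0).normal(size=(200, 4))     # placeholder fold-change matrix

Z = linkage(X, method="complete", metric="euclidean")  # also: "single", "average", "ward"
info = dendrogram(Z, no_plot=True)                     # plot this (with matplotlib) to pick a cut
labels = fcluster(Z, t=5, criterion="maxclust")        # cut the tree into 5 flat clusters
```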
2.3 Evaluation Metrics for Model Selection
Given data, model selection is the task of selecting the best statistical model from a set of candidate models.
In our clustering analysis, the main conundrum is determining the number of clusters.
The estimation of the true number of classes has been recognised as “one of the most difficult problems in cluster analysis" [bock1996probabilistic]. The correct choice of $K$ is often ambiguous, and choosing the best-fitting model is not always the wisest choice. Consider increasing $K$ without any penalty: this will keep reducing the error, but the model is less than ideal (e.g. in K-Means, if we have $n$ clusters for $n$ points, each point is its own center, reducing the cost function to 0).
Moreover, given multiple candidate models having similar accuracy, the simplest model is most likely to be the best choice (Occam’s razor).
Thus, the optimal choice of $K$ balances accuracy against the number of parameters required by the model. This is tackled in Sections 2.3.1 and 2.3.2.
A secondary conundrum is comparing two different clustering models, given that we obtain "ground truth" values for some of the clusters. This is handled in Section 2.3.3.
2.3.1 Information Criteria
When a statistical model is used to represent the process that generated the data, the representation will almost never be exact. Some information is thus lost by the model. Bayesian Information Criteria (BIC) and Akaike Information Criteria (AIC) [burnham2004multimodel] deal with estimating this lost information, and calculate the tradeoff between the goodness of fit of the model and the complexity of the model.
2.3.1.1 BIC and AIC
BIC and AIC penalise the total number of parameters. Both are functions of the maximised log-likelihood of the model and the estimated number of parameters; however, BIC penalises model complexity more heavily. We define the AIC and BIC as follows:

$$\text{AIC} = \ln \hat{L} - M$$
$$\text{BIC} = \ln \hat{L} - \frac{M}{2} \ln n$$

where $\ln \hat{L}$ is the maximised log-likelihood of the model, $M$ is the number of free parameters and $n$ is the number of data points. For a range of values of $K$, we calculate the AIC/BIC, and choose the $K$ for which the criterion is maximum. Thus, higher AIC or BIC values indicate better-fitting models.
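A sketch of this model selection loop for GMMs using scikit-learn, written in the “higher is better" sign convention above; note that scikit-learn’s built-in gm.bic() uses the opposite convention (lower is better), and _n_parameters() is a private scikit-learn helper, so both points are assumptions of this example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).normal(size=(500, 5))     # placeholder data

def bic(gm, X):
    log_L = gm.score(X) * len(X)          # score() returns the mean log-likelihood per point
    M = gm._n_parameters()                # free-parameter count (private sklearn helper)
    return log_L - 0.5 * M * np.log(len(X))

scores = {k: bic(GaussianMixture(n_components=k, random_state=0).fit(X), X)
          for k in range(2, 21)}
best_k = max(scores, key=scores.get)      # choose the K with the maximum criterion
```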
2.3.2 Elbow Method
The elbow method [kodinariya2013review] is a popular way to determine the true number of clusters: we plot the cost function of a given clustering algorithm against different values of $K$, and observe the point where the graph makes a sharp elbow shape, as seen in Figure 9. More formally, this method looks at the percentage of variance explained as a function of the number of clusters. In essence, the elbow method observes the point of sharpest decrease in the cost function, i.e. the last value of $K$ that brings a large improvement, and estimates that to be the ideal number of clusters.
However, the limitation of this method is that the elbow point cannot always be unambiguously identified: sometimes there is no elbow, and sometimes there are several elbows.
2.3.3 Misclassification Error (ME) Distance
ME distance is used to calculate the similarity between two different clusterings of the same data. Given the "ground truth" (i.e. the true class of each point), we can compare the clustering obtained by our technique with the ground truth to see how our model fares. The ME distance represents the well-known cost of classification, minimised over all permutations of the labels [meila2005comparing].
It is calculated as follows.
Consider two clusterings $C = \{C_1, \dots, C_K\}$ and $C' = \{C'_1, \dots, C'_K\}$ of the same $n$ points. We define the confusion matrix of $C$ and $C'$ as the $K \times K$ matrix $M$ with entries $m_{kk'} = |C_k \cap C'_{k'}|$. Then the ME distance is defined as

$$d_{\text{ME}}(C, C') = 1 - \frac{1}{n} \max_{\sigma \in \Pi_K} \sum_{k=1}^{K} m_{k, \sigma(k)}$$

where $\Pi_K$ contains all permutations of the labels $\{1, \dots, K\}$.
We want the ME Distance between our models and the ground truth to be close to 0.
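A sketch of the ME distance for two equal-length label vectors; the minimisation over label permutations is carried out by the Hungarian algorithm from scipy.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import confusion_matrix

def me_distance(labels_a, labels_b):
    C = confusion_matrix(labels_a, labels_b)   # m_kk' = size of the overlap of clusters k, k'
    row, col = linear_sum_assignment(-C)       # best label permutation (maximise matches)
    return 1.0 - C[row, col].sum() / len(labels_a)

a = np.array([0, 0, 1, 1, 2, 2])
b = np.array([1, 1, 0, 0, 2, 2])               # same clustering, labels permuted
print(me_distance(a, b))                       # 0.0; accuracy = 100 * (1 - d) %
```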
2.3.4 Gene Ontology
Gene Ontology (GO) is a framework for representing biological knowledge, unifying the representation of gene and gene product attributes across all species. It defines the concepts and classes used to describe gene functions, and the relationships between them.
Our ontology analysis covers 4 categories:

Biological Process: operations or sets of molecular events with a defined beginning and end, pertinent to the functioning of integrated living units: cells, tissues, organs, and organisms.

Cellular Component: the parts of a cell or its extracellular environment.

Molecular Function: the elemental activities of a gene product at the molecular level.

KEGG Pathway: pathways representing molecular interaction, reaction and relation networks for various biological processes.
We use DAVID [huang2007david] to perform gene ontology analysis on the clusters we generate.
Each cluster obtained from our model(s) contains genes that share similar biological properties. We perform GO analysis on all these different clusters, and attempt to identify terms that are related to infection  like viral process, translation, DNA repair, etc. If we do find terms related to viral process, we use guiltbyassociation to find other significant terms.
The p-value is the probability of seeing at least $x$ genes, out of the total $n$ genes in the list, annotated with a particular GO term, given the proportion of genes in the whole genome that are annotated with that GO term. In essence, the GO terms shared by the genes in the user’s list are compared to the background distribution of the annotation. Thus, the smaller the p-value, the more significant the particular GO term associated with the group of genes (i.e. the less likely it is that the observed annotation of the particular GO term to a group of genes occurs by chance) [huang2007david].
3 Exploratory Analysis
3.1 Data Description
We have 3 datasets consisting of unlabelled human gene expression values, obtained after infection with the EVA71 virus.

RNA-Seq: 12,340 samples of RNA-Seq data at 5 timepoints.

RPF: 11,746 samples of RPF data at 5 timepoints.

TE: 11,633 samples of TE data at 5 timepoints.
The attributes are gene expression values at $t$ hours post infection (h.p.i.), where $t \in \{0, 2, 4, 6, 8\}$.
As described in Section 1.3, RNA-Seq measures the total amount of each gene’s transcript within the cell, while RPF measures the amount being actively translated. TE measures the rate at which genes get translated.
3.1.1 Data Preprocessing
cdRPKM values are raw gene expression values; cdReads are the number of times a gene was read (as per the biological technique used); and the AccNum and GeneName columns identify the gene.
We first filter the dataset on cdReads to remove noise; this removes genes with very few reads, likely generated due to experimental errors.
Then, we transform the cdRPKM values to $\log_2$ scale, which is a standard practice in bioinformatics.
We can calculate the TE data in Figure 12 by dividing the raw RPF value by the raw RNA-Seq value and transforming the ratio to $\log_2$; equivalently, we can subtract the $\log_2$ RNA-Seq value from the $\log_2$ RPF value to obtain TE.
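A sketch of this preprocessing with pandas; the column names, the toy values and the read-count threshold are all illustrative assumptions (the thesis’s exact cutoff is not reproduced here).

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the real tables; column names and threshold are illustrative.
rna = pd.DataFrame({"GeneName": ["A", "B", "C"], "cdReads": [150, 4, 90],
                    "cdRPKM_0h": [5.2, 0.1, 3.0], "cdRPKM_8h": [2.6, 0.05, 3.1]})
rpf = rna.copy()   # in reality RPF comes from a separate experiment

expr_cols = [c for c in rna.columns if c.startswith("cdRPKM")]

rna = rna[rna["cdReads"] >= 10].copy()         # drop low-read genes (likely noise)
rpf = rpf[rpf["cdReads"] >= 10].copy()
rna[expr_cols] = np.log2(rna[expr_cols])       # standard log2 transform
rpf[expr_cols] = np.log2(rpf[expr_cols])

# TE = log2(RPF / RNA-Seq) = log2(RPF) - log2(RNA-Seq), per gene and timepoint
te = rpf.set_index("GeneName")[expr_cols] - rna.set_index("GeneName")[expr_cols]
```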
3.2 Initial Analysis
From the plots, we infer that the values at all timepoints approximately follow a Gaussian distribution. We therefore posit that Gaussian Mixture Models and Hidden Markov Models with continuous emissions will be useful in clustering our time-series gene expression values.
For both datasets, we observe that the median keeps shifting to the left at later timepoints. This is due to the infection spreading with the passage of time: more viral translation leads to an overall decrease in host gene expression.
To standardise the medians across all timepoints, we again follow a standard practice in bioinformatics: we median-normalise each timepoint (subtract the median from every value) so that each timepoint has the same median (0).
After median-normalisation, the distributions of the maximum and minimum values are generally similar for all timepoints.
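A minimal sketch of median-normalisation with pandas, assuming a hypothetical genes × timepoints frame:

```python
import numpy as np
import pandas as pd

expr = pd.DataFrame(np.random.default_rng(0).normal(loc=1.0, size=(100, 5)),
                    columns=["0h", "2h", "4h", "6h", "8h"])
expr_norm = expr - expr.median(axis=0)     # every timepoint (column) now has median 0
assert np.allclose(expr_norm.median(axis=0), 0)
```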
4 Results
We apply the models from Section 2.1 to the datasets described in Section 3, and attempt to identify those groups of genes that are critical for viral translation.
There are two tasks we need to perform: firstly, identify an optimal $K$; and secondly, decide on the benchmark algorithm (the best-performing model) against which we will compare all our other techniques.
We evaluate the performance of the models through the methods described in Section 2.3 and interpret the significance of the results obtained.
We perform our analysis in a systematic way:

A priori, we know that the datasets will not contain more than 20-30 clusters. We restrict the range of $K$ in our analysis accordingly.

We apply model-based clustering techniques on the complete datasets, and use the evaluation criteria to determine the optimal $K$.

If the results are inconclusive, we implement feature engineering in Section 4.1 and rerun the model-based clustering techniques on this modified data.

Attempt to determine $K$ through BIC/AIC.

If no optimal value is found, do clustering for specific values of $K$ (as determined by biologists, as in Figure 15).

Interpret results through GO analysis and/or other analyses.
4.1 Feature Engineering
Since these are time-series data, we also quantify the change between successive timepoints.
From the RNA-Seq and RPF datasets, we compute the amount of change between timepoints and add these as features to the original dataset. Known formally as fold change in bioinformatics, it is calculated as follows:

Calculate the $\log_2$ of the gene expression value, cdRPKM, for each timepoint $t$. Let this be $g_t$.

Subtract $g_0$ (the value at timepoint 0) from each successive timepoint’s value $g_t$.

Add these differences as new attributes to the RNA-Seq and RPF data.
For the TE data, since the values are already log ratios, we merely subtract the first timepoint’s value from each timepoint’s value.
Even with the new features, we might encounter problems due to noise. Fortunately, we have a biological approach to dealing with this.
When a cell is infected with a virus, the RNA-Seq values do not change much (as they measure the total content of all genes), while genes that help in virus production get translated more, i.e. the RPF of these genes increases. Since TE measures translational efficiency (RPF relative to RNA-Seq), the TE of such genes will also keep increasing across timepoints.
We filter out those genes that do not change much at the RNA-Seq level: we retain genes whose foldChange values lie between $-0.5$ and $0.5$ at every timepoint. More formally, if $f_t$ is the foldChange at timepoint $t$, then we require $f_t \in [-0.5, 0.5]$ for each $t$ (later on, we tighten this threshold to $[-0.3, 0.3]$ to check whether the GO terms we obtained are indeed important).
After this filtering, we are left with 5576 genes that are highly translated.
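The fold-change construction and the RNA-Seq-level filter can be sketched as follows; the frames and column names are illustrative placeholders for the log2-transformed datasets.

```python
import numpy as np
import pandas as pd

cols = ["0h", "2h", "4h", "6h", "8h"]
rng = np.random.default_rng(0)
rna_log2 = pd.DataFrame(rng.normal(size=(1000, 5)), columns=cols)  # placeholder log2 values
te_log2 = pd.DataFrame(rng.normal(size=(1000, 5)), columns=cols)

def fold_change(log2_expr):
    # Subtract the 0 h.p.i. value from each later timepoint (log2 fold change)
    return log2_expr[cols[1:]].sub(log2_expr["0h"], axis=0).add_prefix("fc_")

te_features = pd.concat([te_log2, fold_change(te_log2)], axis=1)   # append new attributes

# Keep genes whose RNA-Seq fold changes all stay within [-0.5, 0.5]
keep = fold_change(rna_log2).abs().le(0.5).all(axis=1)
te_filtered = te_features[keep]
```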
As we are interested in genes that are highly translated between timepoints, we apply hierarchical clustering only on the foldChange features in the TE dataset. We plot heatmaps for different linkages, and check the general trends to find the number of clusters. In a heatmap, the values of each gene are plotted according to the colour key (bottom left).
The values in the heatmaps are sorted according to how they have been clustered.
In our case, we find that complete linkage performs the best. We can identify 4-6 distinct clusters from the heatmap, and cut the dendrogram accordingly (represented by the green line). The result is shown in Figure 15.
After applying hierarchical clustering and doing heatmap analysis on the filtered data, we infer that $K$ lies between 4 and 6. We use these values of $K$ for the clustering in Step 5 of the procedure in Section 4.
4.2 Model-Based Clustering Results
4.2.1 K-Means
We first apply K-Means on all the datasets to determine the optimal $K$.
We initialise the algorithm using K-Means++, and apply the elbow method and BIC to evaluate our results. Results for RNA-Seq are given in Figure 16.
The results are summarised in Table 2, where "–" means inconclusive.

Data  Elbow Method  BIC

RNA-Seq (5 points)  –  15*
RPF (5 points)  –  –
TE (5 points)  5  17*
*For BIC, if there is no clear peak, we take the first local maximum as $K$.
From the plots, we observe that determining the optimal $K$ is difficult.
There is no clear ’elbow’ point, and the BIC keeps increasing with $K$; the penalisation is not strong enough for K-Means. (Using AIC is futile, as its penalisation would be even lower.) The elbow plots and GO analysis do not mutually agree on a value of $K$.
Despite the mixed results from the information criteria, we went ahead and performed GO analysis as per the $K$ determined in Table 2. However, the GO results were also inconclusive: there were multiple clusters associated with the ’viral process’ term, and the p-values were quite high (i.e. the terms were not enriched).
Thus, there is no reliable estimate for $K$. This is most likely due to 2 reasons:

K-Means cannot capture temporal dependencies in the data.

The data are too noisy.

Both can be addressed through feature engineering: we add the changes between the timepoints as features.
We filter the data as described in Section 4.1, and using the heatmap results, perform K-Means clustering for 4, 5 and 6 clusters on the foldChange TE data.
We plot heatmaps to observe the general trends of clusters from this heuristic method, and identify those clusters which have an increasing trend (the clusters are plotted top to bottom in the heatmap and demarcated by black lines).
4.2.2 Gaussian Mixture Models
As seen in the K-Means approach, it is futile to do GO analysis on the complete datasets (especially for the noisy RNA-Seq data, where we will most likely observe no significant trends by merely tracking overall changes in gene expression).
To test this, we still apply GMMs on the complete TE data (which is most likely to show certain trends due to genes being translated). As expected, there are no definitive results: the GO is unclear, and the BIC keeps increasing with $K$ for all covariance settings, offering no clear peak. The plots are in Appendix A, Section A.2.
We then apply GMMs on the feature-engineered data from Section 4.1. We use the full covariance matrix and perform BIC analysis over a range of $K$.
For the filtered TE data, we conclude from Figure 18 that the ideal number of clusters is 2. We plot the heatmap and the general foldChange graph (in Section A.2.2) for this clustering to check the general trend as per the heatmap (clusters are demarcated by the green line).
We see in Figure 19 that Cluster 1 has an increasing trend. The GO analysis of this cluster is similar to that obtained through KMeans  we again observe terms related to GTPases.
We still perform GMM-based clustering for $K \in \{4, 5, 6\}$ to evaluate GMM’s performance against the benchmark algorithm in Section 4.3.
4.2.3 HMM-based Clustering
We speculated that HMM-based clustering might perform better than the other algorithms on the complete, unfiltered datasets, as HMMs inherently take time dependencies into account.
We fix a small number of hidden states for simplicity, and apply HMM-based clustering on all the datasets. The plots are in Section A.4.1, while the results are summarised in Table 3.
Data  BIC

RNA-Seq (5 points)  13*
RPF (5 points)  8*
TE (5 points)  16*
*For BIC, if there is no clear peak, we take the first local maximum as $K$.
Again, we observe no clear peaks. The GO results for the local peaks are inconclusive, although some of the terms in the GO for RPF and TE do include GTPases and related terms; however, there are no other clear trends.
We can deduce from BIC’s inability to determine an optimal $K$ that we may be estimating too many parameters and unnecessarily complicating the models (Occam’s razor).
We proceed to the next steps of the analysis. We apply HMM-based clustering on the feature-engineered data for $K \in \{4, 5\}$ (we skip $K = 6$ as HMMs take the longest time to train) and generate the heatmaps in Figure 20. We then check the heatmaps for upregulated clusters, and perform GO analysis on Clusters 2 and 1 respectively from Figure 20.
The trend graphs of each cluster are given in Section A.4.2.
The GO results from HMMs are comparable to those from GMMs and KMeans, and we again observe GTPases in the gene ontology of the upregulated clusters.
4.2.4 Bayesian Networks
A limitation of Bayesian networks is that the current implementation is only applicable to discrete random variables (or discrete features).
Although there is some literature that deals with hybrid Bayesian Networks that model data containing both discrete and continuous attributes [cobb2007bayesian], data with purely continuous variables are difficult to model using Bayesian Networks.
Thus, as a first step, we discretise our gene expression data [pham2009unsupervised].
According to [pham2009unsupervised], there is no conclusive way of determining $K$ for CL-Multinet-based clustering, so we skip AIC/BIC analysis here. Instead, we perform clustering over a range of $K$ to determine the structure of the underlying Bayesian network for the complete datasets.
The heatmap results for the TE dataset showed some increasing trends (covered in Appendix A, Section A.3.1); the GO analysis for the complete TE dataset included GTPases in greater abundance than HMM-based clustering.
Moreover, the underlying structure of each tree (determined by the clustering, for different $K$) unfailingly resulted in the same line graph (Figure 21), with the timepoints in order (timepoint 1, followed by timepoint 2, and so on). This, along with the GO results, shows that Bayesian networks cluster even noisy data with a certain level of accuracy.
We then apply Bayesian networks to the feature-engineered TE dataset. The expression values range from $-2$ to $4$, so we choose 100 equally spaced bins between $-2$ and $4$, and then digitise each observation according to the bin it belongs to.
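Concretely, the binning step can be sketched with numpy as follows (the data matrix is a placeholder):

```python
import numpy as np

X = np.random.default_rng(0).uniform(-2, 4, size=(1000, 4))  # placeholder fold-change values

bins = np.linspace(-2, 4, 100)     # 100 equally spaced bin edges on [-2, 4]
X_disc = np.digitize(X, bins)      # index of the bin each value falls into
```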
The heatmaps for 6 clusters are given in Figure 22. The blue lines demarcate the clusters.
Although the heatmaps are not as uniform and do not display trends as concrete as those from K-Means clustering, the GO results are reliable and contain more GTPase-related terms, along with some new terms related to viral translation (e.g. G2/M transition of the mitotic cell cycle [davy2007g2]).
4.3 Performance Evaluation
We compare algorithms against a benchmark for the feature engineered TE dataset (with the foldChange features).
Bayesian networks consider the inherent timedependency between attributes, resulting in a graph with the timepoints in correct sequence each time.
The GO results are also significant, as they capture genes that are upregulated yet distinct from each other despite having similar expression values (unlike K-Means, which clusters according to the Euclidean distance between absolute attribute values).
We pick Bayesian networks as our benchmark algorithm, and treat the results obtained from it as the ground truth.
We use the ME distance metric to compare the other techniques against Bayesian networks for $K \in \{4, 5, 6\}$. For the ME distance, the closer it is to 0, the better the clustering. We calculate the accuracy as $100 \times (1 - d_{\text{ME}})\,\%$.
Accuracy* for $K$ Clusters

Algorithm  4  5  6

K-Means  56.52%  48.35%  41.61%
GMMs  58.40%  53.11%  50.31%
HMMs  68.72%  79.45%  –

*compared to Bayesian networks
4.4 Biological Results
Existing literature on picornaviruses (of which the EVA71 virus is a type) suggests that viral proteins interact with certain GTPases and disrupt normal cellular functions (like the nuclear transport of other viral and nuclear proteins) [porter2006picornavirus].
Looking more closely at the collection of heatmaps obtained by all the algorithms, we identify all those clusters that are upregulated and contain terms containing/related to GTPases in their GO analysis.
To be more rigorous about the terms we observe, we also tighten the filter to $[-0.3, 0.3]$ and check the GO analysis of the upregulated clusters. For each cluster, we only consider enriched GO terms (those with low p-values, as mentioned in Section 2.3.4).
Throughout our (repeated) experiments, we find that terms related to GTPases from the Biological Process (BP) and Molecular Function (MF) categories keep reappearing. The terms of interest are:

Positive regulation of GTPase activity

Activation of GTPase activity

Regulation of small GTPase mediated signal transduction

Positive regulation of transcription, DNA templated

Signal transduction
Through the principle of guilt-by-association mentioned in Section 2.1, we also identify other enriched terms that are of interest:

Viral Process

DNA repair

Negative regulation of transcription from RNA pol II promoter

Establishment of cell polarity

Smoothened signaling pathway

Positive regulation of dendrite development

G2/M transition of the mitotic cell cycle
Some GO terms (viral process, regulation of small GTPase mediated signal transduction, DNA repair) are observed across multiple clusters in the HMMs and Bayesian Networks analysis. This suggests interactions between the clusters too.
From the GO analysis, we find plenty of evidence from our datasets to support our hypothesis that a certain class/member of GTPase is important for establishing this infection.
5 Conclusion
This thesis explored the application of unsupervised learning techniques to viral genomic data, more specifically human cells infected with the EVA71 virus (which causes HFMD). We attempted to determine host factors important to the propagation of this virus, and identified several key biological terms based on our clustering results.
5.1 Summary
As described in Section 1.3.1, we collected RNA-Seq and RPF gene expression data at different timepoints. In Figure 2, we saw that the space occupied by the virus increases with time. Initial analysis of the complete datasets (RNA-Seq, RPF, TE) was consistent with this trend. Next, we attempted to determine groups of genes crucial to viral translation. Initial clustering results contained some relevant GO terms (e.g. viral process, DNA repair) that appeared in multiple clusters; however, the trend graphs and heatmaps proved inconclusive, suggesting that our datasets were too noisy, containing many points insignificant to our analysis.
We then focused on those genes which were highly translated at the RPF level but changed little at the RNA-Seq level. Collating data from all our GO analyses, we identified the following terms: Positive regulation of GTPase activity, Activation of GTPase activity, Regulation of small GTPase mediated signal transduction, Positive regulation of transcription (DNA-templated), and Signal transduction.
These terms imply that some proteins in the GTPase class are important to viral translation, possibly increasing the production of viral genes at key phases.
Following up on these results, we conduct more extensive literature reviews to verify the link between small GTPases and picornavirus infection.
5.1.1 Biological Experiments
We also design experiments to verify whether GTPases are relevant to the EVA71 virus.
The first experiment tracks changes in GTPase activity throughout infection, focusing on the Rho GTPases. For GTPase protein-specific activity, there are also targeted studies to check for the specificity of GAPs/GEFs, which are related to the activation of GTPases.
The second experiment inhibits specific GTPases to check their effect on infected cells. 3 kinds of drugs targeting different GTPases are used:

Specific Rho drugs, e.g. Rhosin or C3 transferase, which act against RhoA/B/C but not Cdc42/Rac1

Specific Cdc42 drugs, e.g. Casin or ML141

Specific Rho kinase inhibitors, e.g. the Rhodblock series
Thus, the appearance of terms related to viral process in the GO analyses for all methods suggests that our clustering was useful. The abundance of GTPase-related terms also supports our hypothesis that they are indeed vital to the propagation of the virus inside the cell.
6 Future Work
6.1 Algorithms
6.1.1 Tuning
We can improve the performance of our current model-based clustering techniques by incorporating additional data or adjusting the hyperparameters of the prior distributions we use for initialisation. We can consider discretising the data for Hidden Markov Model based clustering [zhao2010hmm]; conversely, we can consider continuous variables for Bayesian networks [hofmann1996discovering].
6.1.2 Different Algorithms
We can also consider self-organising maps (SOMs) for clustering gene expression data [tamayo1999interpreting]. SOMs impose partial structure on the clusters (in contrast to the rigid structure of hierarchical clustering, the strong prior hypotheses used in Bayesian clustering, and the non-structure of K-Means clustering) and facilitate easy visualisation and interpretation. SOMs also have good computational properties: they are easy to implement, reasonably fast, and scalable to large datasets. Finally, we can consider soft clustering, or fuzzy clustering, where one gene may belong to more than one cluster; existing fuzzy-clustering techniques can be modified for short time-series gene expression data [moller2003fuzzy].
Appendix A Appendix
A.1 K-Means
A.1.1 BIC Plots
A.2 Gaussian Mixture Models
A.2.1 BIC Plots
A.2.2 Trend Plots
A.3 Bayesian Networks
A.3.1 Heatmaps for Complete Datasets
The clusters are demarcated by red lines.
A.4 HMM-based Clustering
A.4.1 BIC Plots
A.4.2 Trend Plots
List of Figures
 1 Central dogma of Molecular Biology
 2 EVA71 (chrE) infection over time*
 3 EVA71 life cycle
 4 Mixture of 3 Gaussians in one dimension
 5 Different covariance settings for the 2D Iris dataset
 6 Firstorder Markov Chain
 7 Hidden Markov Model
 8 Hierarchical Clustering Dendrogram for Single-Linkage
 9 Elbow method for kmeans clustering
 10 Sample GO for a cluster
 11 RNASeq Data
 12 TE Data
 13 Distribution of values of RNASeq data
 (a) 0 h.p.i
 (b) 2 h.p.i
 (c) 4 h.p.i
 (d) 6 h.p.i
 (e) 8 h.p.i
 14 Distribution of values of RPF data
 (a) 0 h.p.i
 (b) 2 h.p.i
 (c) 4 h.p.i
 (d) 6 h.p.i
 (e) 8 h.p.i
 15 Heatmap with Complete Linkage Hierarchical Clustering for Filtered Data
 16 Cluster Analysis for RNASeq
 (a) Elbow Method
 (b) BIC Curve
 17 Heatmaps for KMeans clustering on TE data (top to bottom)
 (a) 4 clusters
 (b) 5 clusters
 (c) 6 clusters
 18 BIC for full covariance matrix, TE data
 19 Heatmap for 2 clusters, GMM clustering on TE data
 20 HMM Heatmap results on featureengineered data
 (a) Heatmap for 4 clusters
 (b) Heatmap for 5 clusters
 21 Tree Structure
 22 CL Multinets Heatmap results on featureengineered data
 (a) Heatmap for 4 clusters
 (b) Heatmap for 5 clusters
 (c) Heatmap for 6 clusters
 23 Heatmap from KMeans, RNASeq filter [0.5, 0.5]
 24 Heatmap from KMeans, RNASeq filter [0.3, 0.3]
 25 Cluster Analysis for RPF
 (a) Elbow Method
 (b) BIC Curve
 26 Cluster Analysis for TE
 (a) Elbow Method
 (b) BIC Curve
 27 BIC for different covariance settings in GMMs, TE
 (a) Spherical
 (b) Diagonal
 (c) Tied
 (d) Full
 28 Trend plots for foldChange values, 2 clusters, GMM
 (a) Cluster 1
 (b) Cluster 2
 29 Heatmaps for foldChange TE value
 (a) 4 clusters
 (b) 5 clusters
 (c) 6 clusters
 30 BIC results for HMMbased clustering
 (a) RNASeq data
 (b) RPF Data
 (c) TE Data
 31 Trend plots for foldChange values, 4 clusters, HMMbased clustering
 (a) Cluster 1
 (b) Cluster 2
 (c) Cluster 3
 (d) Cluster 4
 32 Trend plots for foldChange values, 5 clusters, HMMbased clustering
 (a) Cluster 1
 (b) Cluster 2
 (c) Cluster 3
 (d) Cluster 4
 (e) Cluster 5